Efficient smart OCR solution for banking document digitization

Publisher

BRAC University

Abstract

The digitization of multilingual banking documents, particularly those containing handwritten Bengali and English scripts, poses significant challenges due to variable handwriting styles, document noise, and domain-specific terminology. This study presents a hybrid pipeline, combining Optical Character Recognition (OCR) with language models, designed to achieve high-fidelity text extraction and correction for banking document digitization. The proposed system integrates two state-of-the-art OCR approaches (Tesseract and EasyOCR for robust extraction of unstructured raw text, and GPT-3.5 and LLaMA-2 for end-to-end handwritten text recognition) with advanced language models for post-processing. Bengali text correction is performed using Gemma-7B and BLOOM-7B, while English text is refined through GPT-3.5 and LLaMA-2 (7B-chat). The dataset, comprising paired images and annotations for both languages, undergoes preprocessing, including binarization, noise reduction, skew correction, and redundancy filtering, before model training and evaluation. Experimental results show substantial improvements in linguistic accuracy and semantic preservation compared to baseline OCR outputs, demonstrating the system’s applicability to real-world multilingual banking document digitization.

Description

Cataloged from PDF version of internship report.
Includes bibliographical references (page 48).
This internship report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2025.

Type

Internship Report