AI/ML, OCR

Completed

Invoice OCR & Data Extraction System

A Dockerized OCR-based API to extract key details from invoices in English and French, with preprocessing, intelligent correction, and self-learning capabilities.

Client

Confidential Client

Duration

4 weeks

Team Size

4 engineers

Completed

July 2025

Project Overview

A lightweight, Dockerized OCR system that extracts key invoice data from scanned PDFs and images. It supports English and French invoices, applies intelligent correction via the Gemma LLM, and keeps improving through a self-learning feedback loop.

The Challenge

We needed to ensure accurate data extraction across a variety of invoice layouts, languages, and file types without heavy infrastructure or retraining loops.

Our Solution

We implemented image preprocessing with PyTorch and OCR with Docling, combined with post-processing logic and a Gemma-based self-learning module, and delivered the result as a containerized FastAPI service.

Key Features

🌐

Dual-Language OCR

Supports both English and French invoices using Docling's multilingual capabilities and automatic language detection.

🧼

Preprocessing Pipeline

Enhances OCR quality through contouring, contrast correction, and DPI normalization using PyTorch.
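As a rough illustration of two of these steps, the sketch below shows a linear contrast stretch and the resize factor needed for DPI normalization. It uses plain Python lists rather than PyTorch tensors to stay dependency-free; the function names and the 300 DPI target are assumptions made for this example, not details of the delivered pipeline.

```python
from typing import List

# A grayscale image as rows of pixel intensities in the range 0-255.
Image = List[List[int]]


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale pixels so the darkest becomes 0 and the brightest 255."""
    flat = [pixel for row in image for pixel in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in image]
    scale = 255 / (hi - lo)
    return [[round((pixel - lo) * scale) for pixel in row] for row in image]


def dpi_scale_factor(source_dpi: int, target_dpi: int = 300) -> float:
    """Return the resize factor that brings a scan to the target DPI.

    The 300 DPI default is an assumed, commonly used OCR resolution.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return target_dpi / source_dpi
```

A scan captured at 150 DPI would be upscaled by a factor of 2.0 under this assumption before being handed to the OCR stage.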

🧠

Self-Learning Correction

Uses Gemma LLM to track corrections, build correction maps, and improve output over time without retraining.
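One way a correction map of this kind could work is sketched below. The `CorrectionMap` class, its method names, and the `min_votes` threshold are hypothetical and chosen for illustration; the Gemma LLM step is not shown. The idea is that a correction is only applied automatically once the same fix has been seen often enough, so no model retraining is involved.

```python
from collections import Counter, defaultdict


class CorrectionMap:
    """Accumulates user corrections and replays the well-supported ones."""

    def __init__(self, min_votes: int = 2):
        # Number of identical corrections required before a rule is trusted.
        self.min_votes = min_votes
        # (field, ocr_value) -> Counter of corrected values seen so far.
        self._votes = defaultdict(Counter)

    def record(self, field: str, ocr_value: str, corrected_value: str) -> None:
        """Store one user correction; identical values carry no information."""
        if ocr_value != corrected_value:
            self._votes[(field, ocr_value)][corrected_value] += 1

    def apply(self, field: str, ocr_value: str) -> str:
        """Return the learned correction if it has enough support."""
        votes = self._votes.get((field, ocr_value))
        if not votes:
            return ocr_value
        best, count = votes.most_common(1)[0]
        return best if count >= self.min_votes else ocr_value


corrections = CorrectionMap(min_votes=2)
corrections.record("vendor", "ACNE Corp", "ACME Corp")
corrections.record("vendor", "ACNE Corp", "ACME Corp")
fixed = corrections.apply("vendor", "ACNE Corp")  # "ACME Corp" after two votes
```

Requiring more than one vote keeps a single mistaken correction from being replayed on every later invoice.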

Technical Challenges & Solutions

Variable Document Layouts

Invoices came in a variety of formats, languages, and image qualities.

Solution: Built layout-agnostic preprocessing and extraction logic using Docling, regex, an LLM, and heuristics.
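The regex part of such an approach might look like the following minimal sketch, which searches OCR text for labelled values regardless of where they sit on the page. The field names, label variants, and patterns are assumptions for illustration only; the patterns used in the project are not documented here.

```python
import re
from typing import Dict, Optional

# Hypothetical label patterns covering common English and French wording.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"(?:invoice\s*(?:no\.?|number|#)|facture\s*(?:n[°o]\.?|numéro))"
        r"\s*:?\s*([A-Z0-9][A-Z0-9\-/]*)",
        re.IGNORECASE,
    ),
    "date": re.compile(
        r"(?:invoice\s+date|date\s+de\s+facture|date)\s*:?\s*"
        r"(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{4}-\d{2}-\d{2})",
        re.IGNORECASE,
    ),
    # The lookbehind stops "Subtotal" and "Sous-total" from matching as totals.
    "total": re.compile(
        r"(?<![\w-])(?:total\s+ttc|montant\s+total|total\s+due|total)"
        r"\s*:?\s*(?:[€$£]\s*)?(\d[\d .,]*\d|\d)",
        re.IGNORECASE,
    ),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None when nothing matches."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result
```

Values come back as raw strings; fields that no pattern finds stay `None`, which is where an LLM or heuristic fallback could take over.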

Low OCR Confidence Scores

OCR struggled with uncommon fonts, low DPI, and multilingual content.

Solution: Implemented fallback logic with language detection, spelling correction, and field-level rule checks.
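A field-level rule check could be sketched as follows: amounts are normalized from either French (`1 234,56`) or English (`1,234.56`) formatting, then the total is compared against subtotal plus tax. The field names, the 0.01 tolerance, and the issue messages are assumptions for this example rather than the project's actual rules.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Dict, List, Optional


def parse_amount(raw: str) -> Optional[Decimal]:
    """Normalize '1 234,56' or '1,234.56' style amounts to a Decimal."""
    cleaned = re.sub(r"[^\d.,]", "", raw)
    if not cleaned:
        return None
    last_sep = max(cleaned.rfind(","), cleaned.rfind("."))
    if last_sep != -1 and len(cleaned) - last_sep - 1 in (1, 2):
        # A rightmost separator followed by 1-2 digits is the decimal mark.
        integer_part = re.sub(r"[.,]", "", cleaned[:last_sep])
        cleaned = f"{integer_part or '0'}.{cleaned[last_sep + 1:]}"
    else:
        # Otherwise every separator is treated as a thousands separator.
        cleaned = re.sub(r"[.,]", "", cleaned)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def check_invoice(fields: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of rule violations; an empty list means all checks pass."""
    issues: List[str] = []
    amounts = {
        name: parse_amount(fields.get(name) or "")
        for name in ("subtotal", "tax", "total")
    }
    for name, value in amounts.items():
        if value is None:
            issues.append(f"{name}: missing or unreadable")
    if all(value is not None for value in amounts.values()):
        difference = amounts["subtotal"] + amounts["tax"] - amounts["total"]
        if abs(difference) > Decimal("0.01"):
            issues.append("total: does not equal subtotal + tax")
    return issues
```

Fields flagged by a check like this are natural candidates for the fallback path, for example a second OCR pass or a request for user correction.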

No Retraining Infrastructure

The client needed learning capabilities but lacked the infrastructure to retrain models.

Solution: Delivered a correction module based on rule accumulation via user feedback loops.

Project Timeline

Pipeline Implementation & Demo

1 week

Built and demoed complete preprocessing and OCR pipeline

Initial OCR pipeline

Dual-language detection

Field extraction logic

Self-Learning Module

2 weeks

Integrated correction-feedback loop for long-term accuracy improvement

Correction map module

Feedback ingestion API

Low-maintenance adaptive logic

Final Delivery

1 week

Finalized containerized deployment with intelligent corrections

Production-ready Docker image

User documentation

Feedback-based refinement

Key Results

Improved

Accuracy Improvement

Accuracy of field extraction improved over time via the correction module

Rapid

Deployment Time

Rapid delivery including intelligent corrections and self-learning

Portable

Portability

Fully containerized for use across any cloud/on-prem environment

Technologies Used

FastAPI

Python

Docker

Docling OCR

Gemma LLM

PyTorch

Regex

Before vs After

Extraction Accuracy

Lower → Higher

Deployment Portability

Manual setup required → Containerized

Processing Speed

Slower → Faster

Client Testimonial

"The Entropik team delivered exactly what we needed, a highly adaptable and accurate OCR pipeline with smart correction capabilities. It integrates seamlessly into our stack while consuming minimum resources and cost."

Anonymous Client

Tech Lead, Confidential Company
