AI/ML, OCR
Completed

Invoice OCR & Data Extraction System

A Dockerized OCR-based API to extract key details from invoices in English and French, with preprocessing, intelligent correction, and self-learning capabilities.

Client: Confidential Client
Duration: 4 weeks
Team Size: 4 engineers
Completed: July 2025

Project Overview

A lightweight, Dockerized OCR system that extracts key invoice data from scanned PDFs and images. It is designed for dual-language support (English and French), intelligent correction via the Gemma LLM, and continuous self-learning.

The Challenge

We needed to ensure accurate data extraction across a variety of invoice layouts, languages, and file types without heavy infrastructure or retraining loops.

Our Solution

We implemented image preprocessing with PyTorch and text extraction with Docling OCR, then combined them with post-processing logic and a Gemma-based self-learning module. The result was delivered as a containerized FastAPI service.

Key Features

🌐

Dual-Language OCR

Supports both English and French invoices using Docling's multilingual capabilities and automatic language detection.

🧼

Preprocessing Pipeline

Enhances OCR quality through contour detection, contrast correction, and DPI normalization using PyTorch.
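As a rough illustration of two of these steps, the sketch below shows a linear contrast stretch and the scale factor needed to bring a scan to a target resolution. It uses plain Python lists instead of PyTorch tensors so that it runs without dependencies. The function names and the 300 DPI target are assumptions made for this example, not details of the delivered pipeline.

```python
def stretch_contrast(pixels, low=0, high=255):
    """Linearly rescale grayscale values so the darkest pixel maps to
    `low` and the brightest to `high` (hypothetical helper)."""
    flat = [p for row in pixels for p in row]
    darkest, brightest = min(flat), max(flat)
    if brightest == darkest:
        # Uniform image: there is no contrast to stretch.
        return [row[:] for row in pixels]
    scale = (high - low) / (brightest - darkest)
    return [[round(low + (p - darkest) * scale) for p in row] for row in pixels]


def dpi_scale_factor(source_dpi, target_dpi=300):
    """Resize factor that brings a scan to `target_dpi`.
    The 300 DPI default is an assumed target, not a project setting."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return target_dpi / source_dpi
```

With these definitions, a 150 DPI scan would be upscaled by a factor of 2.0 before OCR, and a washed-out image whose values span only 0 to 51 would be stretched across the full 0 to 255 range.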

🧠

Self-Learning Correction

Uses Gemma LLM to track corrections, build correction maps, and improve output over time without retraining.

Technical Challenges & Solutions

Variable Document Layouts

Invoices came in a variety of formats, languages, and image qualities.

Solution: Built layout-agnostic preprocessing and extraction logic using Docling, regular expressions, an LLM, and heuristics.
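The regex part of such an approach could look roughly like the sketch below, which searches OCR text for labelled fields regardless of where they sit on the page. The label patterns, field names, and the `extract_fields` helper are illustrative assumptions, not the project's actual rules, and real invoices would need a much broader pattern set.

```python
import re

# Hypothetical bilingual label patterns (English and French).
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"(?:invoice|facture)\s*(?:no\.?|number|n[°o]|num[ée]ro|#)"
        r"\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
        re.IGNORECASE,
    ),
    "date": re.compile(
        r"\bdate\b[^\d\n]{0,20}"
        r"(\d{1,2}[/.\-]\d{1,2}[/.\-]\d{2,4}|\d{4}-\d{2}-\d{2})",
        re.IGNORECASE,
    ),
    "total": re.compile(
        r"\b(?:total\s+ttc|montant\s+total|total\s+due|total)"
        r"\s*[:\-]?\s*(?:[€$£]|EUR|USD)?\s*(\d[\d \u00a0.,]*\d)",
        re.IGNORECASE,
    ),
}


def extract_fields(text):
    """Return the first match for each field, or None when a label is absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields
```

Because each pattern keys on a label rather than a position, the same function handles an English invoice ("Invoice No: INV-2025-001") and a French one ("Facture n° FA-0042"). Fields that the patterns miss come back as None, which is where an LLM or heuristic fallback would take over.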

Low OCR Confidence Scores

OCR struggled with uncommon fonts, low DPI, and multilingual content.

Solution: Implemented fallback logic with language detection, spelling correction, and field-level rule checks.
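A minimal sketch of what a language-detection fallback and field-level rule checks might look like is given below. The stopword heuristic stands in for whatever detection the real system used, and the word lists, helper names, and the day/month/year date format are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Tiny illustrative word lists; a production system would use a real detector.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "invoice", "due", "amount", "payment"},
    "fr": {"le", "la", "et", "de", "du", "facture", "montant", "paiement"},
}


def detect_language(text, default="en"):
    """Pick the language whose stopwords appear most often; fall back to default."""
    words = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", text.lower())
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def parse_amount(raw, lang):
    """Normalize '1 234,50 €' (fr) or '$1,234.50' (en) to a float, else None."""
    cleaned = re.sub(r"[^\d.,]", "", raw)
    if lang == "fr":
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def check_fields(fields, lang):
    """Return the names of fields that fail their rule check."""
    failed = []
    if parse_amount(fields.get("total") or "", lang) is None:
        failed.append("total")
    try:
        # Assumed format: day/month/year.
        datetime.strptime(fields.get("date") or "", "%d/%m/%Y")
    except ValueError:
        failed.append("date")
    return failed
```

Fields listed by `check_fields` are the ones a fallback path would re-process, for example by re-running OCR with a different language setting or passing the value to a correction step.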

No Retraining Infrastructure

The client needed learning capabilities but lacked the infrastructure to retrain models.

Solution: Delivered a correction module based on rule accumulation via user feedback loops.
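One way to picture rule accumulation without retraining is a correction map that counts user fixes and only promotes a fix to a rule once it has been seen often enough. The sketch below is an assumption about how such a module could work. The `CorrectionMap` class, its method names, and the `min_votes` threshold are hypothetical, and the LLM's role is left out.

```python
from collections import Counter, defaultdict


class CorrectionMap:
    """Accumulates user corrections per (field, OCR value) pair."""

    def __init__(self, min_votes=2):
        # A correction becomes an active rule after `min_votes` confirmations.
        self.min_votes = min_votes
        self._votes = defaultdict(Counter)

    def record(self, field, ocr_value, corrected_value):
        """Store one piece of user feedback; identical values carry no signal."""
        if ocr_value != corrected_value:
            self._votes[(field, ocr_value)][corrected_value] += 1

    def apply(self, field, ocr_value):
        """Return the most-confirmed correction, or the OCR value unchanged."""
        votes = self._votes.get((field, ocr_value))
        if not votes:
            return ocr_value
        corrected, count = votes.most_common(1)[0]
        return corrected if count >= self.min_votes else ocr_value
```

For example, after two users correct the vendor name "Acrne Corp" to "Acme Corp", `apply("vendor", "Acrne Corp")` starts returning the corrected value, while a single one-off correction is not yet trusted. The threshold trades learning speed against the risk of promoting a mistaken correction.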

Project Timeline

1. Pipeline Implementation & Demo (1 week)

Built and demoed the complete preprocessing and OCR pipeline.

Initial OCR pipeline
Dual-language detection
Field extraction logic

2. Final Delivery (1 week)

Finalized the containerized deployment with intelligent corrections.

Production-ready Docker image
User documentation
Feedback-based refinement

3. Self-Learning Module (2 weeks)

Integrated a correction-feedback loop for long-term accuracy improvement.

Correction map module
Feedback ingestion API
Low-maintenance adaptive logic

Key Results

Accuracy Improvement: Improved
Accuracy of field extraction improved over time via the correction module.

Deployment Time: Rapid
Rapid delivery, including intelligent corrections and self-learning.

Portability: Portable
Fully containerized for use across any cloud or on-prem environment.

Technologies Used

FastAPI
Python
Docker
Docling OCR
Gemma LLM
PyTorch
Regex

Before vs After

Extraction Accuracy
Before: Lower
After: Higher

Deployment Portability
Before: Manual setup required
After: Containerized

Processing Speed
Before: Slower
After: Faster

Client Testimonial

"The Entropik team delivered exactly what we needed, a highly adaptable and accurate OCR pipeline with smart correction capabilities. It integrates seamlessly into our stack while consuming minimum resources and cost."
Anonymous Client
Tech Lead, Confidential Company
