Building an Intelligent Document Extraction System with Vision-Language Models
A deep dive into architecting a production-grade document extraction system that combines traditional OCR with Vision-Language Models to achieve 90%+ accuracy across diverse document formats.
Extracting structured data from documents is one of those problems that seems simple until you try to solve it at scale. After building a production document extraction system that handles invoices, contracts, forms, and reports with 90%+ accuracy, I want to share the architectural decisions and technical insights that made it work.
The Problem
Organizations deal with thousands of documents daily: invoices, contracts, purchase orders, and receipts, each with different layouts, languages, and quality levels. Traditional OCR falls short when dealing with:
- Complex layouts (multi-column invoices, nested tables)
- Poor quality scans (faded text, skewed pages)
- Mixed content (text, tables, signatures, stamps)
- Format diversity (PDF, DOCX, images, spreadsheets)
The goal was to build a system that extracts structured JSON from any document type with high accuracy and minimal manual intervention.
Architecture Overview
The system uses a hybrid approach combining traditional document processing with Vision-Language Models:
```text
Input Document
  |
Type Detection (8 formats supported)
  |
Adapter Selection (Strategy Pattern)
  |
Primary Processing (Docling OCR)
  |
Quality Assessment
  |--- High Quality ---> Return Results
  |
  |--- Low Quality ---> VLM Fallback
                          |
                        Image Conversion
                          |
                        VLM API Call (GPT-4V / Claude / Gemini / DeepSeek VL)
                          |
                        Structured Extraction
                          |
                        Return Results
```
Key Technical Decisions
1. Adapter Pattern for Format Diversity
Supporting 8 document formats (PDF, DOCX, PPTX, XLSX, CSV, Images, HTML, Text) required a flexible architecture. Each format has a dedicated adapter implementing a common interface:
```python
class AbstractAdapter:
    def parse(self, file_path: str, **kwargs) -> dict:
        raise NotImplementedError

class PDFAdapter(AbstractAdapter):
    def parse(self, file_path, **kwargs):
        # PDF-specific logic: OCR, page extraction, table detection
        ...

class ImageAdapter(AbstractAdapter):
    def parse(self, file_path, **kwargs):
        # Preprocessing: deskew, denoise, enhance
        # Then OCR
        ...
```
This made adding new formats trivial and kept the core extraction logic clean.
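The selection step itself can be sketched as a registry keyed by file extension. This is an illustrative implementation, not the production code: the stub adapters are repeated here so the snippet is self-contained, and `ADAPTERS` / `get_adapter` are hypothetical names.

```python
import os

class AbstractAdapter:
    def parse(self, file_path: str, **kwargs) -> dict:
        raise NotImplementedError

class PDFAdapter(AbstractAdapter):
    def parse(self, file_path, **kwargs):
        return {"text": f"parsed {file_path}", "format": "pdf"}  # stub

class ImageAdapter(AbstractAdapter):
    def parse(self, file_path, **kwargs):
        return {"text": f"parsed {file_path}", "format": "image"}  # stub

# Registry mapping extensions to adapter classes.
# Supporting a new format is one new entry, nothing else changes.
ADAPTERS = {
    ".pdf": PDFAdapter,
    ".png": ImageAdapter,
    ".jpg": ImageAdapter,
}

def get_adapter(file_path: str) -> AbstractAdapter:
    ext = os.path.splitext(file_path)[1].lower()
    try:
        return ADAPTERS[ext]()
    except KeyError:
        raise ValueError(f"Unsupported format: {ext}")
```

Because callers only see `AbstractAdapter.parse`, the core extraction logic never branches on file type.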
2. Intelligent Quality-Based Routing
Not every document needs expensive VLM processing. The system assesses OCR quality in real-time:
```python
def should_fallback_to_vlm(ocr_result: dict) -> bool:
    text_length = len(ocr_result.get("text", ""))
    confidence = ocr_result.get("confidence", 0)
    # Low text extraction or poor confidence triggers VLM
    if text_length < 100:
        return True
    if confidence < 0.6:
        return True
    return False
```
This hybrid approach cuts API costs by 70% while maintaining accuracy on difficult documents.
3. Multi-Provider VLM Support
Relying on a single AI provider is risky. The system supports five VLM providers with automatic failover:
| Provider | Strength | Use Case |
|---|---|---|
| OpenRouter | Multi-model routing | Primary (recommended) |
| GPT-4 Vision | Complex visual understanding | Diagrams, handwriting |
| Claude Vision | JSON consistency | Structured extraction |
| Gemini Vision | Speed, multilingual | High-volume processing |
| DeepSeek VL | Local GPU inference | Self-hosted, privacy-sensitive |
Provider selection happens automatically based on environment configuration:
```python
import os

def detect_provider():
    if os.getenv("OPENROUTER_API_KEY"):
        return OpenRouterProvider()
    elif os.getenv("OPENAI_API_KEY"):
        return OpenAIProvider()
    elif os.getenv("ANTHROPIC_API_KEY"):
        return AnthropicProvider()
    elif os.getenv("GOOGLE_API_KEY"):
        return GeminiProvider()
    elif os.getenv("DEEPSEEK_LOCAL"):
        return DeepSeekVLProvider()
    raise NoProviderError("No VLM provider configured")
```
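Detection picks the first provider; the failover path it feeds into can be sketched like this. The error classes, `extract_with_failover`, and the `provider.extract` interface are illustrative names, not the production API:

```python
class ProviderError(Exception):
    """A single VLM provider call failed (timeout, 5xx, rate limit)."""

class NoProviderError(Exception):
    """Every configured provider failed."""

def extract_with_failover(image, providers):
    """Try providers in priority order, returning the first success."""
    failures = []
    for provider in providers:
        try:
            return provider.extract(image)
        except ProviderError as exc:
            # Record the failure and fall through to the next provider.
            failures.append((type(provider).__name__, str(exc)))
    raise NoProviderError(f"All providers failed: {failures}")
```

Keeping the failure list in the final exception makes outages debuggable: the log shows which providers were tried and why each was skipped.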
4. The Universal Extraction Prompt
The prompt is the core differentiator. After extensive testing, a 4-phase extraction methodology emerged:
Phase 1: Document Classification. Identify the document type (invoice, contract, form, report, receipt).
Phase 2: Comprehensive Extraction. Extract into 20+ categories:
- Identifiers (document numbers, barcodes)
- Dates (document date, due date, effective date)
- Parties (vendor, customer with full contact info)
- Financial data (line items, totals, taxes)
- Tables (with structure preservation)
- Signatures and stamps
- Full text (verbatim)
Phase 3: Extraction Rules
- Extract every visible element
- Use `null` for missing fields, not guesses
- Preserve exact numbers and formatting
- Flag low-confidence extractions
Phase 4: Contextual Intelligence. Apply document-type-specific rules:
- Invoices: Focus on line items and payment terms
- Contracts: Emphasize clauses and obligations
- Forms: Capture every field-value pair
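The four phases can be encoded as a versioned prompt template. The wording below is a hypothetical condensation, not the production prompt, and `EXTRACTION_PROMPT_V1` / `build_prompt` are illustrative names:

```python
EXTRACTION_PROMPT_V1 = """\
Phase 1 - Classify the document type (invoice, contract, form, report, receipt).
Phase 2 - Extract identifiers, dates, parties, financial data, tables
(preserving structure), signatures/stamps, and the full verbatim text.
Phase 3 - Rules: extract every visible element; use null for missing fields,
never guess; preserve exact numbers and formatting; flag low-confidence values.
Phase 4 - Apply {doc_type}-specific rules (e.g. line items for invoices).
Return one JSON object.
"""

def build_prompt(doc_type: str = "unknown") -> str:
    # Versioned like code: changes bump EXTRACTION_PROMPT_V1 to V2.
    return EXTRACTION_PROMPT_V1.format(doc_type=doc_type)
```

Treating the template as a constant with a version suffix is what makes it testable and diffable like any other critical code.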
5. Image Preprocessing Pipeline
Poor quality images are the #1 cause of extraction failures. A preprocessing pipeline handles common issues:
```python
import cv2
import numpy as np
from PIL import Image, ImageEnhance

def preprocess_image(image: Image.Image) -> Image.Image:
    # Deskew (fix rotation)
    angle = detect_skew(image)
    if abs(angle) > 0.5:
        image = image.rotate(angle, expand=True, fillcolor="white")
    # Enhance contrast
    image = ImageEnhance.Contrast(image).enhance(1.3)
    # Sharpen
    image = ImageEnhance.Sharpness(image).enhance(1.2)
    # Denoise (OpenCV operates on numpy arrays, so convert and back)
    denoised = cv2.fastNlMeansDenoising(np.array(image.convert("L")))
    return Image.fromarray(denoised)
```
This preprocessing step alone improved accuracy by 15% on scanned documents.
API Design
The system exposes a simple 3-parameter API:
```text
POST /extract
  file: <document>
  document_mode: "auto" | "structured" | "unstructured"
  prefer_vlm: boolean
```
Response schema:
```json
{
  "text": "Full extracted text...",
  "tables": [
    {
      "data": [["Header1", "Header2"], ["Value1", "Value2"]],
      "page_number": 1
    }
  ],
  "metadata": {
    "file_name": "invoice.pdf",
    "file_type": "pdf",
    "num_pages": 2,
    "extraction_timestamp": "2025-01-15T10:30:00Z"
  }
}
```
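Calling the endpoint from Python is a two-liner with `requests`; the URL and field names come from the spec above, while the helper names and base URL are hypothetical:

```python
VALID_MODES = {"auto", "structured", "unstructured"}

def build_extract_request(document_mode: str = "auto",
                          prefer_vlm: bool = False,
                          base_url: str = "http://localhost:8000") -> dict:
    """Validate parameters and build the keyword pieces for the POST call."""
    if document_mode not in VALID_MODES:
        raise ValueError(f"document_mode must be one of {sorted(VALID_MODES)}")
    return {
        "url": f"{base_url}/extract",
        "data": {"document_mode": document_mode,
                 "prefer_vlm": str(prefer_vlm).lower()},
    }

def extract_document(path: str, **kwargs) -> dict:
    import requests  # deferred so the builder above has no hard dependency
    req = build_extract_request(**kwargs)
    with open(path, "rb") as f:
        resp = requests.post(req["url"], files={"file": f}, data=req["data"])
    resp.raise_for_status()
    return resp.json()
```

Validating `document_mode` client-side gives a clear error before any bytes are uploaded.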
Document Mode Routing
Three processing modes handle different document types:
- structured: Forms, invoices, simple documents. Uses traditional OCR.
- unstructured: Complex reports, flowcharts. Uses DeepSeek VL with complexity analysis.
- auto: Analyzes document and routes automatically.
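The three modes reduce to a small routing function. This is a minimal sketch with hypothetical names; in the real system "auto" mode would consume the complexity analysis described later:

```python
from enum import Enum

class Mode(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"
    AUTO = "auto"

def route(mode: str, is_complex: bool) -> str:
    """Pick an extraction backend for the requested document_mode."""
    mode = Mode(mode)  # Raises ValueError on unknown modes.
    if mode is Mode.STRUCTURED:
        return "ocr"
    if mode is Mode.UNSTRUCTURED:
        return "deepseek_vl"
    # AUTO: let the complexity score decide.
    return "deepseek_vl" if is_complex else "ocr"
```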
Performance Optimizations
GPU Acceleration
For high-volume processing, vLLM provides efficient model serving:
```python
from vllm import LLM

# vLLM configuration for DeepSeek VL
llm = LLM(
    model="deepseek-ai/deepseek-vl-7b-chat",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)
```
Batch Processing
Documents are batched for throughput:
```python
from typing import List

BATCH_SIZE = 16  # Configurable based on GPU memory

async def process_batch(documents: List[Document]):
    # Convert all to images
    images = [doc_to_image(doc) for doc in documents]
    # Single VLM call with the whole batch
    results = await vlm.extract_batch(images)
    return results
```
Complexity Analysis
Before processing, the system analyzes document complexity to choose the optimal extraction strategy:
```python
import cv2
import numpy as np

def analyze_complexity(image: np.ndarray) -> ComplexityLevel:
    # Edge detection for structure
    edges = cv2.Canny(image, 50, 150)
    # Contour detection for diagrams (findContours returns contours + hierarchy)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Table grid detection
    horizontal_lines = detect_lines(image, "horizontal")
    vertical_lines = detect_lines(image, "vertical")
    has_tables = len(horizontal_lines) > 2 and len(vertical_lines) > 2
    # Score complexity
    if len(contours) > 100 or has_tables:
        return ComplexityLevel.COMPLEX
    return ComplexityLevel.SIMPLE
```
Simple documents get fast OCR processing. Complex documents with tables, diagrams, or dense layouts are routed to DeepSeek VL for better accuracy.
Results
The system achieves:
| Metric | Value |
|---|---|
| Extraction Accuracy | 90%+ |
| Processing Time (single page) | < 3 seconds |
| Supported Formats | 8 |
| VLM Providers | 5 (with failover) |
| Test Coverage | 272 test cases |
Tech Stack
| Layer | Technology |
|---|---|
| API Framework | FastAPI with async/await |
| Document Processing | Docling, PyMuPDF, python-docx |
| OCR | Tesseract, DeepSeek VL |
| VLM | OpenRouter, GPT-4V, Claude, Gemini |
| Image Processing | OpenCV, Pillow, scipy |
| Deployment | Docker with GPU support |
Lessons Learned
1. Hybrid beats pure AI: Traditional OCR for 80% of documents, VLM for the hard 20%. Cost-effective and accurate.
2. Preprocessing matters: Deskewing and denoising improve accuracy more than model upgrades.
3. Prompts are code: The extraction prompt is versioned, tested, and optimized like any critical code.
4. Graceful degradation: When VLM fails, fall back to OCR results. Something is better than nothing.
5. Multi-provider resilience: API outages happen. Support multiple providers with automatic failover.
Conclusion
Building a production document extraction system requires balancing accuracy, cost, and reliability. The hybrid approach of combining traditional OCR with Vision-Language Models provides the best of both worlds: speed and cost efficiency for standard documents, with AI-powered extraction for complex cases.
The key is intelligent routing. Not every document needs GPT-4 Vision. Quality assessment and automatic routing cut costs while maintaining accuracy where it matters.
This system was built for enterprise document processing, handling invoices, contracts, and forms across multiple industries.