Building an Intelligent Document Extraction System with Vision-Language Models
A deep dive into architecting a production-grade document extraction system that combines traditional OCR with Vision-Language Models to achieve 90%+ accuracy across diverse document formats.
Extracting structured data from documents is one of those problems that seems simple until you try to solve it at scale. After building a production document extraction system that handles invoices, contracts, forms, and reports with 90%+ accuracy, I want to share the architectural decisions and technical insights that made it work.
The Problem
Organizations deal with thousands of documents daily: invoices, contracts, purchase orders, and receipts, each with different layouts, languages, and quality levels. Traditional OCR falls short when dealing with:
- Complex layouts (multi-column invoices, nested tables)
- Poor quality scans (faded text, skewed pages)
- Mixed content (text, tables, signatures, stamps)
- Format diversity (PDF, DOCX, images, spreadsheets)
The goal was to build a system that extracts structured JSON from any document type with high accuracy and minimal manual intervention.
Architecture Overview
The system uses a hybrid approach combining traditional document processing with Vision-Language Models:
```text
Input Document
  |
Type Detection (8 formats supported)
  |
Adapter Selection (Strategy Pattern)
  |
Primary Processing (Docling OCR)
  |
Quality Assessment
  |--- High Quality ---> Return Results
  |
  |--- Low Quality ---> VLM Fallback
                          |
                        Image Conversion
                          |
                        VLM API Call (GPT-4V / Claude / Gemini / DeepSeek VL)
                          |
                        Structured Extraction
                          |
                        Return Results
```
Key Technical Decisions
1. Adapter Pattern for Format Diversity
Supporting 8 document formats (PDF, DOCX, PPTX, XLSX, CSV, Images, HTML, Text) required a flexible architecture. Each format has a dedicated adapter implementing a common interface:
```python
class AbstractAdapter:
    def parse(self, file_path: str, **kwargs) -> dict:
        raise NotImplementedError

class PDFAdapter(AbstractAdapter):
    def parse(self, file_path, **kwargs):
        # PDF-specific logic: OCR, page extraction, table detection
        ...

class ImageAdapter(AbstractAdapter):
    def parse(self, file_path, **kwargs):
        # Preprocessing: deskew, denoise, enhance
        # Then OCR
        ...
```
This made adding new formats trivial and kept the core extraction logic clean.
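The selection step itself can be sketched as a registry keyed by file extension. This is an illustrative implementation, not the production code: the stub adapters are repeated here so the snippet is self-contained, and `ADAPTERS` / `get_adapter` are hypothetical names.

```python
import os

class AbstractAdapter:
    def parse(self, file_path: str, **kwargs) -> dict:
        raise NotImplementedError

class PDFAdapter(AbstractAdapter):
    def parse(self, file_path, **kwargs):
        return {"text": f"parsed {file_path}", "format": "pdf"}  # stub

class ImageAdapter(AbstractAdapter):
    def parse(self, file_path, **kwargs):
        return {"text": f"parsed {file_path}", "format": "image"}  # stub

# Registry mapping extensions to adapter classes.
# Supporting a new format is one new entry, nothing else changes.
ADAPTERS = {
    ".pdf": PDFAdapter,
    ".png": ImageAdapter,
    ".jpg": ImageAdapter,
}

def get_adapter(file_path: str) -> AbstractAdapter:
    ext = os.path.splitext(file_path)[1].lower()
    try:
        return ADAPTERS[ext]()
    except KeyError:
        raise ValueError(f"Unsupported format: {ext}")
```

Because callers only see `AbstractAdapter.parse`, the core extraction logic never branches on file type.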
2. Intelligent Quality-Based Routing
Not every document needs expensive VLM processing. The system assesses OCR quality in real-time:
```python
def should_fallback_to_vlm(ocr_result: dict) -> bool:
    text_length = len(ocr_result.get("text", ""))
    confidence = ocr_result.get("confidence", 0)
    # Low text extraction or poor confidence triggers VLM
    if text_length < 100:
        return True
    if confidence < 0.6:
        return True
    return False
```
This hybrid approach cuts API costs by 70% while maintaining accuracy on difficult documents.
3. Multi-Provider VLM Support
Relying on a single AI provider is risky. The system supports five VLM providers with automatic failover:
| Provider | Strength | Use Case |
|---|---|---|
| OpenRouter | Multi-model routing | Primary (recommended) |
| GPT-4 Vision | Complex visual understanding | Diagrams, handwriting |
| Claude Vision | JSON consistency | Structured extraction |
| Gemini Vision | Speed, multilingual | High-volume processing |
| DeepSeek VL | Local GPU inference | Self-hosted, privacy-sensitive |
Provider selection happens automatically based on environment configuration:
```python
import os

def detect_provider():
    if os.getenv("OPENROUTER_API_KEY"):
        return OpenRouterProvider()
    elif os.getenv("OPENAI_API_KEY"):
        return OpenAIProvider()
    elif os.getenv("ANTHROPIC_API_KEY"):
        return AnthropicProvider()
    elif os.getenv("GOOGLE_API_KEY"):
        return GeminiProvider()
    elif os.getenv("DEEPSEEK_LOCAL"):
        return DeepSeekVLProvider()
    raise NoProviderError("No VLM provider configured")
```
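Detection picks the first provider; the failover path it feeds into can be sketched like this. The error classes, `extract_with_failover`, and the `provider.extract` interface are illustrative names, not the production API:

```python
class ProviderError(Exception):
    """A single VLM provider call failed (timeout, 5xx, rate limit)."""

class NoProviderError(Exception):
    """Every configured provider failed."""

def extract_with_failover(image, providers):
    """Try providers in priority order, returning the first success."""
    failures = []
    for provider in providers:
        try:
            return provider.extract(image)
        except ProviderError as exc:
            # Record the failure and fall through to the next provider.
            failures.append((type(provider).__name__, str(exc)))
    raise NoProviderError(f"All providers failed: {failures}")
```

Keeping the failure list in the final exception makes outages debuggable: the log shows which providers were tried and why each was skipped.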
4. The Universal Extraction Prompt
The prompt is the core differentiator. After extensive testing, a 4-phase extraction methodology emerged:
Phase 1: Document Classification. Identify the document type (invoice, contract, form, report, receipt).
Phase 2: Comprehensive Extraction. Extract into 20+ categories:
- Identifiers (document numbers, barcodes)
- Dates (document date, due date, effective date)
- Parties (vendor, customer with full contact info)
- Financial data (line items, totals, taxes)
- Tables (with structure preservation)
- Signatures and stamps
- Full text (verbatim)
Phase 3: Extraction Rules
- Extract every visible element
- Use `null` for missing fields, not guesses
- Preserve exact numbers and formatting
- Flag low-confidence extractions
Phase 4: Contextual Intelligence. Apply document-type-specific rules:
- Invoices: Focus on line items and payment terms
- Contracts: Emphasize clauses and obligations
- Forms: Capture every field-value pair
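The four phases can be encoded as a versioned prompt template. The wording below is a hypothetical condensation, not the production prompt, and `EXTRACTION_PROMPT_V1` / `build_prompt` are illustrative names:

```python
EXTRACTION_PROMPT_V1 = """\
Phase 1 - Classify the document type (invoice, contract, form, report, receipt).
Phase 2 - Extract identifiers, dates, parties, financial data, tables
(preserving structure), signatures/stamps, and the full verbatim text.
Phase 3 - Rules: extract every visible element; use null for missing fields,
never guess; preserve exact numbers and formatting; flag low-confidence values.
Phase 4 - Apply {doc_type}-specific rules (e.g. line items for invoices).
Return one JSON object.
"""

def build_prompt(doc_type: str = "unknown") -> str:
    # Versioned like code: changes bump EXTRACTION_PROMPT_V1 to V2.
    return EXTRACTION_PROMPT_V1.format(doc_type=doc_type)
```

Treating the template as a constant with a version suffix is what makes it testable and diffable like any other critical code.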
5. Image Preprocessing Pipeline
Poor quality images are the #1 cause of extraction failures. A preprocessing pipeline handles common issues:
```python
import cv2
import numpy as np
from PIL import Image, ImageEnhance

def preprocess_image(image: Image.Image) -> Image.Image:
    # Deskew (fix rotation)
    angle = detect_skew(image)
    if abs(angle) > 0.5:
        image = image.rotate(angle, expand=True, fillcolor="white")
    # Enhance contrast
    image = ImageEnhance.Contrast(image).enhance(1.3)
    # Sharpen
    image = ImageEnhance.Sharpness(image).enhance(1.2)
    # Denoise (OpenCV operates on numpy arrays, so convert and back)
    denoised = cv2.fastNlMeansDenoising(np.array(image.convert("L")))
    return Image.fromarray(denoised)
```
This preprocessing step alone improved accuracy by 15% on scanned documents.
API Design
The system exposes a simple 3-parameter API:
```text
POST /extract
  file: <document>
  document_mode: "auto" | "structured" | "unstructured"
  prefer_vlm: boolean
```
Response schema:
```json
{
  "text": "Full extracted text...",
  "tables": [
    {
      "data": [["Header1", "Header2"], ["Value1", "Value2"]],
      "page_number": 1
    }
  ],
  "metadata": {
    "file_name": "invoice.pdf",
    "file_type": "pdf",
    "num_pages": 2,
    "extraction_timestamp": "2025-01-15T10:30:00Z"
  }
}
```
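Calling the endpoint from Python is a two-liner with `requests`; the URL and field names come from the spec above, while the helper names and base URL are hypothetical:

```python
VALID_MODES = {"auto", "structured", "unstructured"}

def build_extract_request(document_mode: str = "auto",
                          prefer_vlm: bool = False,
                          base_url: str = "http://localhost:8000") -> dict:
    """Validate parameters and build the keyword pieces for the POST call."""
    if document_mode not in VALID_MODES:
        raise ValueError(f"document_mode must be one of {sorted(VALID_MODES)}")
    return {
        "url": f"{base_url}/extract",
        "data": {"document_mode": document_mode,
                 "prefer_vlm": str(prefer_vlm).lower()},
    }

def extract_document(path: str, **kwargs) -> dict:
    import requests  # deferred so the builder above has no hard dependency
    req = build_extract_request(**kwargs)
    with open(path, "rb") as f:
        resp = requests.post(req["url"], files={"file": f}, data=req["data"])
    resp.raise_for_status()
    return resp.json()
```

Validating `document_mode` client-side gives a clear error before any bytes are uploaded.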
Document Mode Routing
Three processing modes handle different document types:
- structured: Forms, invoices, simple documents. Uses traditional OCR.
- unstructured: Complex reports, flowcharts. Uses DeepSeek VL with complexity analysis.
- auto: Analyzes document and routes automatically.
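The three modes reduce to a small routing function. This is a minimal sketch with hypothetical names; in the real system "auto" mode would consume the complexity analysis described later:

```python
from enum import Enum

class Mode(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"
    AUTO = "auto"

def route(mode: str, is_complex: bool) -> str:
    """Pick an extraction backend for the requested document_mode."""
    mode = Mode(mode)  # Raises ValueError on unknown modes.
    if mode is Mode.STRUCTURED:
        return "ocr"
    if mode is Mode.UNSTRUCTURED:
        return "deepseek_vl"
    # AUTO: let the complexity score decide.
    return "deepseek_vl" if is_complex else "ocr"
```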
Performance Optimizations
GPU Acceleration
For high-volume processing, vLLM provides efficient model serving:
```python
from vllm import LLM

# vLLM configuration for DeepSeek VL
llm = LLM(
    model="deepseek-ai/deepseek-vl-7b-chat",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)
```
Batch Processing
Documents are batched for throughput:
```python
from typing import List

BATCH_SIZE = 16  # Configurable based on GPU memory

async def process_batch(documents: List[Document]):
    # Convert all to images
    images = [doc_to_image(doc) for doc in documents]
    # Single VLM call with the whole batch
    results = await vlm.extract_batch(images)
    return results
```
Complexity Analysis
Before processing, the system analyzes document complexity to choose the optimal extraction strategy:
```python
import cv2
import numpy as np

def analyze_complexity(image: np.ndarray) -> ComplexityLevel:
    # Edge detection for structure
    edges = cv2.Canny(image, 50, 150)
    # Contour detection for diagrams (findContours returns contours + hierarchy)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Table grid detection
    horizontal_lines = detect_lines(image, "horizontal")
    vertical_lines = detect_lines(image, "vertical")
    has_tables = len(horizontal_lines) > 2 and len(vertical_lines) > 2
    # Score complexity
    if len(contours) > 100 or has_tables:
        return ComplexityLevel.COMPLEX
    return ComplexityLevel.SIMPLE
```
Simple documents get fast OCR processing. Complex documents with tables, diagrams, or dense layouts are routed to DeepSeek VL for better accuracy.
Results
The system achieves:
| Metric | Value |
|---|---|
| Extraction Accuracy | 90%+ |
| Processing Time (single page) | < 3 seconds |
| Supported Formats | 8 |
| VLM Providers | 5 (with failover) |
| Test Coverage | 272 test cases |
Tech Stack
| Layer | Technology |
|---|---|
| API Framework | FastAPI with async/await |
| Document Processing | Docling, PyMuPDF, python-docx |
| OCR | Tesseract, DeepSeek VL |
| VLM | OpenRouter, GPT-4V, Claude, Gemini |
| Image Processing | OpenCV, Pillow, scipy |
| Deployment | Docker with GPU support |
Lessons Learned
1. Hybrid beats pure AI: Traditional OCR for 80% of documents, VLM for the hard 20%. Cost-effective and accurate.
2. Preprocessing matters: Deskewing and denoising improve accuracy more than model upgrades.
3. Prompts are code: The extraction prompt is versioned, tested, and optimized like any critical code.
4. Graceful degradation: When VLM fails, fall back to OCR results. Something is better than nothing.
5. Multi-provider resilience: API outages happen. Support multiple providers with automatic failover.
Conclusion
Building a production document extraction system requires balancing accuracy, cost, and reliability. The hybrid approach of combining traditional OCR with Vision-Language Models provides the best of both worlds: speed and cost efficiency for standard documents, with AI-powered extraction for complex cases.
The key is intelligent routing. Not every document needs GPT-4 Vision. Quality assessment and automatic routing cut costs while maintaining accuracy where it matters.
This system was built for enterprise document processing, handling invoices, contracts, and forms across multiple industries.