Introduction


Document processing is one of the highest-ROI applications of AI in business. Organizations spend countless hours manually extracting data from invoices, contracts, forms, and reports. AI-powered document processing can handle these tasks in seconds, usually at lower cost and, for well-structured documents, with accuracy that matches or exceeds manual data entry.


The Document Processing Pipeline


A complete document processing system has five stages:


1. Document Ingestion


Documents arrive in various formats:


  • **PDFs**: Scanned images, digital PDFs, fillable forms
  • **Images**: JPEG, PNG, TIFF from phone cameras or scanners
  • **Office formats**: DOCX, XLSX with embedded data
  • **HTML/emails**: Web content and email attachments

Each format requires different preprocessing:


    
    from pathlib import Path

    # pdf_to_images, pdf_to_text, enhance_image, and docx_to_text are
    # placeholder helpers; back them with the libraries of your choice.
    def preprocess_document(file_path):
        ext = Path(file_path).suffix.lower()

        if ext == ".pdf":
            images = pdf_to_images(file_path, dpi=300)
            text = pdf_to_text(file_path)  # Text layer, present in digital PDFs
            return {"images": images, "text": text, "type": "pdf"}

        elif ext in (".jpg", ".jpeg", ".png", ".tif", ".tiff"):
            image = enhance_image(file_path)  # Denoise, deskew, enhance contrast
            return {"images": [image], "type": "image"}

        elif ext == ".docx":
            text = docx_to_text(file_path)
            return {"text": text, "type": "docx"}

        else:
            raise ValueError(f"Unsupported format: {ext}")
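
A PDF can be either digital or scanned, so the pipeline needs some way to decide whether the extracted text layer is usable or whether the page images should go to OCR. The sketch below is one simple heuristic, not a standard method: `needs_ocr` and the 50-characters-per-page threshold are assumptions chosen for illustration and should be tuned on real documents.

```python
def needs_ocr(extracted_text: str, page_count: int, min_chars_per_page: int = 50) -> bool:
    """Return True when the embedded text layer looks too thin to trust."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only non-whitespace characters so a blank text layer scores zero.
    visible_chars = sum(1 for ch in extracted_text if not ch.isspace())
    return visible_chars / page_count < min_chars_per_page


# A scanned PDF typically yields little or no text, so it is sent to OCR.
scanned = needs_ocr("", page_count=3)
# A digital PDF with a dense text layer can skip OCR.
digital = needs_ocr("Invoice 1042 " * 40, page_count=1)
```

A per-page version of the same check would also catch mixed PDFs where only some pages are scans.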
    
    

2. Optical Character Recognition (OCR)


For scanned documents and images, OCR converts visual text to machine-readable text:


    
    import pytesseract
    from PIL import Image

    def ocr_document(image_path):
        image = Image.open(image_path)
        # OEM 3: default engine; PSM 6: assume a single uniform block of text
        custom_config = r'--oem 3 --psm 6 -l eng'
        data = pytesseract.image_to_data(
            image,
            config=custom_config,
            output_type=pytesseract.Output.DICT
        )
        # image_to_data also emits empty entries for layout rows; keep real words only
        keep = [i for i, word in enumerate(data["text"]) if word.strip()]
        return {
            "full_text": pytesseract.image_to_string(image, config=custom_config),
            "words": [data["text"][i] for i in keep],
            "positions": [
                (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
                for i in keep
            ]
        }
    
    

**Modern OCR alternatives**:

  • **Azure Document Intelligence**: Prebuilt models for structured documents (invoices, receipts)
  • **Google Document AI**: Strong general-purpose OCR with entity extraction
  • **Tesseract + Post-processing**: Free and open source, but the raw output usually needs cleanup
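
As an illustration of the kind of cleanup Tesseract output tends to need, the sketch below drops low-confidence words from the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)`. The function name `confident_words` and the threshold of 60 are assumptions for this example; the `text`, `conf`, `left`, `top`, `width`, and `height` keys are the ones pytesseract provides.

```python
def confident_words(data: dict, min_conf: float = 60.0) -> list[dict]:
    """Keep only words whose OCR confidence meets the threshold."""
    kept = []
    rows = zip(data["text"], data["conf"], data["left"],
               data["top"], data["width"], data["height"])
    for text, conf, left, top, width, height in rows:
        # Layout rows carry no word and a confidence of -1. Some pytesseract
        # versions return conf as strings, so convert before comparing.
        if text.strip() and float(conf) >= min_conf:
            kept.append({"text": text, "conf": float(conf),
                         "box": (left, top, width, height)})
    return kept


# Hand-written sample in the same shape as pytesseract's DICT output.
sample = {
    "text": ["", "Invoice", "N0.", "1042"],
    "conf": ["-1", "96", "41", 88.5],
    "left": [0, 10, 80, 120],
    "top": [0, 5, 5, 5],
    "width": [200, 60, 30, 40],
    "height": [50, 12, 12, 12],
}
words = confident_words(sample)
```

Words that fall below the threshold are good candidates for a second pass with a different page segmentation mode or for human review.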

3. Document Classification


Classify documents before extraction to route them to the correct pipeline:


    
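    def classify_document(text):
        categories = [
            "invoice", "contract", "resume", "receipt",
            "medical_record", "legal_filing", "report", "other"
        ]

        # call_llm is a placeholder for your LLM client wrapper
        response = call_llm(f"""
        Classify this document into exactly one category: {', '.join(categories)}
        Respond with only the category name.

        Document text:
        {text[:2000]}
        """)

        # Normalize the reply and reject labels outside the allowed set
        label = response.strip().lower()
        if label not in categories:
            return "other", 0.0

        # extract_confidence is a placeholder: the reply holds only a label, so a
        # real score has to come from token log-probabilities or a follow-up prompt
        confidence = extract_confidence(response)
        return label, confidence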
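
Once a label and confidence are available, routing can be as simple as a lookup table with a confidence floor. The following is a minimal sketch: the pipeline names, the `route_document` function, and the 0.8 threshold are illustrative assumptions rather than fixed recommendations.

```python
# Hypothetical mapping from document category to processing pipeline.
PIPELINES = {
    "invoice": "invoice_pipeline",
    "contract": "contract_pipeline",
    "report": "report_pipeline",
}


def route_document(label: str, confidence: float, min_confidence: float = 0.8) -> str:
    """Pick a pipeline, falling back to human review when unsure."""
    if confidence < min_confidence:
        return "human_review"
    # Categories without a dedicated pipeline also go to a person.
    return PIPELINES.get(label, "human_review")
```

Sending unknown categories to review instead of a default pipeline keeps misrouted documents from producing silently wrong extractions.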
    
    

4. Data Extraction


Extract structured data from documents using schema-driven prompts:


    
    import json

    def extract_invoice_data(text):
        # Field name -> expected type, shown to the model as the target shape
        schema = {
            "invoice_number": "string",
            "date": "date (YYYY-MM-DD)",
            "vendor_name": "string",
            "vendor_address": "string",
            "customer_name": "string",
            "line_items": ["description", "quantity", "unit_price", "total"],
            "subtotal": "number",
            "tax": "number",
            "total": "number",
            "currency": "string"
        }

        # call_llm is a placeholder for your LLM client wrapper
        extraction = call_llm(f"""
        Extract the following fields from this invoice text.
        Return ONLY valid JSON matching this schema:
        {json.dumps(schema, indent=2)}

        Invoice text:
        {text}

        If a field is not found, use null. Do not guess values.
        """)

        # Raises json.JSONDecodeError if the reply is not pure JSON
        return json.loads(extraction)
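
Models do not always honor a "return ONLY valid JSON" instruction and sometimes wrap the reply in a markdown code fence. A more tolerant parser might look like the sketch below; `parse_llm_json` is a hypothetical helper, and it handles only the fenced-reply case, not every malformed response.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the source readable
_FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def parse_llm_json(reply: str) -> dict:
    """Parse a JSON object from an LLM reply, tolerating a surrounding code fence."""
    cleaned = reply.strip()
    match = _FENCED_REPLY.match(cleaned)
    if match:
        cleaned = match.group(1)
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM reply is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed
```

Raising `ValueError` gives the caller a single exception type to catch when deciding whether to retry the prompt or send the document to review.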
    
    

**Multimodal extraction** with vision-capable LLMs (GPT-4o, Claude 3.5) can process document images directly, bypassing OCR:


    
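    import base64
    import json
    import mimetypes

    from openai import OpenAI

    client = OpenAI()  # Reads OPENAI_API_KEY from the environment

    def extract_from_image(image_path, schema):
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()

        # Match the data URL's MIME type to the file instead of assuming JPEG
        mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

        response = client.chat.completions.create(
            model="gpt-4o",
            response_format={"type": "json_object"},  # Ask for a bare JSON object
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Extract data from this document image. Return JSON matching: {json.dumps(schema)}"},
                    {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{image_b64}"}}
                ]
            }]
        )
        return json.loads(response.choices[0].message.content)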
    
    

5. Validation and Export


Validate extracted data against business rules before export:


    
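    import re

    def validate_extraction(data, schema):
        # schema maps field name -> rules, e.g. {"required": True, "pattern": r"\d{4}-\d{2}-\d{2}"}
        errors = []
        for field, rules in schema.items():
            value = data.get(field)
            if rules.get("required") and value is None:
                errors.append(f"Missing required field: {field}")
            # Compare against None so falsy values such as 0 are still checked
            if "pattern" in rules and value is not None:
                # fullmatch rejects values with extra characters after the pattern
                if not re.fullmatch(rules["pattern"], str(value)):
                    errors.append(f"Field {field} fails pattern validation")
        return errors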
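
Field-level rules do not catch numbers that are individually plausible but inconsistent with each other. For invoices, a cross-field arithmetic check is a common business rule. The sketch below assumes each line item is a dict with a `total` key and uses a tolerance of 0.01 for rounding; both the function name `check_invoice_totals` and these choices are assumptions for illustration.

```python
from decimal import Decimal


def check_invoice_totals(data: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Check that line items, subtotal, tax, and total agree with each other."""
    errors = []

    def to_decimal(value) -> Decimal:
        # Go through str() so floats like 17.05 convert without binary noise.
        return Decimal(str(value))

    items = data.get("line_items") or []
    subtotal, tax, total = data.get("subtotal"), data.get("tax"), data.get("total")

    if items and subtotal is not None:
        items_sum = sum(
            (to_decimal(item["total"]) for item in items if item.get("total") is not None),
            Decimal("0"),
        )
        if abs(items_sum - to_decimal(subtotal)) > tolerance:
            errors.append(f"Line items sum to {items_sum}, but subtotal is {subtotal}")

    # Only compare when all three values were extracted.
    if subtotal is not None and tax is not None and total is not None:
        expected = to_decimal(subtotal) + to_decimal(tax)
        if abs(expected - to_decimal(total)) > tolerance:
            errors.append(f"subtotal + tax = {expected}, but total is {total}")

    return errors
```

Any errors returned here can be appended to the field-level errors, so a single non-empty list decides whether the document goes to the exception queue.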
    
    

Production Architecture


    
    Documents → Queue → Worker Pool → Storage
                             ↓
                   Classification Router
                      ↙      ↓      ↘
                 Invoice   Contract  Report
                 Pipeline  Pipeline  Pipeline
                      ↘      ↓      ↙
                 Extraction → Validation → Export → Database
                                   ↓
                            Exception Queue
                                   ↓
                             Human Review
    
    

Key components:

  • **Document queue**: SQS, RabbitMQ, or Redis for managing processing load
  • **Worker pool**: Auto-scaling workers for parallel processing
  • **Exception queue**: Documents with low confidence or validation errors
  • **Human review interface**: Dashboard for manual review of exceptions
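
Before committing to SQS, RabbitMQ, or Redis, the queue, worker pool, and exception queue can be prototyped in a single process with the standard library. This is a rough sketch under that assumption: `run_workers` and the `process` callback are hypothetical names, and a production system would add retries, persistence, and auto-scaling.

```python
import queue
import threading


def run_workers(documents, process, num_workers=4):
    """Process documents in parallel; failures are collected for human review."""
    work = queue.Queue()
    results, exceptions = [], []
    lock = threading.Lock()

    for doc in documents:
        work.put(doc)

    def worker():
        while True:
            try:
                doc = work.get_nowait()
            except queue.Empty:
                return  # Queue drained, so this worker exits.
            try:
                outcome = process(doc)
                with lock:
                    results.append(outcome)
            except Exception as exc:
                # Stand-in for the exception queue: keep the document and the reason.
                with lock:
                    exceptions.append((doc, str(exc)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, exceptions
```

Because workers finish in no particular order, callers should not rely on the order of `results`.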

Handling Edge Cases


  • **Poor quality scans**: Apply image enhancement (deskew, denoise, contrast adjustment)
  • **Multi-language documents**: Use language detection and route to the appropriate model
  • **Handwritten text**: Requires specialized handwriting recognition (Azure, Google)
  • **Tables and forms**: Structure-aware extraction using layout understanding
  • **Very long documents**: Chunk and process section by section, then merge results
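
For very long documents, chunking and merging can be sketched as follows. The character-based chunk size, the overlap, and the merge policy (first non-null value wins for scalar fields, list fields are concatenated without duplicates) are all assumptions for illustration; `chunk_text` and `merge_extractions` are hypothetical helpers.

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so fields spanning a boundary are not lost."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk already reaches the end of the text.
        start += chunk_size - overlap
    return chunks


def merge_extractions(partials: list[dict]) -> dict:
    """Combine per-chunk results into one record."""
    merged = {}
    for partial in partials:
        for field, value in partial.items():
            if value is None:
                continue
            if isinstance(value, list):
                target = merged.setdefault(field, [])
                for item in value:
                    # Overlapping chunks can report the same item twice.
                    if item not in target:
                        target.append(item)
            elif merged.get(field) is None:
                merged[field] = value
    return merged
```

Splitting on section or page boundaries, when they are available, usually works better than fixed character offsets.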

Measuring Accuracy


Track these metrics per document type:


  • **Field-level accuracy**: Correct extractions / total fields
  • **Document-level accuracy**: Perfect extractions / total documents
  • **Rejection rate**: Documents sent to human review
  • **Time savings**: Manual processing time vs AI processing time
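
The first three metrics can be computed from a labeled sample in a few lines. This sketch assumes predictions and ground truth are parallel lists of flat dicts and that a field counts as correct only on exact equality; `score_extractions` and its parameter names are hypothetical.

```python
def score_extractions(predictions, ground_truth, review_count=0, received_count=0):
    """Compute field-level accuracy, document-level accuracy, and rejection rate."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")

    total_fields = correct_fields = perfect_docs = 0
    for predicted, expected in zip(predictions, ground_truth):
        matches = sum(1 for field, value in expected.items()
                      if predicted.get(field) == value)
        correct_fields += matches
        total_fields += len(expected)
        if matches == len(expected):
            perfect_docs += 1

    doc_count = len(ground_truth)
    return {
        "field_accuracy": correct_fields / total_fields if total_fields else 0.0,
        "document_accuracy": perfect_docs / doc_count if doc_count else 0.0,
        # Share of all received documents that were sent to human review.
        "rejection_rate": review_count / received_count if received_count else 0.0,
    }
```

Exact equality is strict; for amounts and dates it is often worth normalizing values (for example, rounding or parsing to a common format) before comparing.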

Conclusion


AI document processing transforms document-heavy workflows from hours of manual work to seconds of automated processing. The key to success is building a pipeline that handles format diversity, uses the right OCR/extraction approach for each document type, and includes robust validation with human review for edge cases. Start with a single document type (like invoices), perfect the pipeline, then expand to additional types.