Introduction
Document processing is one of the highest-ROI applications of AI in business. Organizations spend countless hours manually extracting data from invoices, contracts, forms, and reports. AI-powered document processing can handle these tasks in seconds, often with accuracy comparable to or better than manual data entry, at far lower cost.
The Document Processing Pipeline
A complete document processing system has five stages:
1. Document Ingestion
Documents arrive in various formats:

- PDFs, either digital (with a text layer) or scanned
- Images such as JPEG, PNG, and TIFF
- Word documents (DOCX)

Each format requires different preprocessing:
```python
from pathlib import Path

def preprocess_document(file_path):
    ext = Path(file_path).suffix.lower()
    if ext == ".pdf":
        images = pdf_to_images(file_path, dpi=300)
        text = pdf_to_text(file_path)  # For digital PDFs with a text layer
        return {"images": images, "text": text, "type": "pdf"}
    elif ext in (".jpg", ".jpeg", ".png", ".tiff"):
        image = enhance_image(file_path)  # Denoise, deskew, enhance contrast
        return {"images": [image], "type": "image"}
    elif ext == ".docx":
        text = docx_to_text(file_path)
        return {"text": text, "type": "docx"}
    else:
        raise ValueError(f"Unsupported format: {ext}")
```
2. Optical Character Recognition (OCR)
For scanned documents and images, OCR converts visual text to machine-readable text:
```python
import pytesseract
from PIL import Image

def ocr_document(image_path):
    image = Image.open(image_path)
    # --oem 3: default LSTM engine; --psm 6: assume a uniform block of text
    custom_config = r"--oem 3 --psm 6 -l eng"
    data = pytesseract.image_to_data(
        image,
        config=custom_config,
        output_type=pytesseract.Output.DICT,
    )
    return {
        "full_text": pytesseract.image_to_string(image, config=custom_config),
        "words": data["text"],
        "positions": list(zip(data["left"], data["top"], data["width"], data["height"])),
    }
```
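Tesseract also reports a per-word confidence score in the `conf` field of the `image_to_data` output. A small filter (a sketch; the threshold of 60 is an illustrative assumption to tune per document type) can drop low-confidence words before they pollute downstream extraction:

```python
def filter_by_confidence(data, min_conf=60):
    """Keep only words whose OCR confidence meets the threshold.

    `data` is the dict returned by pytesseract.image_to_data with
    Output.DICT; confidences come back as strings or ints, and -1
    marks layout boxes that contain no word.
    """
    kept = []
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and float(conf) >= min_conf:
            kept.append(word)
    return kept
```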
**Modern OCR alternatives**: managed services such as AWS Textract, Google Document AI, and Azure AI Document Intelligence add layout and table analysis on top of text recognition, and vision-capable LLMs (covered below) can skip standalone OCR entirely.
3. Document Classification
Classify documents before extraction to route them to the correct pipeline:
```python
def classify_document(text):
    categories = [
        "invoice", "contract", "resume", "receipt",
        "medical_record", "legal_filing", "report", "other",
    ]
    response = call_llm(f"""
Classify this document into exactly one category: {', '.join(categories)}
Respond with the category name and a confidence between 0 and 1,
formatted as: <category> <confidence>

Document text:
{text[:2000]}
""")
    parts = response.strip().lower().split()
    # Fall back to "other" if the model returns anything outside the list
    category = parts[0] if parts and parts[0] in categories else "other"
    try:
        confidence = float(parts[1])
    except (IndexError, ValueError):
        confidence = 0.0
    return category, confidence
```
4. Data Extraction
Extract structured data from documents using schema-driven prompts:
```python
import json

def extract_invoice_data(text):
    schema = {
        "invoice_number": "string",
        "date": "date (YYYY-MM-DD)",
        "vendor_name": "string",
        "vendor_address": "string",
        "customer_name": "string",
        "line_items": ["description", "quantity", "unit_price", "total"],
        "subtotal": "number",
        "tax": "number",
        "total": "number",
        "currency": "string",
    }
    extraction = call_llm(f"""
Extract the following fields from this invoice text.
Return ONLY valid JSON matching this schema:
{json.dumps(schema, indent=2)}

Invoice text:
{text}

If a field is not found, use null. Do not guess values.
""")
    return json.loads(extraction)
```
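`json.loads` fails when the model wraps its answer in markdown fences or adds surrounding prose, even when instructed not to. A tolerant parser (a sketch; production code may want stricter handling or a retry) strips fences and falls back to the outermost braces:

```python
import json
import re

def parse_llm_json(text):
    """Parse JSON from an LLM response, tolerating markdown fences."""
    # Strip ```json ... ``` or ``` ... ``` wrappers if present
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost JSON object in the text
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise
```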
**Multimodal extraction** with vision-capable LLMs (GPT-4o, Claude 3.5) can process document images directly, bypassing OCR:
```python
import base64
import json

from openai import OpenAI

client = OpenAI()

def extract_from_image(image_path, schema):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Extract data from this document image. "
                         f"Return JSON matching: {json.dumps(schema)}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```
5. Validation and Export
Validate extracted data against business rules before export:
```python
import re

def validate_extraction(data, schema):
    """Check extracted data against per-field rules, e.g.
    {"invoice_number": {"required": True, "pattern": "^INV-[0-9]+$"}}."""
    errors = []
    for field, rules in schema.items():
        if rules.get("required") and data.get(field) is None:
            errors.append(f"Missing required field: {field}")
        if "pattern" in rules and data.get(field):
            if not re.match(rules["pattern"], str(data[field])):
                errors.append(f"Field {field} fails pattern validation")
    return errors
```
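The five stages can be chained into a single entry point. This sketch takes the stage functions as parameters (dependency injection, so each stage can be swapped or mocked in tests); the stage names mirror the functions above, and the classification-specific extractor dispatch is an assumption of this sketch:

```python
def process_document(file_path, *, preprocess, ocr, classify, extract, validate):
    """Run a document through ingestion, OCR, classification,
    extraction, and validation; return the data plus any errors."""
    doc = preprocess(file_path)
    # OCR only when preprocessing found no digital text layer
    text = doc.get("text") or ocr(doc["images"][0])
    category, confidence = classify(text)
    data = extract(category, text)
    errors = validate(data)
    return {"category": category, "confidence": confidence,
            "data": data, "errors": errors}
```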
Production Architecture
```
Documents → Queue → Worker Pool → Storage
                        ↓
               Classification Router
               ↙        ↓        ↘
          Invoice    Contract    Report
          Pipeline   Pipeline    Pipeline
               ↘        ↓        ↙
       Extraction → Validation → Export → Database
                        ↓
                 Exception Queue
                        ↓
                  Human Review
```
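The routing layer in the diagram can be sketched as a queue consumer that dispatches each classified document to its type-specific pipeline, sending anything without a matching pipeline to the exception queue (a sketch; the registry-of-callables shape is an assumption, not a prescribed design):

```python
import queue

def run_worker(task_queue, pipelines, exception_queue, classify):
    """Consume documents from the queue and route each to the
    pipeline registered for its classified type; anything without
    a matching pipeline goes to the exception queue for review."""
    while True:
        try:
            doc = task_queue.get(timeout=1)
        except queue.Empty:
            break
        category, _confidence = classify(doc["text"])
        pipeline = pipelines.get(category)
        if pipeline is None:
            exception_queue.put(doc)
        else:
            pipeline(doc)
        task_queue.task_done()
```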
Key components:

- An ingestion queue that decouples document arrival from processing
- A worker pool that scales with volume
- A classification router that dispatches to type-specific pipelines
- An exception queue feeding a human review step for failures and low-confidence results
Handling Edge Cases
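The pipeline above already absorbs some edge cases: poor-quality scans are denoised and deskewed during preprocessing, and unsupported formats raise an error at ingestion. The remaining cases, low-confidence classifications and extractions that fail validation, should not be exported silently; they belong in the exception queue for human review. A small routing helper (a sketch; the 0.7 threshold is an illustrative assumption) makes that decision explicit:

```python
def route_result(data, errors, confidence, min_confidence=0.7):
    """Return 'export' when extraction is clean and classification is
    confident; otherwise 'human_review' with the reasons attached."""
    reasons = list(errors)
    if confidence < min_confidence:
        reasons.append(f"low classification confidence: {confidence:.2f}")
    if reasons:
        return "human_review", reasons
    return "export", []
```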
Measuring Accuracy
Track these metrics per document type:

- Field-level extraction accuracy against a labeled ground-truth set
- Classification accuracy
- Straight-through processing rate (documents exported with no human touch)
- Exception rate (documents routed to human review)
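Field-level accuracy against labeled ground truth underpins all of these; a minimal sketch that compares extracted values field by field across a batch of documents:

```python
def field_accuracy(predictions, ground_truth):
    """Fraction of fields extracted correctly, keyed by field name,
    across paired lists of predicted and expected document dicts."""
    correct, total = {}, {}
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total[field] = total.get(field, 0) + 1
            if pred.get(field) == expected:
                correct[field] = correct.get(field, 0) + 1
    return {f: correct.get(f, 0) / total[f] for f in total}
```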
Conclusion
AI document processing transforms document-heavy workflows from hours of manual work to seconds of automated processing. The key to success is building a pipeline that handles format diversity, uses the right OCR/extraction approach for each document type, and includes robust validation with human review for edge cases. Start with a single document type (like invoices), perfect the pipeline, then expand to additional types.