Local AI Document Intelligence Finance Automation

Kill Your OCR Pipeline

How a local vision LLM turns messy invoice PDFs into clean spreadsheet data—without sending a single byte off your network.

12 min read · Qwen3-VL-8B · LM Studio · Python

Invoices are "structured" the way a teenager's bedroom is "organized"—everything is technically there, but good luck finding it programmatically.

They arrive as PDFs with wildly inconsistent layouts. Nested tables sit next to logos. Tax rows hide below fold lines. Vendor A puts the total top-right; Vendor B buries it on page two. And somewhere in the middle, your AP team is copy-pasting line items into a spreadsheet, because the "automation" you built three years ago keeps breaking.

There's a better way now. Vision-capable LLMs can look at an invoice image the way a human does—and extract structured data from it directly. No OCR chain. No template matching. No regex graveyard.

This article walks through a fully local system that does exactly this, powered by Qwen3-VL-8B, running on your own hardware. Every byte of financial data stays inside your network.

Source CodeGitHub repo Docker ImageOne-command install
· · ·

The Problem With "Structured" Documents

If you've ever tried to automate invoice processing, you've hit the same wall everyone else has. Invoices are structured for human eyes, not machines. The "structure" is entirely visual: columns implied by alignment, totals associated with labels by proximity, line items grouped by whitespace.

Here's what makes them brutal for automation:

Why invoices break traditional parsers
Every vendor uses a different layout. Templates change without warning. Tables are visual, not semantic—there are no actual <table> tags in a PDF. Pages may be scanned, rotated, multi-page, or image-only. Stamps, logos, and handwritten notes add noise everywhere.

Rule-based systems and template matchers work fine—until they don't. And they stop working exactly when it matters: when you onboard a new vendor, when someone updates their invoice template, or when a batch of scanned PDFs lands in your inbox.

OCR Pipelines vs. Vision LLMs

The standard approach for years has been an OCR-first pipeline. Convert the PDF to text, then throw it at a language model. It works—sort of. But it has a fundamental flaw: OCR destroys the one thing that makes invoices readable: layout.

THE OLD WAY — OCR PIPELINE 📄 PDF Invoice input 🔍 OCR Tesseract / etc. 📝 Plain Text ⚠ Layout lost 🤖 Text LLM Guesses structure ❌ Fragile JSON Errors compound THE NEW WAY — VISION LLM 📄 PDF Invoice input 🖼️ PNG Image ✓ Layout intact 👁️ Vision LLM Sees the page ✅ Clean JSON Structured output

The OCR pipeline loses spatial information at step 2. The vision pipeline preserves it end-to-end.

When OCR converts a PDF to plain text, it flattens the page. Row/column alignment disappears. A line item that was clearly in a table now looks like a random string next to a number. The LLM downstream has to guess what was a column header and what was a value—and it guesses wrong more often than you'd like.

A vision model skips this entirely. It takes a page image as input, sees the spatial relationships, and reasons over them directly. Table boundaries. Column alignment. The way a total sits beneath a column of numbers. The spatial logic that makes an invoice readable to you is exactly what the model uses too.

OCR Pipeline

  • Layout destroyed during extraction
  • Table rows/columns misaligned
  • Small text & stamps cause noise
  • Needs post-processing heuristics
  • Vendor-specific tuning required

Vision LLM

  • Layout preserved as visual input
  • Table geometry directly understood
  • Handles noise naturally
  • Schema-driven clean output
  • Vendor-agnostic by design
· · ·

Why This Has to Run Locally

Here's the part most AI demos conveniently skip: invoices are sensitive documents.

They contain vendor banking details, tax IDs, internal cost centers, payment terms, and sometimes PII. Sending them to an external API—even a reputable one—opens a can of compliance worms that most finance teams won't touch.

The compliance question
Every external API call with financial data triggers the same questions: Where is the data stored? Is it logged? Is it used for training? What's the retention policy? How does this affect SOC 2, ISO 27001, GDPR, or vendor agreements?

Running inference locally eliminates the entire category of risk. Your documents never leave your network. There's no third-party data processor. Retention is whatever you decide it is. And your compliance team doesn't need to review yet another vendor's DPA.

CLOUD API — DATA LEAVES YOUR NETWORK 🏢 Your Network Financial documents INTERNET ☁️ External API Stored? Logged? Trained? ❓ Unknown infra SOC 2? GDPR? Retention? LOCAL INFERENCE — DATA NEVER LEAVES 📄 Documents Your invoices 🧠 Local Model Qwen3-VL-8B 📊 Structured Data Your storage 🔒 ENTIRE PROCESS STAYS WITHIN YOUR SECURITY BOUNDARY

Cloud APIs create a compliance surface. Local inference eliminates it entirely.

For many organizations, this is the difference between "interesting proof of concept" and "something we can actually deploy."

· · ·

The Model: Qwen3-VL-8B

Qwen3-VL-8B is an open multimodal model that can process image inputs and generate structured text outputs. At 8B parameters, it sits in the sweet spot for local deployment: powerful enough for reliable document extraction, small enough to run on a single GPU with quantization.

The setup is straightforward: host the model locally via LM Studio, which exposes an OpenAI-compatible API endpoint. Use 4-bit or 8-bit quantization depending on your hardware. Your application talks to it like any other API—except the endpoint is localhost.

What makes it work well for invoices specifically:

Capability Why it matters for invoices
Document layout understanding Correctly identifies table boundaries, column headers, and spatial groupings—even in messy scanned documents
JSON schema adherence When prompted with a target schema, reliably produces valid, parseable JSON instead of prose descriptions
Quantization tolerance 4-bit quantized versions maintain strong extraction quality, enabling deployment on consumer GPUs (16GB VRAM)
Inference speed Processes a typical single-page invoice in seconds, making batch workflows practical
· · ·

How It Works: End to End

The full pipeline is intentionally simple. Simplicity is a feature—fewer moving parts means fewer failure modes.

The Pipeline 📤 Upload PDFs Drag & drop, single or batch 1 🖼️ Convert to PNG Each page → image 2 👁️ Vision LLM Qwen3-VL-8B (local) 3 { } Structured JSON Scalars + arrays 4 🔍 Review & Validate Interactive table • expandable line items • visual checks 5 📊 Export to Excel / CSV Overview sheet + Line Items sheet • linked by invoice ID 6 By normalizing all input to images, the system avoids edge cases tied to PDF internals. The pipeline works identically for scanned, digital, rotated, or multi-page PDFs.

Six steps from messy PDFs to clean spreadsheets. The model does the hard work at step 3.

A key design choice: converting PDFs to PNG images before inference. This normalizes the input. It doesn't matter if the PDF is natively digital, scanned, rotated, or generated by 15 different accounting packages. The model always gets the same kind of input: a page image.

Here's what the actual application looks like:

localhost:8501
Invoice Processing Application — upload, extract, review, and export invoice data

The invoice extraction app: upload PDFs on the left, review structured results on the right, export to Excel.

· · ·

Schema-Driven Extraction

The extraction isn't freeform. The model is prompted with an explicit target schema, and it returns data shaped to match. This eliminates the ambiguity that plagues open-ended prompts and makes downstream processing predictable.

The schema has two types of fields:

Scalar Fields

  • vendor_name
  • invoice_number
  • invoice_date
  • total_amount
  • currency

Array Fields

  • line_items[].description
  • line_items[].quantity
  • line_items[].unit_price
  • line_items[].amount
prompt snippet // The model receives this schema as part of the prompt // and returns data that matches it exactly { "vendor_name": "string", "invoice_number": "string", "invoice_date": "string (YYYY-MM-DD)", "total_amount": "number", "currency": "string (ISO 4217)", "line_items": [ { "description": "string", "quantity": "number", "unit_price": "number", "amount": "number" } ] }

This schema-driven approach means you can extend the system to new document types by changing the schema—not the code. Purchase orders, shipping documents, contracts: same pipeline, different schema.

· · ·

Beyond Invoices: A Local Document Intelligence Layer

Here's the bigger picture. This system isn't really an "invoice tool." It's a pattern: local multimodal model + schema prompt + structured output = general document intelligence.

One model, many document workflows. The same architecture that extracts invoices today handles purchase orders, contracts, and compliance forms tomorrow.

Instead of building and maintaining vendor-specific parsers, regex pipelines, and template-matching logic, you deploy one multimodal model as shared infrastructure. The same model supports:

🧾Invoices
📋Purchase Orders
📑Contracts
🚚Shipping Docs
Compliance Forms
📈Internal Reports
· · ·

Operational Tradeoffs (Honest Assessment)

Local inference isn't free. You're trading cloud convenience for control. Here's what that looks like in practice:

Dimension Local Inference Cloud API
Data privacy Full control, nothing leaves your network Depends on vendor policies
Hardware GPU recommended (16GB+ VRAM) None required
Maintenance You own model updates, monitoring, tuning Vendor-managed
Cost model Fixed (hardware), predictable Variable (per-token), can spike
Compliance Simplified—no third-party DPA needed Requires vendor security review
Scaling Limited by hardware, add GPUs to scale Effectively unlimited

For teams processing hundreds or low thousands of invoices per month with sensitive financial data, local inference is often the better fit. If you're processing millions of documents and data sensitivity isn't a concern, cloud APIs may make more sense.

· · ·

Try It Yourself

The fastest way to get running is the Docker image. It packages the entire web app—you just need a local LLM endpoint (like LM Studio) running alongside it.

Prerequisites: Docker installed on your machine, and a local vision LLM running (e.g. Qwen3-VL-8B via LM Studio). Make sure your LM Studio server is started and accessible.
1

Pull the image

docker pull klaushofenbitzer/invoice_processing
2

Run the container

docker run -p 8501:8501 \ --add-host=host.docker.internal:host-gateway \ klaushofenbitzer/invoice_processing

The --add-host flag lets the container reach LM Studio on your host machine.

3

Open the app

Navigate to http://localhost:8501 — drag in an invoice and go.

That's it. Three commands, and you have a working invoice extraction system running entirely on your local machine.

Want to customize, extend the schema, or dig into the code? The full source is on GitHub:

Browse the Sourcegithub.com/khofenbitzer/invoice_processing Docker Hubhub.docker.com/r/klaushofenbitzer/invoice_processing
· · ·

The Shift That Matters

We're moving into a world where organizations deploy internal AI infrastructure rather than just consuming external APIs. Document processing is one of the highest-ROI entry points because the documents are messy, the data is sensitive, the automation value is immediate, and the accuracy requirements are clear and measurable.

Local multimodal models make this possible: AI capabilities inside the security boundary, not outside it.

Invoice extraction happens to be the most obvious place to start. But the architecture—local vision model, schema-driven prompts, structured output, human review—applies far more broadly. If you have documents that humans currently read and transcribe, this pattern can probably help.

TL;DR
Convert PDFs to images. Send them to a local vision LLM (Qwen3-VL-8B via LM Studio). Prompt with a target schema. Get structured JSON back. Review. Export to Excel. No data leaves your network. No vendor lock-in. No OCR pipeline to babysit.