Kill Your OCR Pipeline — Local Vision LLMs for Invoice Extraction

Invoices are "structured" the way a teenager's bedroom is "organized"—everything is technically there, but good luck finding it programmatically.

They arrive as PDFs with wildly inconsistent layouts. Nested tables sit next to logos. Tax rows hide below fold lines. Vendor A puts the total top-right; Vendor B buries it on page two. And somewhere in the middle, your AP team is copy-pasting line items into a spreadsheet, because the "automation" you built three years ago keeps breaking.

There's a better way now. Vision-capable LLMs can look at an invoice image the way a human does—and extract structured data from it directly. No OCR chain. No template matching. No regex graveyard.

This article walks through a fully local system that does exactly this, powered by Qwen3-VL-8B, running on your own hardware. Every byte of financial data stays inside your network.

· · ·

The Problem With "Structured" Documents

If you've ever tried to automate invoice processing, you've hit the same wall everyone else has. Invoices are structured for human eyes, not machines. The "structure" is entirely visual: columns implied by alignment, totals associated with labels by proximity, line items grouped by whitespace.

Here's what makes them brutal for automation:

Why invoices break traditional parsers

Every vendor uses a different layout. Templates change without warning. Tables are visual, not semantic—there are no actual <table> tags in a PDF. Pages may be scanned, rotated, multi-page, or image-only. Stamps, logos, and handwritten notes add noise everywhere.

Rule-based systems and template matchers work fine—until they don't. And they stop working exactly when it matters: when you onboard a new vendor, when someone updates their invoice template, or when a batch of scanned PDFs lands in your inbox.

OCR Pipelines vs. Vision LLMs

The standard approach for years has been an OCR-first pipeline. Convert the PDF to text, then throw it at a language model. It works—sort of. But it has a fundamental flaw: OCR destroys the one thing that makes invoices readable: layout.

The OCR pipeline loses spatial information at step 2. The vision pipeline preserves it end-to-end.

When OCR converts a PDF to plain text, it flattens the page. Row/column alignment disappears. A line item that was clearly in a table now looks like a random string next to a number. The LLM downstream has to guess what was a column header and what was a value—and it guesses wrong more often than you'd like.

A vision model skips this entirely. It takes a page image as input, sees the spatial relationships, and reasons over them directly. Table boundaries. Column alignment. The way a total sits beneath a column of numbers. The spatial logic that makes an invoice readable to you is exactly what the model uses too.

OCR Pipeline

Layout destroyed during extraction
Table rows/columns misaligned
Small text & stamps cause noise
Needs post-processing heuristics
Vendor-specific tuning required

Vision LLM

Layout preserved as visual input
Table geometry directly understood
Handles noise naturally
Schema-driven clean output
Vendor-agnostic by design

· · ·

Why This Has to Run Locally

Here's the part most AI demos conveniently skip: invoices are sensitive documents.

They contain vendor banking details, tax IDs, internal cost centers, payment terms, and sometimes PII. Sending them to an external API—even a reputable one—opens a can of compliance worms that most finance teams won't touch.

The compliance question

Every external API call with financial data triggers the same questions: Where is the data stored? Is it logged? Is it used for training? What's the retention policy? How does this affect SOC 2, ISO 27001, GDPR, or vendor agreements?

Running inference locally eliminates the entire category of risk. Your documents never leave your network. There's no third-party data processor. Retention is whatever you decide it is. And your compliance team doesn't need to review yet another vendor's DPA.

Cloud APIs create a compliance surface. Local inference eliminates it entirely.

For many organizations, this is the difference between "interesting proof of concept" and "something we can actually deploy."

· · ·

The Model: Qwen3-VL-8B

Qwen3-VL-8B is an open multimodal model that can process image inputs and generate structured text outputs. At 8B parameters, it sits in the sweet spot for local deployment: powerful enough for reliable document extraction, small enough to run on a single GPU with quantization.

The setup is straightforward: host the model locally via LM Studio, which exposes an OpenAI-compatible API endpoint. Use 4-bit or 8-bit quantization depending on your hardware. Your application talks to it like any other API—except the endpoint is localhost.

What makes it work well for invoices specifically:

Capability	Why it matters for invoices
Document layout understanding	Correctly identifies table boundaries, column headers, and spatial groupings—even in messy scanned documents
JSON schema adherence	When prompted with a target schema, reliably produces valid, parseable JSON instead of prose descriptions
Quantization tolerance	4-bit quantized versions maintain strong extraction quality, enabling deployment on consumer GPUs (16GB VRAM)
Inference speed	Processes a typical single-page invoice in seconds, making batch workflows practical

· · ·

How It Works: End to End

The full pipeline is intentionally simple. Simplicity is a feature—fewer moving parts means fewer failure modes.

Six steps from messy PDFs to clean spreadsheets. The model does the hard work at step 3.

A key design choice: converting PDFs to PNG images before inference. This normalizes the input. It doesn't matter if the PDF is natively digital, scanned, rotated, or generated by 15 different accounting packages. The model always gets the same kind of input: a page image.

Here's what the actual application looks like:

localhost:8501

Invoice Processing Application — upload, extract, review, and export invoice data

The invoice extraction app: upload PDFs on the left, review structured results on the right, export to Excel.

· · ·

Schema-Driven Extraction

The extraction isn't freeform. The model is prompted with an explicit target schema, and it returns data shaped to match. This eliminates the ambiguity that plagues open-ended prompts and makes downstream processing predictable.

The schema has two types of fields:

Scalar Fields

vendor_name
invoice_number
invoice_date
total_amount
currency

Array Fields

line_items[].description
line_items[].quantity
line_items[].unit_price
line_items[].amount

    prompt snippet
    // The model receives this schema as part of the prompt
// and returns data that matches it exactly

{
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "string (YYYY-MM-DD)",
  "total_amount": "number",
  "currency": "string (ISO 4217)",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "amount": "number"
    }
  ]
}
  

This schema-driven approach means you can extend the system to new document types by changing the schema—not the code. Purchase orders, shipping documents, contracts: same pipeline, different schema.

· · ·

Beyond Invoices: A Local Document Intelligence Layer

Here's the bigger picture. This system isn't really an "invoice tool." It's a pattern: local multimodal model + schema prompt + structured output = general document intelligence.

One model, many document workflows. The same architecture that extracts invoices today handles purchase orders, contracts, and compliance forms tomorrow.

Instead of building and maintaining vendor-specific parsers, regex pipelines, and template-matching logic, you deploy one multimodal model as shared infrastructure. The same model supports:

🧾Invoices

📋Purchase Orders

📑Contracts

🚚Shipping Docs

✅Compliance Forms

📈Internal Reports

· · ·

Operational Tradeoffs (Honest Assessment)

Local inference isn't free. You're trading cloud convenience for control. Here's what that looks like in practice:

Dimension	Local Inference	Cloud API
Data privacy	Full control, nothing leaves your network	Depends on vendor policies
Hardware	GPU recommended (16GB+ VRAM)	None required
Maintenance	You own model updates, monitoring, tuning	Vendor-managed
Cost model	Fixed (hardware), predictable	Variable (per-token), can spike
Compliance	Simplified—no third-party DPA needed	Requires vendor security review
Scaling	Limited by hardware, add GPUs to scale	Effectively unlimited

For teams processing hundreds or low thousands of invoices per month with sensitive financial data, local inference is often the better fit. If you're processing millions of documents and data sensitivity isn't a concern, cloud APIs may make more sense.

· · ·

Try It Yourself

The fastest way to get running is the Docker image. It packages the entire web app—you just need a local LLM endpoint (like LM Studio) running alongside it.

Prerequisites: Docker installed on your machine, and a local vision LLM running (e.g. Qwen3-VL-8B via LM Studio). Make sure your LM Studio server is started and accessible.

Pull the image

docker pull klaushofenbitzer/invoice_processing

Run the container

docker run -p 8501:8501 \
  --add-host=host.docker.internal:host-gateway \
  klaushofenbitzer/invoice_processing

The --add-host flag lets the container reach LM Studio on your host machine.

Open the app

Navigate to http://localhost:8501 — drag in an invoice and go.

That's it. Three commands, and you have a working invoice extraction system running entirely on your local machine.

Want to customize, extend the schema, or dig into the code? The full source is on GitHub:

· · ·

The Shift That Matters

We're moving into a world where organizations deploy internal AI infrastructure rather than just consuming external APIs. Document processing is one of the highest-ROI entry points because the documents are messy, the data is sensitive, the automation value is immediate, and the accuracy requirements are clear and measurable.

Local multimodal models make this possible: AI capabilities inside the security boundary, not outside it.

Invoice extraction happens to be the most obvious place to start. But the architecture—local vision model, schema-driven prompts, structured output, human review—applies far more broadly. If you have documents that humans currently read and transcribe, this pattern can probably help.

TL;DR

Convert PDFs to images. Send them to a local vision LLM (Qwen3-VL-8B via LM Studio). Prompt with a target schema. Get structured JSON back. Review. Export to Excel. No data leaves your network. No vendor lock-in. No OCR pipeline to babysit.