Invoices are "structured" the way a teenager's bedroom is "organized"—everything is technically there, but good luck finding it programmatically.
They arrive as PDFs with wildly inconsistent layouts. Nested tables sit next to logos. Tax rows hide below fold lines. Vendor A puts the total top-right; Vendor B buries it on page two. And somewhere in the middle, your AP team is copy-pasting line items into a spreadsheet, because the "automation" you built three years ago keeps breaking.
There's a better way now. Vision-capable LLMs can look at an invoice image the way a human does—and extract structured data from it directly. No OCR chain. No template matching. No regex graveyard.
This article walks through a fully local system that does exactly this, powered by Qwen3-VL-8B, running on your own hardware. Every byte of financial data stays inside your network.
The Problem With "Structured" Documents
If you've ever tried to automate invoice processing, you've hit the same wall everyone else has. Invoices are structured for human eyes, not machines. The "structure" is entirely visual: columns implied by alignment, totals associated with labels by proximity, line items grouped by whitespace.
Here's what makes them brutal for automation:
<table> tags in a PDF. Pages may be scanned, rotated, multi-page, or image-only. Stamps, logos, and handwritten notes add noise everywhere.
Rule-based systems and template matchers work fine—until they don't. And they stop working exactly when it matters: when you onboard a new vendor, when someone updates their invoice template, or when a batch of scanned PDFs lands in your inbox.
OCR Pipelines vs. Vision LLMs
The standard approach for years has been an OCR-first pipeline. Convert the PDF to text, then throw it at a language model. It works—sort of. But it has a fundamental flaw: OCR destroys the one thing that makes invoices readable: layout.
The OCR pipeline loses spatial information at step 2. The vision pipeline preserves it end-to-end.
When OCR converts a PDF to plain text, it flattens the page. Row/column alignment disappears. A line item that was clearly in a table now looks like a random string next to a number. The LLM downstream has to guess what was a column header and what was a value—and it guesses wrong more often than you'd like.
A vision model skips this entirely. It takes a page image as input, sees the spatial relationships, and reasons over them directly. Table boundaries. Column alignment. The way a total sits beneath a column of numbers. The spatial logic that makes an invoice readable to you is exactly what the model uses too.
OCR Pipeline
- Layout destroyed during extraction
- Table rows/columns misaligned
- Small text & stamps cause noise
- Needs post-processing heuristics
- Vendor-specific tuning required
Vision LLM
- Layout preserved as visual input
- Table geometry directly understood
- Handles noise naturally
- Schema-driven clean output
- Vendor-agnostic by design
Why This Has to Run Locally
Here's the part most AI demos conveniently skip: invoices are sensitive documents.
They contain vendor banking details, tax IDs, internal cost centers, payment terms, and sometimes PII. Sending them to an external API—even a reputable one—opens a can of compliance worms that most finance teams won't touch.
Running inference locally eliminates the entire category of risk. Your documents never leave your network. There's no third-party data processor. Retention is whatever you decide it is. And your compliance team doesn't need to review yet another vendor's DPA.
Cloud APIs create a compliance surface. Local inference eliminates it entirely.
For many organizations, this is the difference between "interesting proof of concept" and "something we can actually deploy."
The Model: Qwen3-VL-8B
Qwen3-VL-8B is an open multimodal model that can process image inputs and generate structured text outputs. At 8B parameters, it sits in the sweet spot for local deployment: powerful enough for reliable document extraction, small enough to run on a single GPU with quantization.
The setup is straightforward: host the model locally via LM Studio, which exposes an OpenAI-compatible API endpoint. Use 4-bit or 8-bit quantization depending on your hardware. Your application talks to it like any other API—except the endpoint is localhost.
What makes it work well for invoices specifically:
| Capability | Why it matters for invoices |
|---|---|
| Document layout understanding | Correctly identifies table boundaries, column headers, and spatial groupings—even in messy scanned documents |
| JSON schema adherence | When prompted with a target schema, reliably produces valid, parseable JSON instead of prose descriptions |
| Quantization tolerance | 4-bit quantized versions maintain strong extraction quality, enabling deployment on consumer GPUs (16GB VRAM) |
| Inference speed | Processes a typical single-page invoice in seconds, making batch workflows practical |
How It Works: End to End
The full pipeline is intentionally simple. Simplicity is a feature—fewer moving parts means fewer failure modes.
Six steps from messy PDFs to clean spreadsheets. The model does the hard work at step 3.
A key design choice: converting PDFs to PNG images before inference. This normalizes the input. It doesn't matter if the PDF is natively digital, scanned, rotated, or generated by 15 different accounting packages. The model always gets the same kind of input: a page image.
Here's what the actual application looks like:
The invoice extraction app: upload PDFs on the left, review structured results on the right, export to Excel.
Schema-Driven Extraction
The extraction isn't freeform. The model is prompted with an explicit target schema, and it returns data shaped to match. This eliminates the ambiguity that plagues open-ended prompts and makes downstream processing predictable.
The schema has two types of fields:
Scalar Fields
vendor_nameinvoice_numberinvoice_datetotal_amountcurrency
Array Fields
line_items[].descriptionline_items[].quantityline_items[].unit_priceline_items[].amount
// The model receives this schema as part of the prompt
// and returns data that matches it exactly
{
"vendor_name": "string",
"invoice_number": "string",
"invoice_date": "string (YYYY-MM-DD)",
"total_amount": "number",
"currency": "string (ISO 4217)",
"line_items": [
{
"description": "string",
"quantity": "number",
"unit_price": "number",
"amount": "number"
}
]
}
This schema-driven approach means you can extend the system to new document types by changing the schema—not the code. Purchase orders, shipping documents, contracts: same pipeline, different schema.
Beyond Invoices: A Local Document Intelligence Layer
Here's the bigger picture. This system isn't really an "invoice tool." It's a pattern: local multimodal model + schema prompt + structured output = general document intelligence.
Instead of building and maintaining vendor-specific parsers, regex pipelines, and template-matching logic, you deploy one multimodal model as shared infrastructure. The same model supports:
Operational Tradeoffs (Honest Assessment)
Local inference isn't free. You're trading cloud convenience for control. Here's what that looks like in practice:
| Dimension | Local Inference | Cloud API |
|---|---|---|
| Data privacy | Full control, nothing leaves your network | Depends on vendor policies |
| Hardware | GPU recommended (16GB+ VRAM) | None required |
| Maintenance | You own model updates, monitoring, tuning | Vendor-managed |
| Cost model | Fixed (hardware), predictable | Variable (per-token), can spike |
| Compliance | Simplified—no third-party DPA needed | Requires vendor security review |
| Scaling | Limited by hardware, add GPUs to scale | Effectively unlimited |
For teams processing hundreds or low thousands of invoices per month with sensitive financial data, local inference is often the better fit. If you're processing millions of documents and data sensitivity isn't a concern, cloud APIs may make more sense.
Try It Yourself
The fastest way to get running is the Docker image. It packages the entire web app—you just need a local LLM endpoint (like LM Studio) running alongside it.
Pull the image
docker pull klaushofenbitzer/invoice_processingRun the container
docker run -p 8501:8501 \
--add-host=host.docker.internal:host-gateway \
klaushofenbitzer/invoice_processingThe --add-host flag lets the container reach LM Studio on your host machine.
Open the app
Navigate to http://localhost:8501 — drag in an invoice and go.
That's it. Three commands, and you have a working invoice extraction system running entirely on your local machine.
Want to customize, extend the schema, or dig into the code? The full source is on GitHub:
The Shift That Matters
We're moving into a world where organizations deploy internal AI infrastructure rather than just consuming external APIs. Document processing is one of the highest-ROI entry points because the documents are messy, the data is sensitive, the automation value is immediate, and the accuracy requirements are clear and measurable.
Local multimodal models make this possible: AI capabilities inside the security boundary, not outside it.
Invoice extraction happens to be the most obvious place to start. But the architecture—local vision model, schema-driven prompts, structured output, human review—applies far more broadly. If you have documents that humans currently read and transcribe, this pattern can probably help.