Field note: the three edge cases that break document pipelines
Three things break document pipelines in production. A generic OCR pass handles none of them. We've seen all three more than once.
If your pipeline is being benchmarked against a clean test set and accuracy looks great, congratulations: you've validated a demo. The cases that follow are what production looks like.
1. Stamps and overlays on top of text
A 'received' stamp covering a date. A redaction box across half a paragraph. A signature looped through three lines of an address. Generic OCR sees one layer; the page has two.
What we do: a three-pass approach. The first pass identifies overlay regions (stamps, redactions, signatures) and masks them. The second pass extracts text from the unmasked regions. The third pass attempts to recover the partially occluded content from contextual fields elsewhere in the document, then cross-references the recovered value against the entity-resolution graph. If the recovered value disagrees with what's elsewhere, route to a human. If the value doesn't appear anywhere else, the field is marked 'occluded' rather than 'missing'; those are different states downstream.
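To make the 'occluded' versus 'missing' distinction concrete, here is a minimal sketch of the decision at the end of the third pass. Everything in it is an illustrative assumption rather than our production code: the names `FieldState`, `FieldResult`, and `resolve_field` are hypothetical, and `candidates` stands in for whatever values the contextual fields and the entity-resolution graph return for the same field.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FieldState(Enum):
    EXTRACTED = "extracted"        # read directly from an unmasked region
    RECOVERED = "recovered"        # covered by an overlay, confirmed elsewhere
    OCCLUDED = "occluded"          # covered by an overlay, no other source
    MISSING = "missing"            # not covered, and nothing was there to read
    NEEDS_REVIEW = "needs_review"  # other sources disagree: route to a human


@dataclass
class FieldResult:
    name: str
    value: Optional[str]
    state: FieldState


def resolve_field(
    name: str,
    direct_value: Optional[str],
    is_occluded: bool,
    candidates: List[str],
) -> FieldResult:
    """Decide the final state of one field after the three passes.

    `candidates` is assumed to hold values for this field found elsewhere
    in the document or in the entity-resolution graph.
    """
    if not is_occluded:
        # No overlay on this field: either we read it, or it is truly absent.
        if direct_value:
            return FieldResult(name, direct_value, FieldState.EXTRACTED)
        return FieldResult(name, None, FieldState.MISSING)

    distinct = sorted({c.strip() for c in candidates if c.strip()})
    if not distinct:
        # Something is under the overlay, but nothing else can confirm it.
        return FieldResult(name, None, FieldState.OCCLUDED)
    if len(distinct) == 1:
        return FieldResult(name, distinct[0], FieldState.RECOVERED)
    # Conflicting values elsewhere: do not guess.
    return FieldResult(name, None, FieldState.NEEDS_REVIEW)


if __name__ == "__main__":
    stamped = resolve_field("received_date", None, True, [])
    print(stamped.state.value)
```

Keeping the states in one enum means downstream consumers have to handle 'occluded' explicitly instead of treating it as an empty value.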
2. Mixed-language paragraphs
A US filing with a paragraph quoted in German. A Spanish-language contract with English boilerplate at the bottom. A scanned letter where one sentence switches mid-line because the writer borrowed a phrase.
Single-language LLMs misclassify or skip these. Single-language OCR is worse — it'll silently drop characters that don't fit the alphabet it expects. The fix isn't a fancier model. It's segmentation: detect language at the span level (not the page level), route each span to the right extractor, then reassemble.
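The detect, route, reassemble loop can be sketched in a few lines. This is a toy illustration under stated assumptions: the stopword-based `detect_language` is a stand-in for a real language identifier, spans are split at sentence boundaries (a production system would also need to catch switches inside a sentence), and the function names and the `extractors` mapping are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# Tiny stopword lists, for illustration only; not a real language model.
STOPWORDS: Dict[str, set] = {
    "en": {"the", "and", "of", "is", "to", "this", "that", "with"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "ein"},
    "es": {"el", "la", "los", "y", "que", "por", "del", "una"},
}


def detect_language(text: str, fallback: str) -> str:
    """Pick the language whose stopwords occur most often in `text`."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no evidence at all, assume the language has not changed.
    return best if scores[best] > 0 else fallback


def segment_by_language(text: str, default: str = "en") -> List[Tuple[str, str]]:
    """Split text into (language, span) pairs, merging adjacent spans."""
    spans: List[Tuple[str, str]] = []
    current = default
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        current = detect_language(sentence, fallback=current)
        if spans and spans[-1][0] == current:
            spans[-1] = (current, spans[-1][1] + " " + sentence)
        else:
            spans.append((current, sentence))
    return spans


def extract_all(
    text: str, extractors: Dict[str, Callable[[str], str]]
) -> List[Tuple[str, str]]:
    """Route each span to its language's extractor, keeping document order."""
    return [(lang, extractors[lang](span)) for lang, span in segment_by_language(text)]


if __name__ == "__main__":
    sample = (
        "The filing is attached. "
        "Der Vertrag ist nicht mit dem Anhang verbunden. "
        "This is the end of the letter."
    )
    for lang, span in segment_by_language(sample):
        print(lang, "|", span)
```

Because the spans keep their original order, reassembly is just concatenating the per-span results.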
3. Visual tables that aren't tables
A row of indented bullets that someone formatted to look like a table. A two-column layout where the left column is figure captions and the right column is unrelated body text. A 'table' with no borders, just whitespace alignment.
Layout-aware models help, but they only get you so far. The reliable fix is to convert anything table-shaped into a graph of cells with explicit relationships — header, sibling, sub-row — and then ask the model to reason over the graph, not the visual structure. Tables that aren't tables semantically (the bulleted-list-pretending-to-be-a-table case) get marked as such and excluded from table extraction entirely. They go through the regular paragraph pipeline.
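As a rough illustration of the cell graph, the sketch below turns a whitespace-aligned grid into explicit header, sibling, and sub-row edges. It rests on assumptions of our choosing for the example: the first row is the header, indentation in the first column signals a sub-row, and the names `Cell`, `build_cell_graph`, and `to_triples` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Cell:
    row: int
    col: int
    text: str


Edge = Tuple[Cell, str, Cell]  # (source, relation, target)


def build_cell_graph(rows: List[List[str]]) -> List[Edge]:
    """Build header / sibling / sub_row edges from a grid of raw cell strings.

    Assumes rows[0] is the header row and that leading spaces in the first
    column of a body row mark it as a sub-row of the nearest less-indented row.
    """
    header = [Cell(0, c, text.strip()) for c, text in enumerate(rows[0])]
    edges: List[Edge] = []
    parents: List[Tuple[int, Cell]] = []  # stack of (indent, first cell of row)

    for r, raw in enumerate(rows[1:], start=1):
        if not raw:
            continue
        cells = [Cell(r, c, text.strip()) for c, text in enumerate(raw)]
        for c, cell in enumerate(cells):
            if c < len(header):
                edges.append((cell, "header", header[c]))
            if c + 1 < len(cells):
                edges.append((cell, "sibling", cells[c + 1]))

        indent = len(raw[0]) - len(raw[0].lstrip())
        # Drop rows that are at the same depth or deeper than this one.
        while parents and parents[-1][0] >= indent:
            parents.pop()
        if parents:
            edges.append((cells[0], "sub_row", parents[-1][1]))
        parents.append((indent, cells[0]))
    return edges


def to_triples(edges: List[Edge]) -> List[str]:
    """Serialize edges as plain-text triples a model can reason over."""
    return [f"{src.text} --{rel}--> {dst.text}" for src, rel, dst in edges]


if __name__ == "__main__":
    grid = [
        ["Item", "Amount"],
        ["Fees", "120"],
        ["  Filing", "80"],
        ["  Courier", "40"],
        ["Taxes", "15"],
    ]
    print("\n".join(to_triples(build_cell_graph(grid))))
```

The model then sees statements like "Filing is a sub-row of Fees" directly, instead of having to infer them from indentation.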
Edge cases like these are the difference between a working demo and a production system. They're also what eats the budget if you don't plan for them — every one of these patterns took us a couple of weeks to solve the first time and an afternoon to recognize the second.
Have a workflow that fits the patterns above?
Thirty minutes, no slideware. We'll tell you honestly whether AI fits and where it doesn't.
Book a working session →