The brief was deceptively simple: “We have people reviewing PDFs all day. Can AI do it?” The answer was yes, but the path there required solving problems that aren't covered in any LangChain tutorial.
The problem with naive RAG
Our first instinct — retrieve relevant chunks, pass them to GPT-4, extract fields — worked on clean PDFs. It failed on scanned documents, multi-column layouts, and anything that wasn't clean, digitally generated text. Before we could build the AI pipeline, we had to solve the document ingestion problem.
We ended up with a pre-processing layer: PDF → OCR (for scanned docs) → layout analysis → structured text chunks with positional metadata. Only then did the LLM layer produce reliable results.
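To make the shape of that layer concrete, here is a minimal sketch of the PDF → OCR → layout → chunks flow. The library choices (pdf2image, pytesseract) and the crude block-level grouping are illustrative stand-ins for the real OCR and layout-analysis steps, not the project's actual implementation.

```python
# Sketch only: rasterise each page, OCR it, and group word boxes into
# block-level chunks that carry positional metadata.
from dataclasses import dataclass

import pytesseract
from pdf2image import convert_from_path


@dataclass
class Chunk:
    text: str
    page: int
    bbox: tuple[int, int, int, int]  # positional metadata: (left, top, right, bottom)


def preprocess(pdf_path: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    for page_no, image in enumerate(convert_from_path(pdf_path), start=1):
        ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
        # Group OCR words by block number -- a crude stand-in for layout analysis.
        blocks: dict[int, list[int]] = {}
        for i, word in enumerate(ocr["text"]):
            if word.strip():
                blocks.setdefault(ocr["block_num"][i], []).append(i)
        for indices in blocks.values():
            left = min(ocr["left"][i] for i in indices)
            top = min(ocr["top"][i] for i in indices)
            right = max(ocr["left"][i] + ocr["width"][i] for i in indices)
            bottom = max(ocr["top"][i] + ocr["height"][i] for i in indices)
            text = " ".join(ocr["text"][i] for i in indices)
            chunks.append(Chunk(text=text, page=page_no, bbox=(left, top, right, bottom)))
    return chunks
```

The point of the positional metadata is downstream: multi-column layouts only extract reliably once the LLM sees text in reading order, with some notion of where each chunk sat on the page.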
The classification layer
The client handled seven document types — each requiring different extraction logic. Rather than prompting GPT-4 to identify the type (expensive, slow), we fine-tuned a smaller classifier on 2,000 labelled examples. It ran in under 50ms per document and achieved 99.7% accuracy on type classification. GPT-4 only came in for the extraction step, where its reasoning was actually needed.
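The routing pattern looks roughly like the sketch below. The classifier model path, the prompt texts, and the truncation length are placeholders — the real system fine-tuned a small classifier on the labelled examples and reserved GPT-4 for extraction only.

```python
# Illustrative routing sketch, not the production code: a cheap local classifier
# picks the document type, then a type-specific prompt drives GPT-4 extraction.
from openai import OpenAI
from transformers import pipeline

client = OpenAI()
classifier = pipeline("text-classification", model="path/to/finetuned-doc-classifier")

EXTRACTION_PROMPTS = {
    "invoice": "Extract vendor, total amount, currency and due date as JSON.",
    "contract": "Extract the parties, effective date and term as JSON.",
    # ... one prompt per document type (seven in total)
}


def classify_document_type(text: str) -> str:
    # Fast local inference, far cheaper than a GPT-4 round trip.
    return classifier(text[:2000])[0]["label"]


def extract_fields(text: str) -> str:
    doc_type = classify_document_type(text)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPTS[doc_type]},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```

The design choice is simply to spend the expensive model where its reasoning matters and keep the high-volume classification step local and fast.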
Confidence scoring and human-in-the-loop
The hardest engineering problem wasn't extraction — it was knowing when to trust the extraction. We built a confidence scoring system that assessed:
- Field-level extraction confidence from the model's logprob output
- Cross-field consistency checks (do the dates make chronological sense?)
- Document quality score from the OCR layer
Documents below our confidence threshold went to a human review queue — which accounted for 4% of volume. The other 96% were processed automatically.
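To make the three signals concrete, here is a minimal sketch of how field-level logprob confidence, a cross-field date check, and the OCR quality score could fold into one score and a routing decision. The weights, the threshold value, and the field names are illustrative assumptions, not the production values.

```python
# Hedged sketch of the confidence-and-routing layer; weights and threshold are invented.
from dataclasses import dataclass
from datetime import date


@dataclass
class Extraction:
    fields: dict[str, str]
    field_confidence: dict[str, float]  # per-field confidence derived from logprobs, 0..1
    ocr_quality: float                  # document quality score from the OCR layer, 0..1


def dates_in_order(fields: dict[str, str]) -> bool:
    """Cross-field consistency check: the issue date should not fall after the due date."""
    try:
        issued = date.fromisoformat(fields["issue_date"])
        due = date.fromisoformat(fields["due_date"])
        return issued <= due
    except (KeyError, ValueError):
        return False


def overall_confidence(ex: Extraction) -> float:
    weakest_field = min(ex.field_confidence.values(), default=0.0)
    consistency = 1.0 if dates_in_order(ex.fields) else 0.0
    # Illustrative weighting; the real weights were tuned iteratively.
    return 0.5 * weakest_field + 0.3 * consistency + 0.2 * ex.ocr_quality


CONFIDENCE_THRESHOLD = 0.85  # illustrative value, not the production threshold


def route(ex: Extraction) -> str:
    return "auto_process" if overall_confidence(ex) >= CONFIDENCE_THRESHOLD else "human_review"
```

Taking the minimum field confidence rather than the average is one way to ensure a single dubious field is enough to send the whole document to review.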
Infrastructure for scale
At 50,000 documents per day, the pipeline needed to handle spikes. We ran it on AWS ECS with Celery workers backed by a Redis queue. During peak periods, the worker pool auto-scaled to 40 containers. Average end-to-end processing time: under 4 minutes from upload to structured output.
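The queueing layer itself is thin. The sketch below shows the shape under the stated stack (Celery workers with a Redis broker); the broker URL and module layout are placeholders, and the imported functions refer to the sketches earlier in this piece.

```python
# Sketch of the worker entry point; wiring only, not the production configuration.
from celery import Celery

from pipeline import preprocess, extract_fields  # hypothetical module layout

app = Celery("doc_pipeline", broker="redis://redis:6379/0")


@app.task(acks_late=True)  # re-deliver if a worker is lost mid-task during a spike
def process_document(pdf_path: str) -> str:
    chunks = preprocess(pdf_path)                 # OCR + layout analysis
    text = "\n".join(chunk.text for chunk in chunks)
    return extract_fields(text)                   # classification + GPT-4 extraction
```

Enqueueing a document from the upload handler is then a single `process_document.delay(path)` call. The scaling behaviour (the pool peaking at 40 containers) lives in the ECS auto-scaling configuration rather than in the application code.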
What we'd do differently
We'd invest more time upfront in the confidence scoring framework. We got it right eventually, but it was the most iterative part of the build. The failure mode for AI automation isn't dramatic — it's subtle errors that accumulate. A well-designed confidence layer catches them before they become a problem.