A leading private sector bank was running every document through large LLMs — expensive, slow, and unnecessary for the majority of structured inputs. BootLabs built an intelligent pipeline that classifies first and routes each document to the most cost-appropriate model.
The bank processed 15+ document types daily: KYC forms, loan applications, bank statements, salary slips, utility bills, cheques, trade finance documents, account opening forms, IT returns, property papers, correspondence letters, and more. Each type had a different structure and layout and required different downstream processing, yet all of them were being routed through large LLMs, an approach that was expensive, slow, and over-engineered for structured documents that far simpler models could handle. BootLabs built a universal document processing pipeline that classifies each document first, then routes it to the most cost-appropriate model: rule-based extraction for structured documents, fine-tuned lightweight models for semi-structured ones, and LLMs only for complex unstructured documents. The result was a 60%+ reduction in AI processing costs while maintaining 95%+ extraction accuracy across all document categories.
15+ document types with completely different layouts, structures, and extraction requirements — KYC forms, cheques, loan applications, trade finance documents, legal correspondence, and more. No single model or approach could handle all of them well, yet the bank had no routing logic in place.
Running every document through a large LLM was expensive at the bank's volumes, yet 60–70% of documents were structured enough to need no LLM processing at all: standard forms, cheques, and salary slips could be handled with deterministic extraction at a fraction of the cost. The bank was paying premium AI prices for commodity data extraction.
Regulatory compliance demanded high extraction accuracy and a full audit trail — every field, every document, traceable to source. The existing pipeline produced no confidence scores, no provenance pointers, and no mechanism for flagging low-confidence extractions for human review.
An initial classifier (a lightweight CNN combined with rule-based heuristics) identifies the document type from layout features, headers, and content signals. This classification step gates everything downstream, ensuring each document is routed to the right extraction path from the start. No LLM is involved at this stage.
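A minimal sketch of this classify-first gate, assuming PyTorch: a cheap rule pass handles obvious cases from OCR'd header text, and a small CNN over a page thumbnail covers the rest. The document types, keywords, and model shape here are illustrative assumptions, not the bank's actual configuration.

```python
import re
import torch
import torch.nn as nn

DOC_TYPES = ["kyc_form", "cheque", "salary_slip", "bank_statement", "letter"]

# Hypothetical header keywords that short-circuit the CNN for obvious cases.
HEADER_RULES = {
    re.compile(r"know your customer|kyc", re.I): "kyc_form",
    re.compile(r"pay to the order of|payee", re.I): "cheque",
    re.compile(r"salary|earnings|deductions", re.I): "salary_slip",
}

class TinyLayoutCNN(nn.Module):
    """Small CNN over a grayscale page thumbnail (1 x 128 x 128)."""
    def __init__(self, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))

def classify(header_text: str, thumbnail: torch.Tensor, model: TinyLayoutCNN) -> str:
    # 1. Cheap rule-based pass on the header text extracted by OCR.
    for pattern, doc_type in HEADER_RULES.items():
        if pattern.search(header_text):
            return doc_type
    # 2. Fall back to the CNN over layout features of the page thumbnail.
    with torch.no_grad():
        probs = model(thumbnail.unsqueeze(0)).softmax(dim=-1)
    return DOC_TYPES[int(probs.argmax())]
```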
OCR (Tesseract with custom post-processing), layout analysis, table detection, and field extraction normalise all inputs before model routing. The pipeline handles scanned documents, photographed copies, and digital-native files — producing a consistent, structured representation regardless of input format.
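A sketch of the normalisation step using pytesseract (the Python Tesseract wrapper) and Pillow. The NormalisedDoc shape is an assumption for illustration; the real pipeline layers layout analysis and table detection on top of this token stream.

```python
from dataclasses import dataclass, field
from PIL import Image
import pytesseract

@dataclass
class Token:
    text: str
    bbox: tuple   # (left, top, width, height) in pixels
    conf: float   # OCR confidence, 0-100

@dataclass
class NormalisedDoc:
    source: str
    tokens: list = field(default_factory=list)

def normalise(path: str) -> NormalisedDoc:
    """OCR one page into a consistent token stream, whatever the input format."""
    image = Image.open(path).convert("L")  # grayscale generally helps Tesseract
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    doc = NormalisedDoc(source=path)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip empty OCR cells
        doc.tokens.append(Token(
            text=word,
            bbox=(data["left"][i], data["top"][i],
                  data["width"][i], data["height"][i]),
            conf=float(data["conf"][i]),
        ))
    return doc
```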
Structured documents (standard forms, cheques) are handled with rule-based extraction. Semi-structured documents (bank statements, salary slips) go to fine-tuned BERT/LayoutLM models. Complex unstructured documents (letters, legal docs, trade finance) are routed to an LLM with RAG grounding. LLM usage drops to approximately 30% of total volume.
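With classification done up front, the routing itself can be a simple lookup over the classifier's label. The tier membership and extractor stubs below are assumptions for illustration; the production mapping covers all 15+ types.

```python
# Illustrative tier assignments; the real mapping is broader.
RULE_BASED = {"kyc_form", "cheque", "account_opening_form"}
FINE_TUNED = {"bank_statement", "salary_slip", "utility_bill"}
# Everything else (letters, legal documents, trade finance) falls through
# to the LLM tier.

def extract_with_rules(doc: dict) -> dict:
    """Deterministic extraction: fixed templates, regexes, positional fields."""
    return {"tier": "rules", "fields": {}}  # stub

def extract_with_layoutlm(doc: dict) -> dict:
    """Fine-tuned BERT/LayoutLM inference over the normalised token stream."""
    return {"tier": "layoutlm", "fields": {}}  # stub

def extract_with_llm(doc: dict) -> dict:
    """LLM extraction grounded with retrieved context (RAG) - the costly tier."""
    return {"tier": "llm", "fields": {}}  # stub

def route(doc_type: str, doc: dict) -> dict:
    if doc_type in RULE_BASED:
        return extract_with_rules(doc)
    if doc_type in FINE_TUNED:
        return extract_with_layoutlm(doc)
    return extract_with_llm(doc)
```

Keeping the router a pure function of the classifier's label also makes the cost profile easy to audit: LLM spend is exactly the volume of documents that reach the final branch.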
Every extracted field carries a confidence score and a provenance pointer back to the exact region of the source document it was derived from. Low-confidence extractions are automatically flagged for human review. The full trace is stored for compliance — every extraction is explainable and auditable end to end.
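A sketch of what such a per-field audit record might look like. The field names, bounding-box convention, and review threshold are illustrative assumptions.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # assumed cutoff; in practice tuned per field type

@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float   # model or rule confidence in [0, 1]
    source_doc: str     # document identifier
    source_bbox: tuple  # (page, left, top, width, height) provenance pointer
    extractor: str      # which tier produced it: "rules" / "layoutlm" / "llm"

    @property
    def needs_review(self) -> bool:
        return self.confidence < REVIEW_THRESHOLD

rec = ExtractedField(
    name="ifsc_code", value="ABCD0001234", confidence=0.62,
    source_doc="loan_app_0042.pdf", source_bbox=(1, 120, 340, 210, 18),
    extractor="layoutlm",
)
assert rec.needs_review  # 0.62 < 0.85, so this field is queued for a human
```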
The classification-first approach was the key insight: rather than trying to build a single model that handled every document type equally, the pipeline treats classification as a first-class routing decision. LLMs are powerful but expensive; using one on a standard cheque or a templated salary slip is wasteful. By reserving LLM capacity for the genuinely complex, unstructured documents, the bank achieved better accuracy at dramatically lower cost, and the audit trail gave compliance teams confidence that the entire process was traceable and defensible.
Document processing throughput increased fourfold with no additional operational headcount. The pipeline handles the full classification-to-extraction workflow automatically, with human reviewers only pulled in for genuinely ambiguous cases flagged by confidence scoring.
Every extraction is now traceable and explainable — confidence scores, provenance pointers, and a full processing trace are stored for every document. Audit requests that previously required manual reconstruction can now be answered instantly from the pipeline's own logs.
KYC and loan document processing had been a bottleneck for onboarding cycle times. With the accelerated pipeline, document-related delays in the onboarding journey were substantially reduced, a direct improvement to customer experience and revenue velocity.
Large models are now used only where they genuinely add value — complex unstructured documents where layout, context, and reasoning matter. Structured and semi-structured documents are handled by cheaper, faster, purpose-built models. The bank has a principled, cost-efficient AI stack rather than a single expensive default.
Tell us about your challenge and we'll set up a focused 30-minute session.