The Document Digest by Tensorlake
Tensorlake September Updates: Native PDF Support + Advanced RAG Features

Engineers building document workflows - this one's for you.

Sarah Guthals
September 22, 2025

The document processing landscape is always evolving. OCR accuracy still hovers around 80-85% for clean documents, dropping to 64% for handwritten text, while traditional OCR approaches struggle with complex layouts and multi-column documents. Meanwhile, you're dealing with increasingly complex RAG pipelines that need better structured data extraction.

Here's what we shipped this month to help you build more robust document workflows.

🔧 Enhanced Document Ingestion

New file format support

XML parsing - Handle configuration files, structured data exports
DOC files - Legacy Word documents (not just DOCX)
Native Markdown - Documentation, README files, structured text

Impact: Broader pipeline compatibility, fewer preprocessing steps

→ Read the changelog

⚡ Document Processing Improvements

Native PDF parsing capabilities

Direct PDF text extraction - No more PDF→image→OCR conversion pipeline
Preserves document structure - Maintains layout relationships and text positioning
Faster processing - Eliminates image conversion bottleneck
Higher accuracy - Direct access to embedded text layers when available

Why it matters: Pipeline-based PDF systems often suffer from error propagation between components. Direct parsing eliminates conversion artifacts and improves extraction quality.
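The idea of preferring an embedded text layer and using OCR only as a fallback can be sketched in a few lines. This is an illustrative sketch, not Tensorlake's implementation: the `Page` record, the `extract_page_text` helper, the `min_chars` cutoff, and the `fake_ocr` stand-in are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    """Hypothetical page record; not a Tensorlake type."""
    number: int
    embedded_text: Optional[str]  # text layer read directly from the PDF, if any


def extract_page_text(
    page: Page, ocr: Callable[[Page], str], min_chars: int = 20
) -> tuple[str, str]:
    """Return (text, source), preferring the embedded text layer.

    Falls back to OCR only when the text layer is missing or nearly empty,
    which is typical of scanned pages. min_chars is an assumed cutoff.
    """
    text = (page.embedded_text or "").strip()
    if len(text) >= min_chars:
        return text, "embedded"
    return ocr(page), "ocr"


def fake_ocr(page: Page) -> str:
    # Stand-in for a real OCR call, used only for this illustration.
    return f"<ocr text for page {page.number}>"


pages = [
    Page(1, "Account summary for the period ending August 31."),
    Page(2, None),  # scanned page with no text layer
]
results = [extract_page_text(p, fake_ocr) for p in pages]
```

Recording the source alongside the text makes it easy to audit which pages skipped the image conversion step and which still went through OCR.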

Large table processing fixes

Token limit handling - Long, dense tables in CSV/Excel files can now be processed without hitting token limits
Improved performance - These large tables now process significantly faster than before
Result merging - Seamless reassembly of split table extractions
Accuracy preservation - Maintains relationships across table chunks
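One common way to keep a large table under a token limit is to split its rows into chunks that each carry the header, process the chunks separately, and then merge the results in order. The sketch below shows that general pattern under stated assumptions; it is not Tensorlake's implementation. The helper names (`count_tokens`, `split_table`, `merge_chunks`) are hypothetical, and the whitespace-based token count is a crude stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Crude whitespace-based estimate; a real pipeline would use the model's tokenizer.
    return len(text.split())


def split_table(
    header: list[str], rows: list[list[str]], max_tokens: int
) -> list[list[list[str]]]:
    """Split rows into chunks whose estimated size (header included) fits max_tokens."""
    header_cost = count_tokens(" ".join(header))
    chunks, current, used = [], [], header_cost
    for row in rows:
        cost = count_tokens(" ".join(row))
        # Start a new chunk when adding this row would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], header_cost
        current.append(row)
        used += cost
    if current:
        chunks.append(current)
    return chunks


def merge_chunks(header: list[str], chunks: list[list[list[str]]]) -> list[dict[str, str]]:
    """Reassemble per-chunk rows into records keyed by header, in original order."""
    return [dict(zip(header, row)) for chunk in chunks for row in chunk]


header = ["date", "description", "amount"]
rows = [[f"2025-09-{d:02d}", f"payment {d}", f"{d * 10}.00"] for d in range(1, 8)]
chunks = split_table(header, rows, max_tokens=15)
records = merge_chunks(header, chunks)
```

Because every chunk is interpreted against the same header and the merge preserves row order, column relationships survive the split.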

→ Read the changelog

🎯 Smarter Page Classification

Multi-label classification (new default)

Multiple page types per page - A single page can be labeled both account_info and transactions
Better complex document handling - Bank statements, legal docs, mixed-content pages
Backward compatible - Multi-class still available via config

Technical context: Multi-class assigns exactly one label per instance (labels are mutually exclusive), while multi-label allows several labels at once. For complex financial documents where account info and transaction data appear on the same page, multi-label classification describes the page more accurately than a single forced label.
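The difference is easy to see with per-label scores for one page. The following is a minimal sketch, assuming a classifier that returns an independent score per label; the function names, the scores, and the 0.5 threshold are hypothetical and do not reflect Tensorlake's configuration.

```python
def multi_class(scores: dict[str, float]) -> str:
    """Exactly one label: the highest-scoring class."""
    return max(scores, key=scores.get)


def multi_label(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Zero or more labels: every class whose score clears the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Made-up scores for a bank statement page that mixes two kinds of content.
page_scores = {"account_info": 0.81, "transactions": 0.77, "terms": 0.12}

single = multi_class(page_scores)   # keeps only "account_info"
labels = multi_label(page_scores)   # keeps "account_info" and "transactions"
```

With multi-class, the transactions content on this page would be invisible to any downstream step that filters by page type; multi-label keeps both.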

→ Read the changelog

Classification reasoning output

Prompt engineering insights - Understand model decision-making process
Debug classification errors - See why pages got specific labels
Improve prompt quality - Iterate based on reasoning chains

→ Read the changelog

📝 Enhanced Summarization Control

Optional full-page image summaries

Spatial context preservation - Understand fragment positions on complex layouts
Hallucination reduction - Choose between fragment-level and full-page summarization
Layout-aware processing - Better handling of multi-column, form-based documents

When to use: Complex forms where spatial relationships matter (e.g., insurance claims with signature locations, technical diagrams with callouts).

→ Read the changelog

🔧 Production Reliability

Bug fixes & improvements

Schema validation - Stricter input/output validation
Citation page filtering - Fixed issue where page class limits weren't applied with citations enabled (Read the changelog)
Error messaging - More descriptive timeout and token limit errors
Token management - Better handling of edge cases in large document processing

Try these updates

Try Tensorlake - Test new features with your documents
Schedule a call with us - Questions about implementation? Get hands-on help applying Tensorlake to your pipeline

Built for ML engineers who need document processing that actually works in production.