Tensorlake September Updates: Native PDF Support + Advanced RAG Features
Engineers building document workflows - this one's for you.
The document processing landscape is always evolving. OCR accuracy still hovers around 80-85% for clean documents, dropping to 64% for handwritten text, while traditional OCR approaches struggle with complex layouts and multi-column documents. Meanwhile, you're dealing with increasingly complex RAG pipelines that need better structured data extraction.
Here's what we shipped this month to help you build more robust document workflows.
🔧 Enhanced Document Ingestion
New file format support
XML parsing - Handle configuration files, structured data exports
DOC files - Legacy Word documents (not just DOCX)
Native Markdown - Documentation, README files, structured text
Impact: Broader pipeline compatibility, fewer preprocessing steps
⚡ Document Processing Improvements
Native PDF parsing capabilities
Direct PDF text extraction - No more PDF→image→OCR conversion pipeline
Preserves document structure - Maintains layout relationships and text positioning
Faster processing - Eliminates image conversion bottleneck
Higher accuracy - Direct access to embedded text layers when available
Why it matters: Pipeline-based PDF systems often suffer from error propagation between components. Direct parsing eliminates conversion artifacts and improves extraction quality.
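To make the "text layer first" idea concrete, here is a minimal sketch in Python. It is not Tensorlake's implementation: the function name, the `min_chars` threshold, and the `run_ocr` callback are all hypothetical, and the embedded text is assumed to come from whatever PDF library you already use.

```python
from typing import Callable, Optional, Tuple


def extract_page_text(
    embedded_text: Optional[str],
    run_ocr: Callable[[], str],
    min_chars: int = 20,  # hypothetical cutoff for "this page has a usable text layer"
) -> Tuple[str, str]:
    """Return (text, source) for one page, preferring the embedded text layer."""
    text = (embedded_text or "").strip()
    if len(text) >= min_chars:
        # Direct path: no PDF -> image -> OCR round trip, so no conversion artifacts.
        return text, "text_layer"
    # Fallback for scanned pages with no (or nearly empty) text layer.
    return run_ocr().strip(), "ocr"
```

Passing OCR in as a callback means the expensive image conversion only runs on pages that actually need it.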
Large table processing fixes
Token limit handling - Dense and long tables in CSV/Excel files can now be processed without hitting token limits.
Improved performance - Large tables now process significantly faster than before
Result merging - Seamless reassembly of split table extractions
Accuracy preservation - Maintains relationships across table chunks
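For intuition, splitting a large table under a token budget and merging the pieces back can be sketched as follows. This is an illustration under stated assumptions, not how Tensorlake does it internally: `estimate_tokens` is a crude word-count stand-in for a real tokenizer, and `chunk_table` and `merge_chunks` are hypothetical names.

```python
from typing import List

Row = List[str]


def estimate_tokens(row: Row) -> int:
    # Crude stand-in for a real tokenizer: one token per word, at least one per cell.
    return sum(max(1, len(cell.split())) for cell in row)


def chunk_table(header: Row, rows: List[Row], max_tokens: int) -> List[List[Row]]:
    """Split rows into chunks within the token budget, repeating the header in each."""
    header_cost = estimate_tokens(header)
    chunks: List[List[Row]] = []
    current: List[Row] = [header]
    used = header_cost
    for row in rows:
        cost = estimate_tokens(row)
        # Start a new chunk only if the current one already holds at least one data row.
        if len(current) > 1 and used + cost > max_tokens:
            chunks.append(current)
            current, used = [header], header_cost
        current.append(row)
        used += cost
    if len(current) > 1:
        chunks.append(current)
    return chunks


def merge_chunks(chunks: List[List[Row]]) -> List[Row]:
    """Reassemble chunk results into one table with a single header row."""
    if not chunks:
        return []
    merged: List[Row] = [chunks[0][0]]
    for chunk in chunks:
        merged.extend(chunk[1:])  # skip each chunk's repeated header
    return merged
```

Repeating the header in every chunk keeps column meaning intact across splits. Note that in this sketch a single row larger than the budget still gets its own chunk rather than being split.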
🎯 Smarter Page Classification
Multi-label classification (new default)
Multiple page types per page - A single page can be both account_info AND transactions
Better complex document handling - Bank statements, legal docs, mixed-content pages
Backward compatible - Multi-class still available via config
Technical context: Multi-class assigns exactly one label per instance (mutually exclusive), while multi-label allows multiple simultaneous labels. For complex financial documents where account info and transaction data overlap on pages, multi-label classification provides more accurate semantic understanding.
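The difference is easy to see in a few lines of Python. The per-label confidence scores and the 0.5 threshold below are assumptions for illustration only; this post does not describe how Tensorlake produces labels internally.

```python
from typing import Dict, List


def multi_class(scores: Dict[str, float]) -> str:
    """Multi-class: exactly one label per page, the highest-scoring one."""
    return max(scores, key=lambda label: scores[label])


def multi_label(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label: every label whose score clears the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical scores for a bank statement page that shows both
# account details and a transaction table.
page_scores = {"account_info": 0.81, "transactions": 0.77, "terms": 0.05}

single = multi_class(page_scores)   # only one label survives
several = multi_label(page_scores)  # both overlapping labels are kept
```

With multi-class, the transactions content on this page would be invisible to any downstream step that filters by page class.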
Classification reasoning output
Prompt engineering insights - Understand model decision-making process
Debug classification errors - See why pages got specific labels
Improve prompt quality - Iterate based on reasoning chains
📝 Enhanced Summarization Control
Optional full-page image summaries
Spatial context preservation - Understand fragment positions on complex layouts
Hallucination reduction - Choose between fragment-level vs. full-page summarization
Layout-aware processing - Better handling of multi-column, form-based documents
When to use: Complex forms where spatial relationships matter (e.g., insurance claims with signature locations, technical diagrams with callouts).
🔧 Production Reliability
Bug fixes & improvements
Schema validation - Stricter input/output validation
Citation page filtering - Fixed issue where page class limits weren't applied with citations enabled (Read the changelog)
Error messaging - More descriptive timeout and token limit errors
Token management - Better handling of edge cases in large document processing
Try these updates
Try Tensorlake - Test new features with your documents
Schedule a call with us - Questions on implementation? Get hands-on help leveraging all Tensorlake has to offer for your pipeline
Built for ML engineers who need document processing that actually works in production.