Tensorlake September Updates: Native PDF Support + Advanced RAG Features

Engineers building document workflows - this one's for you.

Document processing remains a moving target. OCR accuracy still hovers around 80-85% for clean documents and drops to 64% for handwritten text, and traditional OCR approaches struggle with complex layouts and multi-column documents. Meanwhile, RAG pipelines keep getting more complex and need better structured data extraction.

Here's what we shipped this month to help you build more robust document workflows.

🔧 Enhanced Document Ingestion

New file format support

  • XML parsing - Handle configuration files, structured data exports

  • DOC files - Legacy Word documents (not just DOCX)

  • Native Markdown - Documentation, README files, structured text

Impact: Broader pipeline compatibility, fewer preprocessing steps

⚡ Document Processing Improvements

Native PDF parsing capabilities

  • Direct PDF text extraction - No more PDF→image→OCR conversion pipeline

  • Preserves document structure - Maintains layout relationships and text positioning

  • Faster processing - Eliminates image conversion bottleneck

  • Higher accuracy - Direct access to embedded text layers when available

Why it matters: Pipeline-based PDF systems often suffer from error propagation between components. Direct parsing eliminates conversion artifacts and improves extraction quality.
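To make the idea concrete, here is a minimal sketch of the "use the embedded text layer when available, otherwise fall back to OCR" decision. It is an illustration only, not Tensorlake's implementation: the `Page` type, the `extract_page_text` function, the `min_chars` cutoff, and the `ocr` callable are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Page:
    """Hypothetical page record used only for this sketch."""
    number: int
    embedded_text: Optional[str]  # text layer stored in the PDF, if any
    image_bytes: bytes = b""      # rendered page image, needed only for OCR


def extract_page_text(
    page: Page,
    ocr: Callable[[bytes], str],
    min_chars: int = 20,  # assumed cutoff for "the text layer is usable"
) -> Tuple[str, str]:
    """Return (text, source), where source is "embedded" or "ocr"."""
    text = (page.embedded_text or "").strip()
    if len(text) >= min_chars:
        # Direct path: no image rendering, no OCR errors to propagate.
        return text, "embedded"
    # Fallback path: scanned pages with no (or nearly empty) text layer.
    return ocr(page.image_bytes), "ocr"
```

A digitally generated page takes the direct path, while a scanned page with no text layer still reaches OCR, so the image conversion cost is only paid where it is needed.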

Large table processing fixes

  • Token limit handling - Dense and long tables in CSV/Excel files can now be processed without hitting token limits

  • Improved performance - Large tables now process significantly faster than before

  • Result merging - Seamless reassembly of split table extractions

  • Accuracy preservation - Maintains relationships across table chunks
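The split-and-merge approach described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual Tensorlake pipeline: the four-characters-per-token estimate, the function names, and the `extract` callback are all hypothetical. Each row is a dict keyed by column name, so column context travels with every chunk.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def estimate_tokens(row: Row) -> int:
    # Rough assumption: about 4 characters per token.
    chars = sum(len(key) + len(value) for key, value in row.items())
    return max(1, chars // 4)


def split_rows(rows: List[Row], max_tokens: int) -> List[List[Row]]:
    """Group consecutive rows into chunks that fit the token budget."""
    chunks: List[List[Row]] = []
    current: List[Row] = []
    used = 0
    for row in rows:
        cost = estimate_tokens(row)
        # Start a new chunk when adding this row would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(row)
        used += cost
    if current:
        chunks.append(current)
    return chunks


def process_table(
    rows: List[Row],
    max_tokens: int,
    extract: Callable[[List[Row]], List[Row]],
) -> List[Row]:
    """Run `extract` per chunk and merge results in original row order."""
    merged: List[Row] = []
    for chunk in split_rows(rows, max_tokens):
        merged.extend(extract(chunk))
    return merged
```

Because chunks are built from consecutive rows and merged in the same order, the reassembled output keeps the original row order. A single row that exceeds the budget on its own still gets its own chunk rather than being dropped.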

🎯 Smarter Page Classification

Multi-label classification (new default)

  • Multiple page types per page - A single page can be both account_info and transactions

  • Better complex document handling - Bank statements, legal docs, mixed-content pages

  • Backward compatible - Multi-class still available via config

Technical context: Multi-class assigns exactly one label per instance (mutually exclusive), while multi-label allows multiple simultaneous labels. For complex financial documents where account info and transaction data overlap on pages, multi-label classification provides more accurate semantic understanding.
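The difference is easy to see in a few lines of code. The sketch below is generic and does not reflect how Tensorlake classifies pages internally: the per-label scores, the `terms` label, and the 0.5 threshold are made up for illustration.

```python
from typing import Dict, List


def multi_class(scores: Dict[str, float]) -> str:
    """Mutually exclusive: keep only the single highest-scoring label."""
    return max(scores, key=lambda label: scores[label])


def multi_label(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Independent labels: keep every label that clears the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical scores for one bank-statement page.
page_scores = {"account_info": 0.81, "transactions": 0.77, "terms": 0.05}

print(multi_class(page_scores))  # account_info
print(multi_label(page_scores))  # ['account_info', 'transactions']
```

With multi-class, the transactions content on this page is lost because only one label survives. With multi-label, both labels are kept, so downstream extraction can run for each page type.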

Classification reasoning output

  • Prompt engineering insights - Understand model decision-making process

  • Debug classification errors - See why pages got specific labels

  • Improve prompt quality - Iterate based on reasoning chains

📝 Enhanced Summarization Control

Optional full-page image summaries

  • Spatial context preservation - Understand fragment positions on complex layouts

  • Hallucination reduction - Choose between fragment-level and full-page summarization

  • Layout-aware processing - Better handling of multi-column, form-based documents

When to use: Complex forms where spatial relationships matter (e.g., insurance claims with signature locations, technical diagrams with callouts).

🔧 Production Reliability

Bug fixes & improvements

  • Schema validation - Stricter input/output validation

  • Citation page filtering - Fixed an issue where page class limits weren't applied when citations were enabled (Read the changelog)

  • Error messaging - More descriptive timeout and token limit errors

  • Token management - Better handling of edge cases in large document processing

Try these updates

  • Try Tensorlake - Test new features with your documents

  • Schedule a call with us - Questions on implementation? Get hands-on help leveraging all Tensorlake has to offer for your pipeline

Built for ML engineers who need document processing that actually works in production.