Turning Document Chaos into Structured Gold

Most LLMs choke on real-world documents: PDFs with 100+ pages, spreadsheets packed with edge cases, or forms where the layout is half the meaning. And if you’ve ever tried to extract nested data from these formats, you know that it’s rarely as simple as “just call an API.”

Since launching Document AI, we’ve helped real-world businesses structure the unstructured. We’ve learned:

  • LLMs struggle with deeply nested fields or schemas with 100+ data points

  • Accuracy tanks when the critical info is buried on page 86 of a 100-page doc

So we shipped our most powerful structured extraction release yet:

  • Page Classification + Filtering — Pre-filter irrelevant pages to boost accuracy by 10–15% on large documents

  • Arrayed Schemas — Define multiple schemas independently in one call, making complex data extraction more accurate without added complexity

Coming soon: Citations for extracted fields, so you’ll know exactly where each value came from.

This edition of the Document Digest dives into how we’re making structured extraction smarter, more scalable, and LLM-native.

Got a gnarly document you’re trying to parse? Book a time and help shape what’s next.

Structured Extraction with Page Classification

To extract structured data from only the relevant pages of a document, use the page_classes option in your structured extraction request. This allows you to reference page classes you previously defined and only run schema extraction on pages that match.

Try this out in this Colab notebook.

Example

  1. Define your page classes.

  2. Define your structured data extraction schemas.

  3. Parse your document.

  4. Use the results.
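
Here's a minimal sketch of those four steps with the Python SDK. Treat the specific names (DocumentAI, PageClassConfig, StructuredExtractionOptions, parse_and_wait) as assumptions based on recent SDK versions rather than the authoritative API; the Colab notebook and docs have the exact imports and signatures.

```python
# Sketch only: the client and option class names below are assumptions; check the
# docs/Colab for the exact imports and signatures in your SDK version.
from pydantic import BaseModel

from tensorlake.documentai import DocumentAI  # assumed client entry point
from tensorlake.documentai.models import (  # assumed config models
    PageClassConfig,
    StructuredExtractionOptions,
)

# 1. Define your page classes: a name plus a natural-language description.
page_classifications = [
    PageClassConfig(
        name="signature_page",
        description="Pages containing signature blocks or party signatures",
    ),
    PageClassConfig(
        name="terms_and_conditions",
        description="Pages describing contractual terms and conditions",
    ),
]

# 2. Define your structured extraction schemas (Pydantic models here; JSON Schema
#    also works) and scope each schema to the page classes it should run on.
class Signature(BaseModel):
    signer_name: str
    date_signed: str

class Terms(BaseModel):
    payment_terms: str
    termination_clause: str

structured_extraction_options = [
    StructuredExtractionOptions(
        schema_name="signatures",
        json_schema=Signature,
        page_classes=["signature_page"],  # only extract from matching pages
    ),
    StructuredExtractionOptions(
        schema_name="terms",
        json_schema=Terms,
        page_classes=["terms_and_conditions"],
    ),
]

# 3. Parse your document: classification, filtering, and extraction in one call.
doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_API_KEY")
result = doc_ai.parse_and_wait(
    file="path/to/contract.pdf",
    page_classifications=page_classifications,
    structured_extraction_options=structured_extraction_options,
)

# 4. Use the results: structured data comes back grouped per schema.
for extraction in result.structured_data:
    print(extraction.schema_name, extraction.data)
```

Because each schema carries its own page_classes filter, the model only ever sees the pages relevant to that schema, which is where the accuracy gain on long documents comes from.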

In this example, the extractor first classifies each page and only applies your schemas to pages labeled "signature_page" or "terms_and_conditions".

This improves performance and accuracy, especially in documents with 30+ pages, by reducing irrelevant noise for the LLM.

Read the announcement on our blog (Colab notebook included) or get started with our docs.

Try our latest integrations

  • LangChain x Tensorlake

  • Qdrant x Tensorlake

Tensorlake TL;DR

Page Classifications

  • Pre-filter irrelevant pages to boost accuracy by 10–15% on large documents

  • Arrayed Schemas: define multiple schemas independently in one call, making complex data extraction more accurate without added complexity

  • Read the announcement (Colab notebook included)

API v2 Released

  • Supports multiple Structured Data Extraction schemas plus Page Classification in a single parsing API call.

  • Enables extraction partitioned by page, section, or fragment, so multi-section documents are handled cleanly (see the sketch after this list).

  • Check out the full API reference in our docs

  • Make sure you’re running the latest version of the Python SDK
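
For the partitioned extraction mentioned above, here's a hypothetical sketch. The partition option name and its values are illustrative assumptions, not confirmed API, so check the API v2 reference in the docs before relying on them.

```python
# Hypothetical sketch: the `partition` option and its values are assumptions made
# for illustration; see the API v2 reference for the real parameter.
from pydantic import BaseModel

from tensorlake.documentai.models import StructuredExtractionOptions  # assumed import

class LineItem(BaseModel):
    description: str
    amount: float

# Ask for one extraction per page (or section/fragment) instead of one result for
# the whole document, so multi-section documents stay cleanly separated.
per_page_items = StructuredExtractionOptions(
    schema_name="invoice_line_items",
    json_schema=LineItem,
    partition="page",  # assumed values: "page", "section", "fragment"
)
```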

Major Model Improvements

  • Improved accuracy for checkbox detection, complex table parsing, handwriting, strikethroughs, signatures, and mixed-media content.

Expanded Format Support

  • Tensorlake DocAI now robustly handles spreadsheets, CSVs, presentations, Word documents, PDFs, raw text, and images.

Why does this matter for developers?

  • Cleaner inputs for RAG, agent pipelines, vector databases, and ETL, so downstream systems act on trustworthy structured fields and layouts, not just raw blobs.

  • Single API call now handles extraction + classification + layout + markdown chunking. No brittle multi-step orchestration.

  • Multi-format support saves devs from custom ingestion hacks for Excel, scans, and hybrid docs.

Until next time, happy parsing, extracting, and powering smarter agents with cleaner data.

—Sarah and the Tensorlake Team

The Tensorlake TL;DR Table

What We Launched        | How It Helps Devs
------------------------|--------------------------------------------------------------------------
API v2                  | More features, improved developer experience
Page Classification     | Pre-filter irrelevant pages to boost accuracy by 10–15% on large documents
LangChain Tool          | Faster agent embedding pipelines & doc parsing
Qdrant Integration      | Structured parse → vector search in one flow
Model upgrades          | Better detection of checkboxes, tables, handwriting
Expanded format support | Reliable ingestion across all common document types