Turning Document Chaos into Structured Gold

Most LLMs choke on real-world documents: PDFs with 100+ pages, spreadsheets packed with edge cases, or forms where the layout is half the meaning. And if you’ve ever tried to extract nested data from these formats, you know that it’s rarely as simple as “just call an API.”

Since launching Document AI, we’ve helped real-world businesses structure the unstructured. We’ve learned:

  • LLMs struggle with deeply nested fields or schemas with 100+ data points

  • Accuracy tanks when the critical info is buried on page 86 of a 100-page doc

So we shipped our most powerful structured extraction release yet:

  • Page Classification + Filtering — Pre-filter irrelevant pages to boost accuracy by 10–15% on large documents

  • Arrayed Schemas — Define multiple schemas independently in one call, making complex data extraction more accurate without added complexity

Coming soon: Citations for extracted fields, so you’ll know exactly where each value came from.

This edition of the Document Digest dives into how we’re making structured extraction smarter, more scalable, and LLM-native.

Got a gnarly document you’re trying to parse? Book a time and help shape what’s next.

Structured Extraction with Page Classification

To extract structured data from only the relevant pages of a document, use the page_classes option in your structured extraction request. This allows you to reference page classes you previously defined and only run schema extraction on pages that match.

Try this out in this Colab notebook.

Example

  1. Define your page classes.

  2. Define your structured data extraction schemas.

  3. Parse your document.

  4. Use the results.
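
Here's a minimal sketch of those four steps with the Python SDK. Treat the specific names (DocumentAI, PageClassConfig, StructuredExtractionOptions, parse_and_wait) as assumptions based on recent SDK versions rather than the authoritative API; the Colab notebook and docs have the exact imports and signatures.

```python
# Sketch only: the client and option class names below are assumptions; check the
# docs/Colab for the exact imports and signatures in your SDK version.
from pydantic import BaseModel

from tensorlake.documentai import DocumentAI  # assumed client entry point
from tensorlake.documentai.models import (  # assumed config models
    PageClassConfig,
    StructuredExtractionOptions,
)

# 1. Define your page classes: a name plus a natural-language description.
page_classifications = [
    PageClassConfig(
        name="signature_page",
        description="Pages containing signature blocks or party signatures",
    ),
    PageClassConfig(
        name="terms_and_conditions",
        description="Pages describing contractual terms and conditions",
    ),
]

# 2. Define your structured extraction schemas (Pydantic models here; JSON Schema
#    also works) and scope each schema to the page classes it should run on.
class Signature(BaseModel):
    signer_name: str
    date_signed: str

class Terms(BaseModel):
    payment_terms: str
    termination_clause: str

structured_extraction_options = [
    StructuredExtractionOptions(
        schema_name="signatures",
        json_schema=Signature,
        page_classes=["signature_page"],  # only extract from matching pages
    ),
    StructuredExtractionOptions(
        schema_name="terms",
        json_schema=Terms,
        page_classes=["terms_and_conditions"],
    ),
]

# 3. Parse your document: classification, filtering, and extraction in one call.
doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_API_KEY")
result = doc_ai.parse_and_wait(
    file="path/to/contract.pdf",
    page_classifications=page_classifications,
    structured_extraction_options=structured_extraction_options,
)

# 4. Use the results: structured data comes back grouped per schema.
for extraction in result.structured_data:
    print(extraction.schema_name, extraction.data)
```

Because each schema carries its own page_classes filter, the model only ever sees the pages relevant to that schema, which is where the accuracy gain on long documents comes from.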

In this example, the extractor first classifies each page and only applies your schemas to pages labeled "signature_page" or "terms_and_conditions".

This improves performance and accuracy, especially in documents with 30+ pages, by reducing irrelevant noise for the LLM.

Read the announcement on our blog (Colab notebook included) or get started with our docs.

Try our latest integrations

  • LangChain x Tensorlake

  • Qdrant x Tensorlake

Tensorlake TL;DR

Page Classifications

  • Pre-filter irrelevant pages to boost accuracy by 10–15% on large documents

  • Arrayed Schemas: define multiple schemas independently in one call, making complex data extraction more accurate without added complexity

  • Read the announcement (Colab notebook included)

API v2 Released

  • Supports multiple Structured Data Extraction schemas plus Page Classification in a single parsing API call.

  • Enables extraction partitioned by page, section, or fragment, so multi-section documents are handled cleanly (see the sketch after this list).

  • Check out the full API reference in our docs

  • Make sure you’re running the latest version of the Python SDK
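
For the partitioned extraction mentioned above, here's a hypothetical sketch. The partition option name and its values are illustrative assumptions, not confirmed API, so check the API v2 reference in the docs before relying on them.

```python
# Hypothetical sketch: the `partition` option and its values are assumptions made
# for illustration; see the API v2 reference for the real parameter.
from pydantic import BaseModel

from tensorlake.documentai.models import StructuredExtractionOptions  # assumed import

class LineItem(BaseModel):
    description: str
    amount: float

# Ask for one extraction per page (or section/fragment) instead of one result for
# the whole document, so multi-section documents stay cleanly separated.
per_page_items = StructuredExtractionOptions(
    schema_name="invoice_line_items",
    json_schema=LineItem,
    partition="page",  # assumed values: "page", "section", "fragment"
)
```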

Major Model Improvements

  • Improved accuracy for checkbox detection, complex table parsing, handwriting, strikethroughs, signatures, and mixed-media content.

Expanded Format Support

  • Tensorlake DocAI now robustly handles spreadsheets, CSVs, presentations, Word documents, PDFs, raw text, and images.

Why does this matter for developers?

  • Cleaner inputs for RAG, agent pipelines, vector databases, and ETL, so downstream systems act on trustworthy structured fields and layouts, not just raw blobs.

  • Single API call now handles extraction + classification + layout + markdown chunking. No brittle multi-step orchestration.

  • Multi-format support saves devs from custom ingestion hacks for Excel, scans, and hybrid docs.

Until next time, happy parsing, extracting, and powering smarter agents with cleaner data.

—Sarah and the Tensorlake Team

The Tensorlake TL;DR Table

What We Launched        | How It Helps Devs
------------------------|--------------------------------------------------------------------------
API v2                  | More features, improved developer experience
Page Classification     | Pre-filter irrelevant pages to boost accuracy by 10–15% on large documents
LangChain Tool          | Faster agent embedding pipelines & doc parsing
Qdrant Integration      | Structured parse → vector search in one flow
Model upgrades          | Better detection of checkboxes, tables, handwriting
Expanded format support | Reliable ingestion across all common document types