July 5, 2026 · aidocument · agents

What is an AIDocument? One clean shape for agents that read the web

An AIDocument is one stable JSON shape for any web page: Markdown body, title, description, structure, and measured economics. Here is how it works.

Your agent does not want a web page. It wants what the page says. Those are two very different things, and the gap between them is where most agent-reads-the-web code goes to die.

When you fetch a modern URL, you get an HTML document built for a browser and a human: a navigation bar, a cookie-consent banner, three analytics scripts, a newsletter modal, a footer with forty links, and somewhere in the middle, the paragraph your agent actually needed. Feed that raw into a model and you pay tokens for all of it. On a lot of pages the content is a single-digit percentage of the bytes. The rest is chrome.

An AIDocument is the answer to that: one clean, stable shape for any URL, built for a machine reader instead of a browser. This post is the pillar reference for the idea. If you build agents that read the web, this is the shape worth understanding.

The problem: agents get raw HTML

Here is what an agent has to deal with when it fetches a page itself.

First, the bloat. A content page is mostly not content. Navigation, scripts, style blocks, consent overlays, and repeated boilerplate dwarf the actual text. You are paying tokens, latency, and attention budget for markup your model will never cite.

Second, the parsing. Someone has to turn that HTML into text. Readability heuristics work until they meet a site that renders its body with JavaScript, at which point a naive fetch returns a hollow <div id="root"></div> and nothing else. Now you need a headless browser, and you need to decide when to spin one up, how long to wait for the page to settle, and how to strip the consent manager that popped up after render. That is a real infrastructure project, and it is not the project you set out to build.

Third, the inconsistency. Every site is shaped differently. Your extraction code that works on one blog breaks on the next. There is no single contract you can write against, so you write against all of them, forever.
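To make the first two problems concrete, here is a rough sketch that measures how much of an HTML payload is visible text, using only the Python standard library. The names `visible_text` and `text_ratio` are illustrative, not part of any Lyrenth API, and the heuristic is deliberately naive: it skips script and style blocks and counts what is left.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring tags whose content is never shown to a reader."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the human-visible text of an HTML string, joined by spaces."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def text_ratio(html: str) -> float:
    """Fraction of the payload (by characters) that is visible text."""
    if not html:
        return 0.0
    return len(visible_text(html)) / len(html)
```

A JavaScript-rendered shell such as `<div id="root"></div>` followed by a script tag yields an empty string here, which is one cheap signal that a naive fetch did not return the real page.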

The AIDocument collapses all three problems into one shape you read once and reuse everywhere.

The shape: body plus title, description, and structure

An AIDocument is a grouped JSON envelope. The same eight top-level groups come back for every URL and every caller, no matter how the source page was built:

schema: format identity and version. You pin against this.
source: where it came from and how it was fetched. Canonical URL, render mode, status code, freshness policy.
cache: whether the origin was contacted on this call, and whether the body was re-fetched.
identity: title, description, language, detected content type.
content: the cleaned, boilerplate-stripped Markdown. This is the part you feed a model.
structure: headings, links, images, and pre-extracted structured data (JSON-LD).
signals: derived quality signals such as word count, reading time, and whether the page carries structured data.
economics: token and cost economics for this exact page versus ingesting the raw HTML.

The important word is stable. The AIDocument shape is a public contract: additive changes only, never a breaking change without a major-version bump. You write your parsing once and it keeps working. Compare that to writing extraction against the shifting HTML of every site your agent might ever visit.
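One way to write against that contract is a small guard that checks the eight groups and pins the major version. This is a minimal sketch, not an official validator: `check_envelope` is a hypothetical helper, and it assumes the version is a "major.minor" string like the "2.0" shown in the sample response later in this post. Unknown extra groups are tolerated on purpose, since additive changes are allowed.

```python
EXPECTED_GROUPS = (
    "schema", "source", "cache", "identity",
    "content", "structure", "signals", "economics",
)
SUPPORTED_MAJOR = 2  # assumption: the major version this code was written against


def check_envelope(doc: dict, supported_major: int = SUPPORTED_MAJOR) -> list[str]:
    """Return a list of problems; an empty list means the envelope looks usable."""
    problems = [f"missing group: {name}" for name in EXPECTED_GROUPS if name not in doc]

    # Assumed format: "major.minor", e.g. "2.0".
    version = str(doc.get("schema", {}).get("version", ""))
    major = version.split(".", 1)[0]
    if not major.isdigit():
        problems.append(f"unparseable schema version: {version!r}")
    elif int(major) != supported_major:
        problems.append(f"unsupported major version: {major}")
    return problems
```

Run it on the full response rather than a trimmed excerpt, since a trimmed excerpt will naturally be missing groups.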

Markdown, not stripped HTML, is deliberate. It is dense, it preserves headings and lists and links that a model can reason over, and it drops the tags that carry no meaning. It is roughly the highest-signal representation of a page you can hand to a language model.

A real call, with a real response

The endpoint is POST /v1/aidocument. You send a URL, you get an AIDocument back.

curl -X POST https://api.lyrenth.com/v1/aidocument \
  -H "Authorization: Bearer $LYRENTH_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://en.wikipedia.org/wiki/Web_indexing"}'
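The same call can be sketched in Python with only the standard library. The endpoint, headers, and body mirror the curl command above; the function names `build_aidocument_request` and `fetch_aidocument` are illustrative, and error handling is left out to keep the sketch short.

```python
import json
import urllib.request

API_URL = "https://api.lyrenth.com/v1/aidocument"


def build_aidocument_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def fetch_aidocument(page_url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON envelope. Raises on HTTP errors."""
    request = build_aidocument_request(page_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting request construction from sending makes the first half easy to unit-test without touching the network.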

The response is the grouped envelope, trimmed here to the parts you care about first:

{
  "schema":   { "name": "AIDocument", "version": "2.0" },
  "source":   { "url": ".../Web_indexing", "render_mode": "static", "status_code": 200 },
  "cache":    { "status": "hit", "origin_contacted": false },
  "identity": { "title": "Web indexing - Wikipedia", "language": "en" },
  "content":  { "markdown": "Web indexing or..." },
  "signals":  { "word_count": 1552, "reading_time": 7, "has_json_ld": true },
  "economics": {
    "raw_html_tokens_approx": 21331,
    "output_tokens_approx":   2715,
    "token_savings_percent":  0.873
  }
}

Look at the economics block. For that page, raw HTML came to roughly 21,331 tokens; the AIDocument content came to roughly 2,715. That is an 87.3% reduction, which the token_savings_percent field reports as the fraction 0.873. It is not a marketing figure pulled from the air: every response carries its own measured economics for the exact URL you asked for, computed on that call. Token counts use a four-characters-per-token approximation, so treat the exact figures as illustrative and the order of magnitude as the point.
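The arithmetic is easy to reproduce. The sketch below recomputes the savings from the two token counts in the sample response and shows the four-characters-per-token estimate. The helper names are illustrative, and rounding up in `approx_tokens` is an assumption: the post states the ratio, not the rounding rule.

```python
import math

CHARS_PER_TOKEN = 4  # the approximation described above


def approx_tokens(text: str) -> int:
    """Estimate tokens from character count (rounding up is an assumption)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def savings_fraction(raw_tokens: int, output_tokens: int) -> float:
    """Fraction of tokens avoided by reading the cleaned output instead of raw HTML."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return 1 - output_tokens / raw_tokens


# Values from the sample response: 1 - 2715 / 21331 is about 0.873
sample_savings = round(savings_fraction(21331, 2715), 3)
```

Running the same two functions over your own raw HTML and the returned Markdown gives you an independent check on the economics block.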

You do not have to trust any single number. You can measure it yourself. When we read the IETF's own Robots Exclusion Protocol standard (RFC 9309) through the same reader, the response header reported roughly 5,028 words of clean Markdown, 98% smaller than the raw HTML. (Measured via the Lyrenth reader response header for that URL, July 2026.) The savings scale with how much chrome a page carries: a heavy documentation or news page saves the most; a nearly empty page saves the least. The point holds either way. You stop paying for markup you never wanted.

What "structure" buys you

The content field is what most agents reach for first, but structure is where a lot of quiet value lives, because it is work you would otherwise redo on every site.

structure.headings gives you the document outline, already parsed, so you can chunk on real section boundaries instead of guessing. structure.links gives you the outbound links as data, which is exactly what a crawling or research agent needs to decide where to go next, without regexing anchors out of HTML. structure.images gives you the media. And structure.structured_data hands back the JSON-LD the publisher embedded: article metadata, product schemas, breadcrumbs, whatever the site author declared. That is machine-authored ground truth about the page, pre-extracted so your agent does not reimplement a JSON-LD parser.
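As a rough illustration of using that pre-extracted data, the sketch below lists the schema types a page declares. It assumes `structure.structured_data` is a list of JSON-LD objects; this post does not show the exact element shape, so treat that as a guess. `jsonld_types` is a hypothetical helper. The `@type` key itself is standard JSON-LD and may hold a string or a list.

```python
from typing import Any


def jsonld_types(doc: dict[str, Any]) -> list[str]:
    """Collect declared @type values, in first-seen order, without duplicates."""
    # Assumed shape: doc["structure"]["structured_data"] is a list of dicts.
    items = doc.get("structure", {}).get("structured_data") or []
    types: list[str] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        declared = item.get("@type", [])
        if isinstance(declared, str):
            declared = [declared]  # JSON-LD allows a single string or a list
        for type_name in declared:
            if type_name not in types:
                types.append(type_name)
    return types
```

An agent could use the result to branch early, for example treating a page that declares "Product" differently from one that declares "Article".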

Put together, the AIDocument is not just "the page as text." It is the page as text plus the page as structure, which is what an agent that reads for a living actually needs.

Crawl once, serve every agent

There is one more idea baked into the shape, and it is visible in the cache block above ("origin_contacted": false).

Reads resolve through a shared, cross-caller cache. The first agent to ask for a URL pays the origin fetch. Every agent after that is served the already-clean AIDocument from the shared index, without touching the origin site again within the freshness window. That is cheaper for you at scale and gentler on the sites being read, which matters if you care about being a good citizen of the web your agents depend on.

It also means freshness is a policy, not an accident. The default is cache-first. When you genuinely need the live version, you say so on the request (freshness_policy: "force_refresh") and take the origin fetch deliberately.
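The cache-first behaviour can be pictured with a toy in-memory model. This is only an illustration of the idea, not how the Lyrenth index is implemented: `SharedCache`, its parameters, and the injected `fetch_origin` callable are all hypothetical. It returns the document together with a flag that plays the role of `origin_contacted`.

```python
import time
from typing import Callable


class SharedCache:
    """Toy cross-caller cache: one origin fetch per URL per freshness window."""

    def __init__(
        self,
        fetch_origin: Callable[[str], dict],
        freshness_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_origin = fetch_origin
        self._freshness = freshness_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def read(self, url: str, force_refresh: bool = False) -> tuple[dict, bool]:
        """Return (document, origin_contacted)."""
        now = self._clock()
        entry = self._entries.get(url)
        is_fresh = entry is not None and now - entry[0] < self._freshness
        if is_fresh and not force_refresh:
            return entry[1], False  # served from cache, origin untouched

        # Cache miss, stale entry, or an explicit refresh: pay the origin fetch.
        document = self._fetch_origin(url)
        self._entries[url] = (now, document)
        return document, True
```

The first caller for a URL pays the fetch, later callers inside the window do not, and `force_refresh=True` makes the origin fetch a deliberate choice rather than a default.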

For the deeper economics of this, and why raw HTML wrecks a context window in the first place, see "How to feed web pages to an LLM without blowing the context window". For the philosophy underneath, why an index for machine readers looks nothing like an index for browsers, see "Agents don't browse, they read".

How to get one in under five minutes

You do not need to change how you build. There are three ways in, and they all return the same AIDocument shape.

The reader endpoint is the shortest path. One request, clean Markdown back:

curl -H "Authorization: Bearer $LYRENTH_KEY" \
  "https://api.lyrenth.com/v1/read?url=https://example.com/article"

The POST /v1/aidocument endpoint above gives you the full grouped envelope when you want the structure and economics, not just the body. Add max_tokens=N on either to cap the returned Markdown to your context budget, trimmed at a clean boundary.
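When you build the reader URL in code, it is safer to percent-encode the target URL so that its own query string cannot be confused with the reader's parameters. The sketch below does that with the standard library. It assumes `max_tokens` is passed as a query parameter on the reader endpoint, which is how the `max_tokens=N` wording reads; `build_read_url` is an illustrative name, not an SDK function.

```python
from typing import Optional
from urllib.parse import urlencode

READ_ENDPOINT = "https://api.lyrenth.com/v1/read"


def build_read_url(page_url: str, max_tokens: Optional[int] = None) -> str:
    """Build the GET URL for the reader endpoint, optionally capping the output."""
    params = {"url": page_url}
    if max_tokens is not None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be a positive integer")
        # Assumption: the cap is sent as a query parameter named max_tokens.
        params["max_tokens"] = str(max_tokens)
    return f"{READ_ENDPOINT}?{urlencode(params)}"
```

For example, `build_read_url("https://example.com/article", max_tokens=4000)` produces a URL whose `url` parameter is fully encoded, followed by `&max_tokens=4000`.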

If you work inside an assistant, the MCP server drops the same capability straight into Claude Desktop, Claude Code, Cursor, or any MCP client with one config block:

{
  "mcpServers": {
    "lyrenth": {
      "command": "npx",
      "args": ["-y", "lyrenth-mcp"],
      "env": { "LYRENTH_API_KEY": "your-key" }
    }
  }
}

That adds three tools: read_url, read_urls (batch, up to 20 URLs in one call), and check_usage. You can just ask the assistant to read a page, and it comes back as clean text.

And there are SDKs when you want the shape native to your language:

from lyrenth import Lyrenth

client = Lyrenth()  # reads LYRENTH_API_KEY
doc = client.read("https://example.com/article")
print(doc.title)
print(doc.markdown)      # cleaned, agent-ready body

The Python package (pip install lyrenth) and the Node package (npm i lyrenth-sdk) both ship batch reads, the max_tokens cap, and adapters for the common agent frameworks.

The takeaway

An AIDocument is a decision to stop treating web pages as documents to render and start treating them as content to read. One stable shape. Body, title, description, and structure. Real measured economics on every call. Crawled once and served to every agent that wants the same URL.

If your agent is spending its context window on cookie banners and nav menus, that is the problem this shape exists to remove. The full request and response, including the other endpoints and the error contract, live in the quickstart.

The free tier is 2,000 AIDocuments a month, no credit card. Point it at a URL your agent actually reads and check the economics block against your own numbers. That is the whole pitch: don't take the figure on faith, measure it on your own pages.