AIDocument v2.0
The canonical grouped JSON envelope every successful POST /v1/aidocument response returns, regardless of how the source page was rendered. Eight top-level groups, stable contract, backwards-compatible additions only, breaking changes earn a new major version. The machine-readable schema (draft-07) is published at /aidocument.schema.json.
{
"schema": {
"name": "AIDocument",
"version": "2.0",
"ref": "aidoc:sha256:7a14e9b2d8f3c901b42e5a77c0f19a34"
},
"source": {
"url": "https://example.com/page",
"canonical_url": "https://example.com/page",
"fetched_at": "2026-05-13T...",
"render_mode": "static" | "rendered" | "static_after_render_failure",
"status_code": 200,
"freshness_policy": "cache_first" | "force_refresh"
},
"cache": {
"status": "hit" | "miss" | "refreshed" | "stale_revalidated",
"origin_contacted": false,
"body_fetched": false
},
"identity": {
"title": "...",
"description": "...",
"language": "en",
"content_type": "article"
},
"content": {
"markdown": "# ..."
},
"structure": {
"headings": [{ "level": 1, "text": "...", "id": "..." }],
"links": [{ "url": "...", "text": "...", "internal": true, "rel": "..." }],
"images": [{ "url": "...", "alt": "..." }],
"structured_data": { /* parsed JSON-LD blocks, merged into one object */ }
},
"signals": {
"word_count": 1552,
"reading_time": 7,
"has_json_ld": true,
"heading_hierarchy_ok": true
},
"economics": {
"raw_html_tokens_approx": 121803,
"output_tokens_approx": 2460,
"token_savings": 119343,
"token_savings_percent": 0.978,
"estimated_cost_usd": {
"raw_html": 0.366,
"our_output": 0.0074,
"savings": 0.3586
},
"pricing_basis": {
"input_price_per_1k_usd": 0.003,
"model_class": "frontier-1m-context"
}
}
}schema
nameAlways "AIDocument". Constant identifier for the shape; lets multi-version consumers branch on it.
versionMajor.minor of the shape this response conforms to. v2.0 is the grouped envelope; bumped only on incompatible layout changes.
refContent-addressed identifier (aidoc:sha256:<32 hex>) of the document fingerprint. Stable across cache hits and re-crawls of an unchanged page; changes when markdown, URL, headings, or structured_data change. Use for citation.
source
urlThe URL you requested. Echoed verbatim.
canonical_urlPost-redirect, post-canonical-tag URL the page itself claims. Often equal to url; differs when the site uses canonical hints (paginated lists, language variants, etc.).
fetched_atRFC 3339 UTC timestamp of the underlying snapshot. On cache hits this can be older than the call itself.
render_modeHow we got the bytes: static (HTTP fetch), rendered (headless Chromium escalation), or static_after_render_failure (Chromium failed; we kept the static body).
status_codeUpstream HTTP status at fetch time. From the network event for the main document; never silently misreports a 404 as 200.
freshness_policyThe policy the CALLER requested (echoed back). "cache_first" is the default; "force_refresh" bypasses the cache lookup. The outcome lives in cache.status.
cache
statusWhat happened on the Lyrenth side for this call. "hit" = served from index; "miss" = cache-first lookup missed, fetched from origin; "refreshed" = force_refresh, fetched from origin; "stale_revalidated" = origin returned 304 Not Modified.
origin_contactedWhether this specific call resulted in a network request to the origin site. False on hits, true on miss/refreshed/stale_revalidated.
body_fetchedWhether we actually received a body from origin. False on hits and on 304 responses; true on full fetches.
identity
title / descriptionFrom the <title> tag and the meta description respectively. Both may be empty if the page omits them.
languageBCP 47 language tag detected from the <html lang> attribute.
content_typeInferred classification (article, product, listing, profile, ...). Useful for routing agents that branch on document kind.
content
markdownThe page body as cleaned markdown. Navigation, footers, ads, scripts stripped. This is what you feed to an LLM.
structure
headings[]The heading tree (h1-h6) with optional anchor id. Lets agents jump-link without re-parsing markdown.
links[]Outbound link graph (url + text + internal flag + rel). Filtered to actual content links; nav/footer chrome excluded.
images[]Image url + alt for every <img> in the body content.
structured_dataParsed JSON-LD from every <script type="application/ld+json"> block on the page, merged into a single object. Agents that prefer structured data over markdown read from here.
signals
word_countVisible-text word count of the cleaned AIDocument body. Agents can use it for context-window planning.
reading_timeApproximate reading time in minutes (word_count / 230).
has_json_ldTrue if the page declared any application/ld+json structured data block. Boolean, computed per-call.
heading_hierarchy_okTrue if there is at least one heading, the first is h1 or h2, and no adjacent levels jump by more than 1.
economics
raw_html_tokens_approxApproximate token count if you sent the raw HTML (with all the noise) directly to a frontier LLM. Computed via a fast tokenizer approximation.
output_tokens_approxToken count of our cleaned output. Always meaningfully smaller than raw_html_tokens_approx.
token_savings_percentFraction (0.0-1.0) saved by using us vs sending raw HTML. Real measurements: 0.873 on Wikipedia, 0.988 on NYT homepage, 0.992 on Stripe API docs.
estimated_cost_usdPer-call dollar amounts at frontier-model input rates. The savings field is what your finance team cares about.
pricing_basisThe input price and model class assumed for the cost math. Carried with every response so you can re-run the numbers against your own rates.
Contract guarantees
- Backward-compatible additions only. New top-level groups and new fields within existing groups may appear over time; existing fields never change shape within a major version.
- Breaking changes earn a new major version. v3 would be a new endpoint or accept-header negotiation; existing v2 callers continue working unchanged.
- Cache state is observable.
cache.status+ the two booleans tell you exactly what happened on our side for this call. 304 Not Modified is unambiguous; force_refresh is distinguishable from a regular cache miss. - Render mode is observable.
source.render_modetells you whether the bytes came from static fetch or headless Chromium. You can filter or score by this if your downstream cares. - Status codes are real.
source.status_codereflects the network event, not the rendered DOM. A 404 stays a 404 regardless of whether the page's SPA shell renders something. - Economics numbers are computed, not estimated. The token counts are approximate (tokenizer approximation), but the dollar math derives from real token rates configured server-side and surfaced in
economics.pricing_basis. - Document refs are stable.
schema.refhashes the non-volatile document fingerprint (URL, title, markdown, structure). A re-crawl of an unchanged page produces the same ref; a re-crawl with changed content produces a new one. Use it for citation.
Index lookup: GET /v1/document + fields=
GET /v1/document?url=... is the index-only lookup: it returns a cached AIDocument if we already have one for that URL and 404s otherwise (never crawls). It supports a fields= projection that trims the response to only the listed top-level keys for callers who want a smaller payload. The resolving endpoint POST /v1/aidocument currently returns the full grouped envelope; per-group projection support on /v1/aidocument is on the roadmap.
Examples
# Full cached AIDocument (default) GET /v1/document?url=https://example.com/post # Title + markdown only: what most reading agents need GET /v1/document?url=https://example.com/post&fields=title,markdown # Title, description, headings: for routing / classification GET /v1/document?url=https://example.com/post&fields=title,description,headings # Lightest answer: title + canonical URL GET /v1/document?url=https://example.com/post&fields=title,canonical_url # With economics: numbers reflect the projected payload, not the full doc GET /v1/document?url=https://example.com/post&fields=title,markdown,economics
Allowed fields
Any of the top-level keys on the cached document:
canonical_url crawl description economics headings images links markdown meta structured_data title url
Unknown field names return HTTP 400 with aninvalid_fields error and the full allowed-set echoed back, so typos surface immediately. The cached shape is flat; the v2 grouped envelope above is the wire shape that ships through POST /v1/aidocument.
Ready to call the API?
The quickstart walks through auth, your first AIDocument, and error handling.