AIDocument v2.0

The canonical grouped JSON envelope every successful POST /v1/aidocument response returns, regardless of how the source page was rendered. Eight top-level groups, stable contract, backwards-compatible additions only, breaking changes earn a new major version. The machine-readable schema (draft-07) is published at /aidocument.schema.json.

{
  "schema": {
    "name":    "AIDocument",
    "version": "2.0",
    "ref":     "aidoc:sha256:7a14e9b2d8f3c901b42e5a77c0f19a34"
  },
  "source": {
    "url":              "https://example.com/page",
    "canonical_url":    "https://example.com/page",
    "fetched_at":       "2026-05-13T...",
    "render_mode":      "static" | "rendered" | "static_after_render_failure",
    "status_code":      200,
    "freshness_policy": "cache_first" | "force_refresh"
  },
  "cache": {
    "status":           "hit" | "miss" | "refreshed" | "stale_revalidated",
    "origin_contacted": false,
    "body_fetched":     false
  },
  "identity": {
    "title":        "...",
    "description":  "...",
    "language":     "en",
    "content_type": "article"
  },
  "content": {
    "markdown": "# ..."
  },
  "structure": {
    "headings":        [{ "level": 1, "text": "...", "id": "..." }],
    "links":           [{ "url": "...", "text": "...", "internal": true, "rel": "..." }],
    "images":          [{ "url": "...", "alt": "..." }],
    "structured_data": { /* parsed JSON-LD blocks, merged into one object */ }
  },
  "signals": {
    "word_count":           1552,
    "reading_time":         7,
    "has_json_ld":          true,
    "heading_hierarchy_ok": true
  },
  "economics": {
    "raw_html_tokens_approx": 121803,
    "output_tokens_approx":   2460,
    "token_savings":          119343,
    "token_savings_percent":  0.978,
    "estimated_cost_usd": {
      "raw_html":   0.366,
      "our_output": 0.0074,
      "savings":    0.3586
    },
    "pricing_basis": {
      "input_price_per_1k_usd": 0.003,
      "model_class":            "frontier-1m-context"
    }
  }
}

schema

name
Always "AIDocument". Constant identifier for the shape; lets multi-version consumers branch on it.
version
Major.minor of the shape this response conforms to. v2.0 is the grouped envelope; bumped only on incompatible layout changes.
ref
Content-addressed identifier (aidoc:sha256:<32 hex>) of the document fingerprint. Stable across cache hits and re-crawls of an unchanged page; changes when markdown, URL, headings, or structured_data change. Use for citation.

source

url
The URL you requested. Echoed verbatim.
canonical_url
Post-redirect, post-canonical-tag URL the page itself claims. Often equal to url; differs when the site uses canonical hints (paginated lists, language variants, etc.).
fetched_at
RFC 3339 UTC timestamp of the underlying snapshot. On cache hits this can be older than the call itself.
render_mode
How we got the bytes: static (HTTP fetch), rendered (headless Chromium escalation), or static_after_render_failure (Chromium failed; we kept the static body).
status_code
Upstream HTTP status at fetch time. From the network event for the main document; never silently misreports a 404 as 200.
freshness_policy
The policy the CALLER requested (echoed back). "cache_first" is the default; "force_refresh" bypasses the cache lookup. The outcome lives in cache.status.

cache

status
What happened on the Lyrenth side for this call. "hit" = served from index; "miss" = cache-first lookup missed, fetched from origin; "refreshed" = force_refresh, fetched from origin; "stale_revalidated" = origin returned 304 Not Modified.
origin_contacted
Whether this specific call resulted in a network request to the origin site. False on hits, true on miss/refreshed/stale_revalidated.
body_fetched
Whether we actually received a body from origin. False on hits and on 304 responses; true on full fetches.

identity

title / description
From the <title> tag and the meta description respectively. Both may be empty if the page omits them.
language
BCP 47 language tag detected from the <html lang> attribute.
content_type
Inferred classification (article, product, listing, profile, ...). Useful for routing agents that branch on document kind.

content

markdown
The page body as cleaned markdown. Navigation, footers, ads, scripts stripped. This is what you feed to an LLM.

structure

headings[]
The heading tree (h1-h6) with optional anchor id. Lets agents jump-link without re-parsing markdown.
links[]
Outbound link graph (url + text + internal flag + rel). Filtered to actual content links; nav/footer chrome excluded.
images[]
Image url + alt for every <img> in the body content.
structured_data
Parsed JSON-LD from every <script type="application/ld+json"> block on the page, merged into a single object. Agents that prefer structured data over markdown read from here.

signals

word_count
Visible-text word count of the cleaned AIDocument body. Agents can use it for context-window planning.
reading_time
Approximate reading time in minutes (word_count / 230).
has_json_ld
True if the page declared any application/ld+json structured data block. Boolean, computed per-call.
heading_hierarchy_ok
True if there is at least one heading, the first is h1 or h2, and no adjacent levels jump by more than 1.

economics

raw_html_tokens_approx
Approximate token count if you sent the raw HTML (with all the noise) directly to a frontier LLM. Computed via a fast tokenizer approximation.
output_tokens_approx
Token count of our cleaned output. Always meaningfully smaller than raw_html_tokens_approx.
token_savings_percent
Fraction (0.0-1.0) saved by using us vs sending raw HTML. Real measurements: 0.873 on Wikipedia, 0.988 on NYT homepage, 0.992 on Stripe API docs.
estimated_cost_usd
Per-call dollar amounts at frontier-model input rates. The savings field is what your finance team cares about.
pricing_basis
The input price and model class assumed for the cost math. Carried with every response so you can re-run the numbers against your own rates.

Contract guarantees

Backward-compatible additions only. New top-level groups and new fields within existing groups may appear over time; existing fields never change shape within a major version.
Breaking changes earn a new major version. v3 would be a new endpoint or accept-header negotiation; existing v2 callers continue working unchanged.
Cache state is observable. cache.status + the two booleans tell you exactly what happened on our side for this call. 304 Not Modified is unambiguous; force_refresh is distinguishable from a regular cache miss.
Render mode is observable. source.render_mode tells you whether the bytes came from static fetch or headless Chromium. You can filter or score by this if your downstream cares.
Status codes are real. source.status_codereflects the network event, not the rendered DOM. A 404 stays a 404 regardless of whether the page's SPA shell renders something.
Economics numbers are computed, not estimated. The token counts are approximate (tokenizer approximation), but the dollar math derives from real token rates configured server-side and surfaced in economics.pricing_basis.
Document refs are stable. schema.ref hashes the non-volatile document fingerprint (URL, title, markdown, structure). A re-crawl of an unchanged page produces the same ref; a re-crawl with changed content produces a new one. Use it for citation.

Index lookup: GET /v1/document + fields=

GET /v1/document?url=... is the index-only lookup: it returns a cached AIDocument if we already have one for that URL and 404s otherwise (never crawls). It supports a fields= projection that trims the response to only the listed top-level keys for callers who want a smaller payload. The resolving endpoint POST /v1/aidocument currently returns the full grouped envelope; per-group projection support on /v1/aidocument is on the roadmap.

Examples

# Full cached AIDocument (default)
GET /v1/document?url=https://example.com/post

# Title + markdown only: what most reading agents need
GET /v1/document?url=https://example.com/post&fields=title,markdown

# Title, description, headings: for routing / classification
GET /v1/document?url=https://example.com/post&fields=title,description,headings

# Lightest answer: title + canonical URL
GET /v1/document?url=https://example.com/post&fields=title,canonical_url

# With economics: numbers reflect the projected payload, not the full doc
GET /v1/document?url=https://example.com/post&fields=title,markdown,economics

Allowed fields

Any of the top-level keys on the cached document:

canonical_url   crawl           description     economics
headings        images          links           markdown
meta            structured_data title           url

Unknown field names return HTTP 400 with aninvalid_fields error and the full allowed-set echoed back, so typos surface immediately. The cached shape is flat; the v2 grouped envelope above is the wire shape that ships through POST /v1/aidocument.

Ready to call the API?

The quickstart walks through auth, your first AIDocument, and error handling.

Quickstart Back to docs