The shape

AIDocument v2.0

The canonical grouped JSON envelope returned by every successful POST /v1/aidocument response, regardless of how the source page was rendered. It has eight top-level groups and a stable contract: additions are backward-compatible, and breaking changes earn a new major version. The machine-readable schema (JSON Schema draft-07) is published at /aidocument.schema.json.

{
  "schema": {
    "name":    "AIDocument",
    "version": "2.0",
    "ref":     "aidoc:sha256:7a14e9b2d8f3c901b42e5a77c0f19a34"
  },
  "source": {
    "url":              "https://example.com/page",
    "canonical_url":    "https://example.com/page",
    "fetched_at":       "2026-05-13T...",
    "render_mode":      "static" | "rendered" | "static_after_render_failure",
    "status_code":      200,
    "freshness_policy": "cache_first" | "force_refresh"
  },
  "cache": {
    "status":           "hit" | "miss" | "refreshed" | "stale_revalidated",
    "origin_contacted": false,
    "body_fetched":     false
  },
  "identity": {
    "title":        "...",
    "description":  "...",
    "language":     "en",
    "content_type": "article"
  },
  "content": {
    "markdown": "# ..."
  },
  "structure": {
    "headings":        [{ "level": 1, "text": "...", "id": "..." }],
    "links":           [{ "url": "...", "text": "...", "internal": true, "rel": "..." }],
    "images":          [{ "url": "...", "alt": "..." }],
    "structured_data": { /* parsed JSON-LD blocks, merged into one object */ }
  },
  "signals": {
    "word_count":           1552,
    "reading_time":         7,
    "has_json_ld":          true,
    "heading_hierarchy_ok": true
  },
  "economics": {
    "raw_html_tokens_approx": 121803,
    "output_tokens_approx":   2460,
    "token_savings":          119343,
    "token_savings_percent":  0.978,
    "estimated_cost_usd": {
      "raw_html":   0.366,
      "our_output": 0.0074,
      "savings":    0.3586
    },
    "pricing_basis": {
      "input_price_per_1k_usd": 0.003,
      "model_class":            "frontier-1m-context"
    }
  }
}
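
A consumer may want to sanity-check the envelope before reading from it. The following is a minimal sketch, assuming the response body has already been decoded into a Python dict; the helper name missing_groups is hypothetical and not part of any published client. It treats the eight groups shown above as required and ignores unknown keys, since additions are allowed within a major version.

```python
# The eight top-level groups of the v2.0 envelope shown above.
REQUIRED_GROUPS = frozenset({
    "schema", "source", "cache", "identity",
    "content", "structure", "signals", "economics",
})


def missing_groups(doc: dict) -> list:
    """Return the required groups absent from a decoded response, sorted.

    Unknown top-level keys are ignored on purpose: the contract allows
    backward-compatible additions within a major version.
    """
    return sorted(REQUIRED_GROUPS - set(doc))
```

An empty list means all eight groups are present; anything else names the groups a caller should not try to read.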

schema

  • name

    Always "AIDocument". Constant identifier for the shape; lets multi-version consumers branch on it.

  • version

    Major.minor of the shape this response conforms to. v2.0 is the grouped envelope; bumped only on incompatible layout changes.

  • ref

    Content-addressed identifier (aidoc:sha256:<32 hex>) of the document fingerprint. Stable across cache hits and re-crawls of an unchanged page; changes when markdown, URL, headings, or structured_data change. Use for citation.
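
A multi-version consumer might gate on the schema group before touching anything else. This is a sketch under stated assumptions: the function name is_supported is hypothetical, comparing only the major component of version is one reasonable reading of the versioning rule, and lowercase hex in ref is inferred from the sample envelope rather than stated by the docs.

```python
import re

# "aidoc:sha256:" followed by 32 hex characters, per the ref description.
# Lowercase-only hex is an assumption based on the sample envelope.
REF_PATTERN = re.compile(r"aidoc:sha256:[0-9a-f]{32}")


def is_supported(schema: dict, supported_major: int = 2) -> bool:
    """True if the schema group names AIDocument at the expected major
    version and carries a well-formed ref."""
    if schema.get("name") != "AIDocument":
        return False
    # Only the major component is compared, so a later 2.x still passes.
    major = str(schema.get("version", "")).split(".")[0]
    if major != str(supported_major):
        return False
    return REF_PATTERN.fullmatch(str(schema.get("ref", ""))) is not None
```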

source

  • url

    The URL you requested. Echoed verbatim.

  • canonical_url

    Post-redirect, post-canonical-tag URL the page itself claims. Often equal to url; differs when the site uses canonical hints (paginated lists, language variants, etc.).

  • fetched_at

    RFC 3339 UTC timestamp of the underlying snapshot. On cache hits this can be older than the call itself.

  • render_mode

    How we got the bytes: static (HTTP fetch), rendered (headless Chromium escalation), or static_after_render_failure (Chromium failed; we kept the static body).

  • status_code

    Upstream HTTP status at fetch time. From the network event for the main document; never silently misreports a 404 as 200.

  • freshness_policy

    The policy the CALLER requested (echoed back). "cache_first" is the default; "force_refresh" bypasses the cache lookup. The outcome lives in cache.status.
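
Because fetched_at describes the snapshot rather than the call, a caller may want to measure how old a cached document is before deciding whether to retry with force_refresh. A minimal sketch, assuming fetched_at always carries a UTC designator or offset as RFC 3339 requires; snapshot_age_seconds is a hypothetical helper name.

```python
from datetime import datetime, timezone


def snapshot_age_seconds(fetched_at: str, now: datetime) -> float:
    """Seconds between the snapshot timestamp and `now` (timezone-aware)."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so normalise it to an explicit UTC offset first.
    fetched = datetime.fromisoformat(fetched_at.replace("Z", "+00:00"))
    return (now - fetched).total_seconds()


# Typical call site, with `doc` being a decoded response:
# age = snapshot_age_seconds(doc["source"]["fetched_at"], datetime.now(timezone.utc))
```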

cache

  • status

    What happened on the Lyrenth side for this call. "hit" = served from index; "miss" = cache-first lookup missed, fetched from origin; "refreshed" = force_refresh, fetched from origin; "stale_revalidated" = origin returned 304 Not Modified.

  • origin_contacted

    Whether this specific call resulted in a network request to the origin site. False on hits, true on miss/refreshed/stale_revalidated.

  • body_fetched

    Whether we actually received a body from origin. False on hits and on 304 responses; true on full fetches.
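
The three cache fields are described as moving together, which can be written down as a lookup table. The sketch below is one reading of the descriptions above, not a published invariant; EXPECTED_FLAGS and cache_is_consistent are hypothetical names.

```python
# Expected (origin_contacted, body_fetched) for each cache.status value,
# as read from the field descriptions above.
EXPECTED_FLAGS = {
    "hit":               (False, False),  # served from the index
    "miss":              (True,  True),   # full fetch from origin
    "refreshed":         (True,  True),   # force_refresh, full fetch
    "stale_revalidated": (True,  False),  # origin answered 304, no body
}


def cache_is_consistent(cache: dict) -> bool:
    """True if the two booleans match what the status value implies."""
    expected = EXPECTED_FLAGS.get(cache.get("status"))
    if expected is None:
        return False  # unknown status value
    return (cache.get("origin_contacted"), cache.get("body_fetched")) == expected
```

A check like this is mostly useful in tests or monitoring, where a mismatch would suggest the response was altered in transit or the reading of the contract above is wrong.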

identity

  • title / description

    From the <title> tag and the meta description respectively. Both may be empty if the page omits them.

  • language

    BCP 47 language tag detected from the <html lang> attribute.

  • content_type

    Inferred classification (article, product, listing, profile, ...). Useful for routing agents that branch on document kind.

content

  • markdown

    The page body as cleaned markdown. Navigation, footers, ads, scripts stripped. This is what you feed to an LLM.

structure

  • headings[]

    The heading tree (h1-h6) with optional anchor id. Lets agents jump-link without re-parsing markdown.

  • links[]

    Outbound link graph (url + text + internal flag + rel). Filtered to actual content links; nav/footer chrome excluded.

  • images[]

    Image url + alt for every <img> in the body content.

  • structured_data

    Parsed JSON-LD from every <script type="application/ld+json"> block on the page, merged into a single object. Agents that prefer structured data over markdown read from here.
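
The structure group can be consumed without touching the markdown. As an illustration, the sketch below renders headings as an indented outline and picks out external links; both helper names are hypothetical, and the code assumes each entry has the keys shown in the sample envelope.

```python
def outline(headings: list) -> str:
    """Render structure.headings as an indented plain-text outline."""
    lines = []
    for h in headings:
        indent = "  " * (h["level"] - 1)
        # The anchor id is optional, so only show it when present.
        anchor = f" (#{h['id']})" if h.get("id") else ""
        lines.append(f"{indent}{h['text']}{anchor}")
    return "\n".join(lines)


def external_links(links: list) -> list:
    """URLs of links whose internal flag is false or missing."""
    return [link["url"] for link in links if not link.get("internal")]
```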

signals

  • word_count

    Visible-text word count of the cleaned AIDocument body. Agents can use it for context-window planning.

  • reading_time

    Approximate reading time in minutes (word_count / 230).

  • has_json_ld

    True if the page declared any application/ld+json structured data block. Boolean, computed per-call.

  • heading_hierarchy_ok

    True if there is at least one heading, the first is h1 or h2, and no adjacent levels jump by more than 1.
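
The two derived signals can be reproduced client-side, which helps when checking a projection or a test fixture. This is a sketch, not the server's implementation. Two points are assumptions: that reading_time is rounded to the nearest minute, and that only jumps to a deeper level count as a hierarchy break (h2 to h4 fails, h4 back to h2 passes).

```python
def reading_time_minutes(word_count: int, words_per_minute: int = 230) -> int:
    # Rounding to the nearest minute is an assumption; the docs give only the ratio.
    return round(word_count / words_per_minute)


def heading_hierarchy_ok(levels: list) -> bool:
    """levels: heading levels in document order, e.g. [1, 2, 3, 2]."""
    # At least one heading, and the first must be h1 or h2.
    if not levels or levels[0] not in (1, 2):
        return False
    # Assumption: only a jump to a deeper level of more than 1 is a violation.
    return all(nxt - prev <= 1 for prev, nxt in zip(levels, levels[1:]))
```

With the sample envelope's word_count of 1552, reading_time_minutes returns 7, matching the reading_time shown there.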

economics

  • raw_html_tokens_approx

    Approximate token count if you sent the raw HTML (with all the noise) directly to a frontier LLM. Computed via a fast tokenizer approximation.

  • output_tokens_approx

    Token count of our cleaned output. Always meaningfully smaller than raw_html_tokens_approx.

  • token_savings_percent

    Fraction of tokens saved (0.0-1.0, not 0-100, despite the field name) compared with sending raw HTML. Real measurements: 0.873 on Wikipedia, 0.988 on NYT homepage, 0.992 on Stripe API docs.

  • estimated_cost_usd

    Per-call dollar amounts at frontier-model input rates. The savings field is what your finance team cares about.

  • pricing_basis

    The input price and model class assumed for the cost math. Carried with every response so you can re-run the numbers against your own rates.
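
Re-running the numbers against your own rates is plain arithmetic on the two token counts. A minimal sketch, assuming cost is tokens / 1000 multiplied by the per-1k input price; the function name and the keys of the returned dict are hypothetical, and the server may round differently.

```python
def recompute_economics(raw_tokens: int, output_tokens: int,
                        input_price_per_1k_usd: float) -> dict:
    """Re-derive savings from the token counts at a caller-supplied input rate."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    raw_cost = raw_tokens / 1000 * input_price_per_1k_usd
    out_cost = output_tokens / 1000 * input_price_per_1k_usd
    return {
        "token_savings": raw_tokens - output_tokens,
        "token_savings_fraction": 1 - output_tokens / raw_tokens,
        "raw_html_usd": raw_cost,
        "our_output_usd": out_cost,
        "savings_usd": raw_cost - out_cost,
    }
```

Passing economics.raw_html_tokens_approx, economics.output_tokens_approx, and your own price in place of pricing_basis.input_price_per_1k_usd gives figures comparable to estimated_cost_usd.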

Contract guarantees

  • Backward-compatible additions only. New top-level groups and new fields within existing groups may appear over time; existing fields never change shape within a major version.
  • Breaking changes earn a new major version. v3 would be a new endpoint or accept-header negotiation; existing v2 callers continue working unchanged.
  • Cache state is observable. cache.status + the two booleans tell you exactly what happened on our side for this call. 304 Not Modified is unambiguous; force_refresh is distinguishable from a regular cache miss.
  • Render mode is observable. source.render_mode tells you whether the bytes came from static fetch or headless Chromium. You can filter or score by this if your downstream cares.
  • Status codes are real. source.status_code reflects the network event, not the rendered DOM. A 404 stays a 404 regardless of whether the page's SPA shell renders something.
  • Economics numbers are computed, not guessed. The token counts are approximate (tokenizer approximation), but the dollar math derives from the token rates configured server-side and surfaced in economics.pricing_basis.
  • Document refs are stable. schema.ref hashes the non-volatile document fingerprint (URL, title, markdown, structure). A re-crawl of an unchanged page produces the same ref; a re-crawl with changed content produces a new one. Use it for citation.

Index lookup: GET /v1/document + fields=

GET /v1/document?url=... is the index-only lookup: it returns a cached AIDocument if we already have one for that URL and 404s otherwise (never crawls). It supports a fields= projection that trims the response to only the listed top-level keys for callers who want a smaller payload. The resolving endpoint POST /v1/aidocument currently returns the full grouped envelope; per-group projection support on /v1/aidocument is on the roadmap.

Examples

# Full cached AIDocument (default)
GET /v1/document?url=https://example.com/post

# Title + markdown only: what most reading agents need
GET /v1/document?url=https://example.com/post&fields=title,markdown

# Title, description, headings: for routing / classification
GET /v1/document?url=https://example.com/post&fields=title,description,headings

# Lightest answer: title + canonical URL
GET /v1/document?url=https://example.com/post&fields=title,canonical_url

# With economics: numbers reflect the projected payload, not the full doc
GET /v1/document?url=https://example.com/post&fields=title,markdown,economics

Allowed fields

Any of the top-level keys on the cached document:

canonical_url   crawl           description     economics
headings        images          links           markdown
meta            structured_data title           url

Unknown field names return HTTP 400 with an invalid_fields error and the full allowed set echoed back, so typos surface immediately. The cached shape is flat; the v2 grouped envelope above is the wire shape that ships through POST /v1/aidocument.
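
Since unknown names are rejected with a 400, a client can catch typos before the request leaves the process. The sketch below builds the lookup URL and validates fields against the allowed set listed above. API_BASE is a placeholder host and document_lookup_url is a hypothetical helper; neither comes from the docs.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

# Allowed projection fields, copied from the list above.
ALLOWED_FIELDS = frozenset({
    "canonical_url", "crawl", "description", "economics",
    "headings", "images", "links", "markdown",
    "meta", "structured_data", "title", "url",
})

# Placeholder host (an assumption); substitute the real API base URL.
API_BASE = "https://api.example.com"


def document_lookup_url(page_url: str,
                        fields: Optional[Sequence[str]] = None) -> str:
    """Build a GET /v1/document URL, rejecting unknown fields client-side."""
    params = {"url": page_url}
    if fields:
        unknown = sorted(set(fields) - ALLOWED_FIELDS)
        if unknown:
            raise ValueError(f"unknown fields: {', '.join(unknown)}")
        params["fields"] = ",".join(fields)
    # safe="," keeps the comma-separated list readable in the query string.
    return f"{API_BASE}/v1/document?{urlencode(params, safe=',')}"
```

The page URL is percent-encoded by urlencode, so query strings inside the target URL do not collide with the fields parameter.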
