LYRENTH
July 5, 2026 · agents · infrastructure

Agents don't browse, they read: what an index for machine readers looks like

Humans browse the web through layout and pixels; AI agents read it as text, structure, and provenance. That difference inverts how an index should be built.

A person and an AI agent visit the same web page, and they are not doing the same thing. The person browses: their eyes land on a headline, skip the cookie banner, ignore the nav, follow the layout down to the paragraph that answers their question. The agent has no eyes and no layout. It receives a stream of text, and it reads. That difference sounds small. It is the whole argument for building the web's index differently for machines than we built it for people.

For twenty-five years the web's infrastructure has assumed a human reader at the end of the pipe. Pages are designed to be looked at: columns, images, sticky headers, interstitials, the visual furniture that helps a person orient. Search engines learned to work around all of it, treating the page as a bag of ranked signals pointing a human toward a destination they would then browse themselves. The index existed to route attention. The reading happened in the browser, by a person.

Agents break that assumption at both ends. They do not browse toward a destination and read it there; the reading is the task. And they do not experience layout at all. Feed an agent a raw HTML page and it does not see a clean article with some chrome around it. It sees the chrome and the article as one undifferentiated string: navigation lists, script tags, style blocks, tracking pixels, consent-manager markup, three variants of the same heading for three breakpoints, and somewhere in there, the content. The page was optimized for a reader that renders. The agent is a reader that parses.

What agents actually consume

Strip away the assumption of a human at the end of the pipe and you can name precisely what an agent needs from a page. It needs the text, because the text is what it reasons over. It needs the structure, because headings, lists, links, and tables carry meaning that flat prose loses, and because that structure lets the agent navigate a document the way a person navigates a layout. And it needs the provenance: what this page is, who published it, when it last changed, whether the publisher stands behind this version. None of those three things is pixels. All three are the parts of a page that current infrastructure treats as incidental.

Raw HTML buries the signal in noise, and for an agent the noise is not merely ugly, it is expensive and misleading. Expensive, because every token of navigation and boilerplate is a token the model pays for and a token that crowds the context window. On real pages the ratio is stark: the readable content is a thin layer over a thick base of markup. Measured against the raw HTML of well-known pages, the clean text comes to on the order of 87 to 99 percent fewer tokens in Lyrenth's own measurements, depending on the page, and the exact figure is returned per URL rather than asserted as a global constant. Misleading, because a cookie banner and a sidebar of unrelated links are, to a model, just more text that might be relevant. Clean input is not a cosmetic nicety for a machine reader. It is the difference between reasoning over the content and reasoning over the page's plumbing.
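To make the token arithmetic concrete, here is a minimal sketch of how one might compare a raw page against its visible text. It rests on loud assumptions: the "tokenizer" is a crude regex proxy rather than a real model tokenizer, the extractor only drops script and style blocks (navigation text survives), and the names `TextExtractor`, `rough_token_count`, and `token_reduction` are hypothetical. It does not reproduce Lyrenth's pipeline or its measured figures.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def rough_token_count(text):
    # Assumption: words and single punctuation marks stand in for tokens.
    return len(re.findall(r"\w+|[^\w\s]", text))


def token_reduction(raw_html):
    """Return (raw_tokens, clean_tokens, fraction_saved) for one page."""
    extractor = TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    clean_text = " ".join(extractor.parts)
    raw_tokens = rough_token_count(raw_html)
    clean_tokens = rough_token_count(clean_text)
    if raw_tokens == 0:
        return 0, 0, 0.0
    return raw_tokens, clean_tokens, 1 - clean_tokens / raw_tokens


sample = """<html><head><style>body{margin:0}</style>
<script>track('x');</script></head>
<body><nav><a href="/">Home</a><a href="/pricing">Pricing</a></nav>
<article><h1>Title</h1><p>The content an agent needs.</p></article>
</body></html>"""

raw, clean, saved = token_reduction(sample)
print(f"raw={raw} clean={clean} saved={saved:.0%}")
```

Even on this toy page the markup outweighs the content. A real comparison would use the target model's own tokenizer and a proper content extractor.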

So the job is not "make the web look nice for agents." The web has no look for an agent. The job is to hand the agent the three things it reads (text, structure, provenance) in one shape it can consume without reimplementing a browser. That shape is what we call an AIDocument: a Markdown body, a title and description, the page's headings and links and pre-extracted structured data, and the metadata that says where this came from and when. One clean, stable shape, the same for every reader.
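As an illustration only, the shape described above could be modeled like this. Every field name is an assumption made for this sketch, not the published AIDocument contract; the post only states which kinds of information the document carries.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Link:
    text: str
    href: str


@dataclass
class AIDocument:
    # Hypothetical field names; the real contract may differ.
    source_url: str      # where this came from
    retrieved_at: str    # when, as an ISO 8601 timestamp
    title: str
    description: str
    markdown: str        # the body the agent reasons over
    headings: List[str] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)
    structured_data: List[Dict[str, Any]] = field(default_factory=list)


doc = AIDocument(
    source_url="https://example.com/post",
    retrieved_at="2026-07-05T00:00:00Z",
    title="Example post",
    description="A placeholder page used for illustration.",
    markdown="# Example post\n\nBody text.",
    headings=["Example post"],
    links=[Link(text="Home", href="https://example.com/")],
)
print(doc.title, len(doc.links))
```

The point of the sketch is that an agent can reach the headings, links, and metadata as plain fields, without parsing HTML itself.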

The inversion

Once you take machine reading as the primary use, several long-standing assumptions about indexing turn over.

Render for meaning, not for pixels. A traditional pipeline that cares about pixels renders a page to see how it looks. An index for readers renders a page to recover what it says. Modern sites assemble their content with JavaScript, hide it inside shadow DOM, or wrap it in consent overlays that a naive fetch never gets past, so recovering the meaning still means running a real browser when the page demands one. But the goal of that render is different: not a screenshot, a clean extraction of the words and their structure. You render because the meaning is behind the JavaScript, not because you need to know where the sidebar sits. (The token side of that, why raw markup is so costly to feed a model and how to trim it cleanly, is its own subject: feeding web pages to an LLM without blowing the context window.)

One canonical clean shape, not a thousand bespoke parses. When every agent developer writes their own HTML-to-text extractor, the web gets read a thousand slightly different ways, each brittle in its own places, each breaking when a site ships a redesign. An index inverts that: extract once, well, into a shape with a defined contract, and let every reader consume the same thing. The parsing problem gets solved in one place by people whose job it is, instead of re-solved badly in every agent.

Freshness by signal, not by timer. The old model recrawls on a schedule because it has no other way to know a page changed; it guesses an interval and lives with being stale between guesses. But a publisher knows exactly when their page changed, because they changed it. An index built for readers can take that signal directly: the publisher pushes, and we re-resolve on their change rather than on a clock. A timer is a proxy for freshness. The publisher's own signal is freshness itself.
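The contrast between the two models can be sketched in a few lines. This is a toy, not Lyrenth's mechanism: `TimerIndex`, `SignalIndex`, and `on_publisher_change` are hypothetical names, the origin is an in-memory dict, and time is passed in as a plain number.

```python
class TimerIndex:
    """Re-resolves a URL only after a fixed interval has elapsed."""

    def __init__(self, resolve, interval_seconds):
        self._resolve = resolve
        self._interval = interval_seconds
        self._cache = {}  # url -> (resolved_at, document)

    def read(self, url, now):
        entry = self._cache.get(url)
        if entry is None or now - entry[0] >= self._interval:
            entry = (now, self._resolve(url))
            self._cache[url] = entry
        return entry[1]


class SignalIndex:
    """Re-resolves a URL when the publisher says it changed."""

    def __init__(self, resolve):
        self._resolve = resolve
        self._cache = {}  # url -> document

    def read(self, url):
        if url not in self._cache:
            self._cache[url] = self._resolve(url)
        return self._cache[url]

    def on_publisher_change(self, url):
        # The publisher's push triggers the re-resolve, not a clock.
        self._cache[url] = self._resolve(url)


url = "https://example.com/post"
origin = {url: "version 1"}  # stand-in for the publisher's site

timer_index = TimerIndex(origin.get, interval_seconds=3600)
signal_index = SignalIndex(origin.get)
timer_index.read(url, now=0)
signal_index.read(url)

# The publisher edits the page and pushes a change signal.
origin[url] = "version 2"
signal_index.on_publisher_change(url)

stale_read = timer_index.read(url, now=600)  # still inside the interval
fresh_read = signal_index.read(url)
print(stale_read, "|", fresh_read)
```

Ten minutes after the edit, the timer-based index still serves the old version, while the signal-based index already serves the new one.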

Consent and provenance as first-class fields, not afterthoughts. For a human browsing, provenance is ambient: they see the domain in the address bar, they recognize the brand, they judge the source with their own eyes. An agent has none of that context unless the index carries it. So who published this, whether they verified the domain, whether the version served is the one they authored, and whether they consented to how it is used all become fields in the document, not vibes in a browser. This is the leg that ties the two sides of the index together. When a publisher verifies their domain, the version an agent reads is the one the publisher authored, and every verification makes the read layer more trustworthy while every read makes verification more valuable. Provenance stops being metadata about the page and becomes part of what the page is, to a reader that cannot see the address bar. (The publisher side of this has its own guide.)
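One way to picture provenance as fields rather than ambient context is the sketch below. The field names and the `provenance_note` helper are hypothetical, chosen to mirror the four questions in the paragraph above. They are not the actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    # Hypothetical fields mirroring the questions an agent cannot answer by eye.
    publisher: str
    domain_verified: bool
    publisher_authored: bool
    usage_consent: Optional[bool]  # None when the publisher has not said


def provenance_note(p):
    """Summarise provenance as one line an agent can carry next to the text."""
    parts = [f"publisher: {p.publisher}"]
    parts.append("domain verified" if p.domain_verified else "domain unverified")
    parts.append(
        "publisher-authored version"
        if p.publisher_authored
        else "version not confirmed by publisher"
    )
    if p.usage_consent is None:
        parts.append("consent not stated")
    else:
        parts.append("consent given" if p.usage_consent else "consent withheld")
    return "; ".join(parts)


verified = Provenance("example.com", True, True, True)
unknown = Provenance("example.org", False, False, None)
print(provenance_note(verified))
print(provenance_note(unknown))
```

With provenance in the document, the agent can weigh or cite a source from data it actually receives.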

Crawl once, serve many

There is an economic shape to all of this, and it also inverts the old one. When each agent fetches each URL for itself, the same popular page gets crawled thousands of times, once per caller, and every one of those fetches lands on the origin server and costs the caller a full extraction. That is the fetch-per-request model, and it scales badly for everyone: expensive for the caller, punishing for the origin, wasteful in aggregate.

An index collapses that. The first reader to want a URL pays for the origin fetch and the extraction; every reader after that is served the same AIDocument from a shared cache. One crawl, many reads. At any real scale that is cheaper per call than re-fetching per request, and it is gentler on the sites being read, because a million agents wanting the same article become close to one visit to the origin rather than a million. The cache is not a performance tweak bolted onto a scraper. It is the thing that makes an index an index: the work of reading the web is done once and amortized across everyone who needs the result, the same economic logic that made a shared search index beat everyone running their own crawler in the 1990s, now applied to reading instead of ranking.
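The amortization can be shown with a toy shared cache. This sketch assumes an in-memory dict and a fake extractor; `SharedIndex` and `fake_fetch_and_extract` are hypothetical names, and a real index would also have to handle freshness and eviction.

```python
class SharedIndex:
    """Serves every reader from one cached extraction per URL."""

    def __init__(self, fetch_and_extract):
        self._fetch_and_extract = fetch_and_extract
        self._cache = {}
        self.origin_fetches = 0

    def read(self, url):
        if url not in self._cache:
            # Only the first reader pays for the origin fetch and extraction.
            self.origin_fetches += 1
            self._cache[url] = self._fetch_and_extract(url)
        return self._cache[url]


def fake_fetch_and_extract(url):
    # Stand-in for the expensive fetch, render, and extract step.
    return f"# Document for {url}"


index = SharedIndex(fake_fetch_and_extract)
url = "https://example.com/popular-article"
reads = 1000
documents = [index.read(url) for _ in range(reads)]

fetch_per_request_cost = reads              # one origin hit per caller
shared_index_cost = index.origin_fetches    # origin hits with a shared cache
print(fetch_per_request_cost, shared_index_cost)
```

A thousand reads of the same URL cost the origin one visit here, against a thousand under fetch-per-request.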

That economic shape is only possible because of the shape of the document. Because every reader consumes the same canonical AIDocument, the cache is meaningful; a thousand bespoke parses could not share a cache entry. Clean shape and shared economics are the same design decision seen from two angles.

Where this is going

The direction this points is toward the AIDocument as an open contract, not a proprietary blob. A shape that agents read is only useful if it is stable and legible: additive changes only, no silent breaking, published where machines can find it. That is the intent behind treating the document as a public contract and describing it in the open, through surfaces like llms.txt and a normative spec, so that "what an agent should expect when it reads the web" becomes something written down rather than something each vendor keeps to itself.

We are early, and honest about what that means. The index is live and growing; the shape is stable and served today; the freshness-by-signal and consent-canonical mechanics are real for domains that have verified. The wider ambition, a single legible standard for how machines read the web, kept fresh by the publishers who authored the pages, is a direction we are building toward, not a finished thing we are claiming to have delivered. We would rather say that plainly than oversell it.

But the core observation holds regardless of how far along the build is. Agents do not browse. They read. An infrastructure that keeps designing pages for eyes and then apologizing to the machines will keep handing readers a document optimized for the wrong consumer. An index that starts from what a reader actually consumes, text, structure, and provenance, in one clean shape, served once and read by many, is not a better scraper. It is a different thing pointed at a different job.


If you build agents that read the web, you can try the shape on your own URLs. The free tier is 2,000 AIDocuments a month, no card required, and every response carries its own measured economics so you can check the token math on the exact pages you care about. Start with what an AIDocument actually is, or read the quickstart.
