July 5, 2026 · publishers

The publisher's guide to AI bots, robots.txt, and getting represented correctly

Who reads your site now, what a well-behaved AI-index bot looks like, how robots.txt and llms.txt apply, and how to verify your domain with Lyrenth for free.

Your traffic logs used to answer one question: which search crawlers came by, and how often. That question is now incomplete. A growing share of the machines reading your pages are not building a search results page; they are answering a question for a person, right now, inside an assistant or an agent. If you publish on the open web, those readers are already part of your audience.

This is a practical guide to that shift: who is reading your site, what a well-behaved AI-index bot should look like, how the old signals (robots.txt) and the new ones (llms.txt) apply, and how to make sure the version of your content that reaches an assistant is the version you authored. Lyrenth's own crawler is the worked example throughout, because we publish exactly how it behaves.

Who is reading your site now

Search engines still crawl; that has not changed. What is new sits alongside them: assistants and agents that fetch a specific page because a user asked a specific question, and indexers that build a clean, machine-readable copy of the web so those assistants do not each crawl you separately.

These readers do not want your layout. They do not render your hero image, click your nav, or see your cookie banner the way a person does. They want the text, the structure, and the provenance: what the page says, how it is organized, and where it came from. A well-designed AI reader strips the rest.

That has a direct consequence. If a page reads well as plain structured text, with a real title, a sensible heading order, and clean markup, it represents you well to these readers. If the meaning only emerges after JavaScript runs, or the important content is buried under interstitials, the machine reader gets a worse version than a human does. The gap between "how my page looks" and "how my page reads" is the thing to manage now.

What a well-behaved AI-index bot looks like

Not every bot that fetches your pages is polite, and not every polite bot tells you who it is. So it is worth being concrete: a responsible AI-index crawler does four things.

It identifies itself honestly, with a stable name. No browser impersonation, no rotating user-agent strings that make it hard to tell one visit from a thousand. Lyrenth's autonomous crawler always presents the same user agent:

AIWebIndex/2.0 (+https://lyrenth.com/bot; AI-readable web index)

When a Lyrenth customer submits a specific URL through the API, that fetch is a separate, user-directed action, and it carries a distinct user agent so your logs can tell them apart:

AIWebIndex-Agent/2.0 (+https://lyrenth.com/bot; user-initiated fetch)

That separation matters. The autonomous crawler is Lyrenth choosing what to fetch and when; a user-initiated fetch is a specific caller asking for your URL through our API, in the same posture as a link previewer or a browser following a click.
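To make that separation visible in your own logs, a minimal sketch along these lines may help. It relies only on the two user-agent strings quoted above; the function names are illustrative, and a matching string shows only what a request claims to be, not that it is genuine.

```python
import re
from collections import Counter

# Product tokens taken from the two user-agent strings quoted above.
# Anchoring on the token before "/" keeps "AIWebIndex" from being
# confused with "AIWebIndex-Agent", which shares the same prefix.
UA_TOKEN = re.compile(r"^(AIWebIndex(?:-Agent)?)/\d")


def classify_user_agent(user_agent: str) -> str:
    """Return 'autonomous', 'user-initiated', or 'other' for a UA string."""
    match = UA_TOKEN.match(user_agent.strip())
    if match is None:
        return "other"
    if match.group(1) == "AIWebIndex-Agent":
        return "user-initiated"
    return "autonomous"


def count_by_kind(user_agents):
    """Tally an iterable of UA strings by claimed kind."""
    return Counter(classify_user_agent(ua) for ua in user_agents)
```

Feeding the user-agent column of an access log to count_by_kind gives a quick split between crawler visits and user-directed fetches.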

It honors robots.txt per RFC 9309. RFC 9309 is the formal specification of the robots.txt rules the web has used informally for decades. Lyrenth's autonomous crawler implements it and honors Disallow rules for its user agent. Beyond the RFC, it respects Crawl-delay, follows Sitemap: directives (both widely used conventions that the RFC itself does not define), and backs off on HTTP 429 and 503 responses. It fetches your robots.txt at most once every 24 hours, and if it cannot read it and has no cached copy, it treats your site as disallowed, which is stricter than the specification requires. It fails closed, not open.
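The caching and fail-closed behavior described above can be sketched with Python's standard robots.txt parser. This is an illustration of the policy, not Lyrenth's implementation: the RobotsCache class, the injected fetch callable, and the clock parameter are all assumptions made for the example.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TTL_SECONDS = 24 * 60 * 60  # refetch at most once every 24 hours


class RobotsCache:
    """Per-origin robots.txt cache that fails closed (hypothetical sketch)."""

    def __init__(self, fetch, clock=time.time):
        # fetch(origin) -> robots.txt body as str, or None if unreadable.
        self._fetch = fetch
        self._clock = clock
        self._entries = {}  # origin -> (fetched_at, RobotFileParser)

    def _parser_for(self, origin):
        cached = self._entries.get(origin)
        now = self._clock()
        if cached and now - cached[0] < ROBOTS_TTL_SECONDS:
            return cached[1]
        body = self._fetch(origin)
        if body is None:
            # Unreadable: fall back to a stale copy if one exists.
            # (A fuller version would also rate-limit these retries.)
            return cached[1] if cached else None
        parser = RobotFileParser()
        parser.parse(body.splitlines())
        self._entries[origin] = (now, parser)
        return parser

    def allowed(self, origin, user_agent, path):
        parser = self._parser_for(origin)
        if parser is None:
            return False  # no readable robots.txt and no cache: disallow
        return parser.can_fetch(user_agent, path)
```

The design choice worth noticing is the last branch: when nothing is known about a site's wishes, the sketch answers "no" rather than "yes".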

It stays polite even when you have not told it to. A good crawler keeps a rate floor on every origin so it never becomes a load problem. Lyrenth holds a minimum 2-second cooldown between requests to the same domain, raised further wherever you declare a Crawl-delay. An index crawler is the opposite of a load spike: it crawls a page once and serves it to many readers, so a thousand agents reading the same URL collapse toward a single fetch against your origin.
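As a rough sketch of that rate floor: the 2-second minimum comes from the text above, while the DomainThrottle class and its method names are assumptions made for illustration. The caller passes in the current time, which keeps the logic easy to test.

```python
MIN_COOLDOWN_SECONDS = 2.0  # the floor stated above


def effective_delay(crawl_delay=None):
    """The declared Crawl-delay can raise the floor but never lower it."""
    if crawl_delay is None:
        return MIN_COOLDOWN_SECONDS
    return max(MIN_COOLDOWN_SECONDS, float(crawl_delay))


class DomainThrottle:
    """Tracks the earliest time each domain may be requested again."""

    def __init__(self):
        self._next_allowed = {}  # domain -> earliest time of next request

    def wait_time(self, domain, now, crawl_delay=None):
        """Seconds to wait before requesting `domain`; reserves the slot."""
        earliest = self._next_allowed.get(domain, now)
        start = max(now, earliest)
        self._next_allowed[domain] = start + effective_delay(crawl_delay)
        return start - now
```

Each domain has its own slot, so a cooldown on one site never delays requests to another.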

It can be verified, so impostors cannot hide behind its name. Anyone can put a known bot's name in a user-agent string, so a trustworthy crawler lets you confirm a request is genuinely theirs. Lyrenth publishes three ways to do that: a Web Bot Auth signature on every request (RFC 9421 HTTP Message Signatures, verifiable against a public key directory), a machine-readable list of published IP ranges at /bot/ip-ranges.json, and forward-confirmed reverse DNS under lyrenth.com on every crawling IP. A request claiming to be Lyrenth from outside those ranges, without a valid signature, is not us.
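Two of those checks, the IP ranges and forward-confirmed reverse DNS, can be sketched with the standard library. Several things here are assumptions: the function names, the idea that you have already extracted a list of CIDR strings from /bot/ip-ranges.json (its exact schema is not described here), and the sample hostname in the usage note. Signature verification is left out of this sketch.

```python
import ipaddress
import socket


def ip_in_ranges(ip, cidrs):
    """True if `ip` falls inside any CIDR block in `cidrs`."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def _reverse_lookup(ip):
    # PTR lookup: IP address -> hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    # A/AAAA lookup: hostname -> set of IP address strings.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def forward_confirmed(ip, suffix="lyrenth.com",
                      reverse=_reverse_lookup, forward=_forward_lookup):
    """Reverse-resolve `ip`, require a name under `suffix`, then confirm
    that the name resolves forward to the same IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    if hostname != suffix and not hostname.endswith("." + suffix):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

The suffix test compares against "." + suffix on purpose, so a name such as crawl.lyrenth.com.attacker.example does not pass. The reverse and forward parameters exist so the check can be exercised with stub lookups instead of live DNS.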

All of this is stated in full, with the exact strings and steps, on the Lyrenth bot page, the address every +https://lyrenth.com/bot link in those user agents points to.

Lyrenth is an indexer, not a model trainer. We fetch public pages to build a fresh, canonical index that we serve to agents with attribution and a link back to you. Lyrenth does not train foundation models on the content we crawl.

robots.txt still works, and here is the simplest control

To keep an AI-index crawler off your site entirely, the oldest tool still works. Add a block for its user agent to your robots.txt:

User-agent: AIWebIndex
Disallow: /

That stops all future autonomous crawling of your domain. It is prospective: it governs what happens next, not what is already indexed, so removing existing content is a separate request. And because RFC 9309 governs automatic clients rather than user-directed requests, a Disallow for the crawler does not by itself stop a user-initiated API fetch that a specific caller triggers; for that, Lyrenth honors a domain-wide opt-out by email. The full menu of controls, and what each one does, is on the crawler policy page.
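Before deploying a rule like this, it can be worth checking that it does what you expect. The sketch below runs the two-line block above through Python's standard parser; the is_allowed helper, the example.com URL, and the ExampleSearchBot name are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line block shown above.
ROBOTS_TXT = """\
User-agent: AIWebIndex
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Parse a robots.txt body and ask whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling is_allowed(ROBOTS_TXT, "AIWebIndex", "https://example.com/pricing") should report the page as blocked, while an unrelated agent name remains allowed. Python's parser matches agent names loosely, so treat its answer for similarly named agents with caution.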

For most publishers, blocking is not the right move, for the same reason blocking search crawlers rarely is: these readers are how a growing part of your audience finds and quotes you. The point is that the control exists and is honored.

llms.txt in one paragraph

You may have seen llms.txt mentioned as a new file to add. The idea is small and useful: a plain, machine-readable file at the root of your domain that gives an AI reader a compact map of what matters on your site, so it does not have to infer your structure by crawling everything. It is a hint, not an enforcement mechanism: a table of contents written for machines. Lyrenth publishes its own at /llms.txt so you can see the shape. Adding one is a reasonable low-effort step, but it is not where the real leverage is. The real leverage is authorship.
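For a sense of how little is involved, here is a small sketch that renders such a file. The layout (a title line, a one-line summary, then sections of annotated links) follows the commonly circulated llms.txt proposal as we understand it, so treat the exact format as an assumption; the site name, URLs, and the render_llms_txt function are hypothetical.

```python
def render_llms_txt(title, summary, sections):
    """Render a compact site map for machine readers.

    `sections` maps a heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_llms_txt(
        "Example Publisher",
        "Independent reviews of kitchen tools.",
        {"Guides": [("Knife care",
                     "https://example.com/guides/knife-care",
                     "Sharpening and storage")]},
    ))
```

The result is meant to be saved as llms.txt at the root of the domain and kept short: a map of what matters, not a full sitemap.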

Why representation matters more than any single file

Here is the situation you are actually in. AI agents already crawl your pages and extract whatever they can, however they like. Sometimes that is clean. Often it is not: a stale price, a stripped-out table, the wrong opening hours, a summary built from the half of the page that happened to render. You have no say over any of it. The version that reaches an assistant is one a stranger's parser produced.

Verification flips that. When you verify your domain with Lyrenth, you author the canonical AIDocument that every agent receives when it reads you through Lyrenth. Your dashboard shows what AI currently makes of each page, and when something is wrong, you push a correction and the version agents read is fixed in seconds. This is additive, not exclusive: the labs and assistants can still read you directly. What changes is that there is now an authoritative version, controlled by you, and it reaches your origin as a single cached fetch rather than a fresh crawl per agent.

How verification works with Lyrenth

It is free, takes about a minute, and requires no changes to your stack. You prove you control the domain with one of two standard methods:

A DNS TXT record at your DNS provider. The recommended path; usually propagates in under a minute. Our verifier queries several public resolvers in parallel so a slow or misconfigured local resolver does not hold you up.
A .well-known file served over HTTPS at your domain. HTTPS only; plain HTTP fetches are rejected.
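The two checks can be sketched as follows. This is an illustration under stated assumptions, not Lyrenth's verifier: the function names are invented, the resolvers are passed in as plain callables because the standard library has no TXT lookup, and the token value and file path in the usage note are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlsplit


def txt_record_seen(domain, token, resolvers):
    """True as soon as any resolver reports a TXT value equal to `token`.

    `resolvers` is a list of callables: resolver(domain) -> iterable of
    TXT strings. They are queried in parallel.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(resolvers)))
    try:
        futures = [pool.submit(resolver, domain) for resolver in resolvers]
        for future in as_completed(futures):
            try:
                if token in set(future.result()):
                    return True
            except Exception:
                continue  # one failing resolver must not block the others
        return False
    finally:
        # Do not wait for slow resolvers once the answer is known.
        pool.shutdown(wait=False, cancel_futures=True)


def require_https(url):
    """Reject anything that is not an https:// URL."""
    if urlsplit(url).scheme != "https":
        raise ValueError("verification file must be served over HTTPS")
    return url
```

For example, require_https("https://example.com/.well-known/example-verification.txt") returns the URL unchanged, while the same address over plain http raises ValueError.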

Pick whichever propagates first. Once ownership is confirmed, Lyrenth resolves your public pages into AIDocuments and your dashboard fills in. You can start verification from the bot page or head straight to adding a site.

One thing we say plainly because it is the deal: verifying your domain also places it in the Verified Index, the consent-backed subset of the corpus Lyrenth can license to AI labs. There is no revenue share. The free owner toolkit (verification, the dashboard, AI Readiness, corrections, and the change signal) is the consideration. You can leave anytime in settings, though copies already delivered to a licensee cannot be recalled. The full mechanics are in the Terms.

The AI Readiness score is not an SEO score

When your dashboard populates, it shows an AI Readiness score from 0 to 10. Be clear about what this is, because it looks like the kind of number an SEO tool would give you, and it is not.

AI Readiness measures one thing: how cleanly a page can be consumed by a non-browser reader. It is built from signals like whether the page renders as server HTML or hides content behind a JavaScript shell, whether it has a real title and description, whether structured data is present, and whether headings and links are used in a sane, semantic way. It aggregates to a per-domain score.
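To give a feel for the kind of signals involved, here is a rough sketch that inspects raw HTML the way a non-browser reader would, without running any scripts. It is not Lyrenth's scoring code and deliberately produces no 0 to 10 number; the class name, function name, and the specific checks are assumptions chosen to mirror the signals listed above.

```python
from html.parser import HTMLParser


class SignalCollector(HTMLParser):
    """Collects a few surface signals from server-delivered HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.has_description = False
        self.has_json_ld = False
        self.heading_levels = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            if (attrs.get("content") or "").strip():
                self.has_description = True
        elif tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self.has_json_ld = True
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.heading_levels.append(int(tag[1]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def collect_signals(html):
    """Return a dict of boolean signals for one HTML document."""
    collector = SignalCollector()
    collector.feed(html)
    collector.close()
    levels = collector.heading_levels
    # A skip is a jump of more than one level down, e.g. h1 straight to h3.
    skips = any(b - a > 1 for a, b in zip(levels, levels[1:]))
    return {
        "has_title": bool(collector.title.strip()),
        "has_description": collector.has_description,
        "has_structured_data": collector.has_json_ld,
        "headings_in_order": bool(levels) and levels[0] == 1 and not skips,
    }
```

A page that is only an empty container plus a script tag fails every check here, which is the JavaScript-shell case described above.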

What it is not: it is not a ranking factor, not a content-quality judgment, and not a homework list to please a search engine. It measures the surface contract between your page and a machine that reads text rather than pixels. A page can be excellent journalism and still score low because it only assembles itself after a script runs; a thin page can score well because its markup is clean. The score answers a narrow, honest question: when a machine reads this page as structured text, how much of the meaning survives?

Lyrenth shows you what it measured on the content it actually serves, not a list of demands. And the fixes it surfaces (exposing content as server HTML, adding a real title and description, including structured data) tend to help every machine reader you have, not just Lyrenth.

Where to start

If you take one action from this: verify your domain. It is free, takes a minute, and moves you from being represented by whatever a parser guessed to authoring the canonical version yourself. Read the Lyrenth bot page to see exactly how our crawler identifies itself, then verify and watch your dashboard fill in. For the deeper picture of why machine readers behave so differently from human ones, and what that inverts about indexing, see the companion piece: agents don't browse, they read.

Reading through Lyrenth as a developer instead? The free tier is 2,000 AIDocuments a month, no card required. Verifying your domain as a publisher is separate, and free forever.