Crawler policy.
Last updated: June 10, 2026
Who we are
Lyrenth is operated by Aleksma AI Inc., a Delaware corporation.
The crawler and index are primarily hosted on Hetzner infrastructure in Frankfurt and Falkenstein, Germany.
Contact: hello@lyrenth.com
Two retrieval surfaces
Lyrenth fetches URLs in two distinct ways, and they behave differently. We disclose both here so your server logs always make sense.
1. The Lyrenth crawler (autonomous). Our background indexer chooses what to fetch and when, from sitemaps, link graphs, and prior fetches. Because Lyrenth is the actor, the crawler fully implements RFC 9309: it honors your robots.txt, and a Disallow for our user-agent means we do not crawl those paths. Everything in “What we respect” below applies to this surface.
2. User-initiated API fetches. When a Lyrenth customer submits a specific URL to our API, the customer (not Lyrenth) chooses the URL and the timing, and we act as their retrieval tool, in the same posture as user-initiated archive tools, link previewers, and browsers. By default, robots.txt is the calling customer's compliance responsibility under our Terms, not enforced by Lyrenth, consistent with RFC 9309's scope (it governs automatic clients, not user-initiated requests). These fetches use a distinct User-Agent (see “User-Agents” below) so you can always tell the two apart in your logs.
If a Disallow rule is in place for our crawler, you may still occasionally see user-initiated fetches. That is not the crawler ignoring your robots.txt; it is a specific person or agent requesting your URL through our API. If you want neither surface to touch your domain, use the email opt-out below, which applies to both.
We additionally throttle or refuse API fetches against any origin when aggregate customer traffic risks disrupting it, and we maintain a domain refusal list, honored across all customers, for origins that request it.
What we do
Lyrenth fetches publicly accessible web pages on behalf of AI systems and developers. We:
- parse structured data,
- resolve pages into the canonical AIDocument format,
- and cache AIDocuments to minimize unnecessary repeated requests.
If 1,000 AI agents request the same URL, our goal is to reduce that to a minimal number of origin fetches whenever technically possible. Our cache is designed to reduce total load on your origin, not add to it.
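The effect of the cache can be illustrated with a minimal sketch. This is not Lyrenth's implementation: the DocumentCache class, the fetch_origin callback, and the 300-second freshness window are all hypothetical, chosen only to show how many agent requests for one URL can collapse into a single origin fetch.

```python
import time
from typing import Callable, Dict, Tuple


class DocumentCache:
    """Hypothetical read-through cache keyed by URL (illustration only)."""

    def __init__(
        self,
        fetch_origin: Callable[[str], str],
        ttl_seconds: float = 300.0,  # assumed freshness window, not from the policy
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_origin = fetch_origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.origin_fetches = 0  # how often the origin was actually contacted

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # served from cache, the origin is not touched
        self.origin_fetches += 1
        document = self._fetch_origin(url)
        self._entries[url] = (now, document)
        return document


# 1,000 agent requests for the same URL reach the origin once.
cache = DocumentCache(lambda url: f"AIDocument for {url}")
for _ in range(1000):
    cache.get("https://example.com/article")
print(cache.origin_fetches)  # prints 1
```

A production cache would also need revalidation and handling for concurrent misses; the sketch leaves both out.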
Verify to author your canonical version
AI agents already crawl your pages and extract them however they like. Verifying your domain lets you take the wheel: you author the canonical AIDocument every agent gets when it reads you through Lyrenth, your dashboard shows exactly what AI currently makes of each page, and you can push a change so the version agents read is corrected in seconds.
Verification is also a licensing grant; we say this plainly. Verifying a domain places it in our Verified Index and grants Lyrenth the right to license its canonical content to AI labs and model companies, including for payment. Lyrenth retains the proceeds; there is no revenue share. The free owner toolkit (verification, dashboard, AI Readiness, corrections, change webhook) is the consideration for that grant. The full mechanics, including how to leave and what leaving does and does not undo, are in Terms §8.2.
This is additive, not exclusive. The labs can keep reading you directly. Agents that read through Lyrenth are served from our cache, so that demand reaches your origin as a single fetch rather than a fresh crawl per agent.
What we respect today
These commitments apply to the autonomous crawler (surface 1 above).
Our crawler implements RFC 9309 (the formal robots.txt specification) and honors:
- Disallow rules inside robots.txt (RFC 9309 section 2.2.2).
- Crawl-delay directives (a common extension that RFC 9309 itself does not define).
- Sitemap: directives inside robots.txt.
- HTTP 429 backoff and HTTP 503 temporary unavailability responses.
- Per-domain rate caps with a 2-second cooldown floor, raised by Crawl-delay where declared.
- Machine-readable reservations of TDM / AI-training rights: TDM Reservation Protocol meta tags (tdm-reservation, tdmrep) and robots-meta noai / noimageai / notrain / noml directives. These exclude a domain from corpus licensing entirely, worldwide.
If we cannot read your robots.txt at all and have no cached copy, we treat the site as disallowed, stricter than RFC 9309 requires. We fetch a site's robots.txt at most once per 24 hours.
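As a rough illustration of how the rules in this section combine, the sketch below uses Python's standard urllib.robotparser. It is an assumption-laden example, not our crawler's code: the plan_fetch helper is hypothetical, and the only values taken from this policy are the AIWebIndex token, the 2-second floor, and the fail-closed behavior when no robots.txt is available.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser

CRAWLER_TOKEN = "AIWebIndex"   # product token named in this policy
COOLDOWN_FLOOR_SECONDS = 2.0   # per-domain floor stated in this policy


def plan_fetch(robots_txt: Optional[str], url: str) -> Tuple[bool, float]:
    """Return (allowed, delay_seconds) for one URL. Hypothetical helper.

    robots_txt is the file body, or None when it could not be read
    and no cached copy exists.
    """
    if robots_txt is None:
        # Fail closed: an unreadable robots.txt is treated as a full disallow.
        return False, COOLDOWN_FLOOR_SECONDS

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    allowed = parser.can_fetch(CRAWLER_TOKEN, url)
    declared = parser.crawl_delay(CRAWLER_TOKEN)  # None when not declared
    # Crawl-delay can raise the delay above the floor, never lower it.
    delay = max(COOLDOWN_FLOOR_SECONDS, float(declared or 0))
    return allowed, delay


robots = "User-agent: AIWebIndex\nCrawl-delay: 5\nDisallow: /private/\n"
print(plan_fetch(robots, "https://example.com/private/report"))
print(plan_fetch(robots, "https://example.com/blog/post"))
print(plan_fetch(None, "https://example.com/"))
```

Note that urllib.robotparser only reads integer Crawl-delay values and applies rules in file order, so it is a convenient approximation rather than a full RFC 9309 matcher.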
We do not attempt to bypass paywalls, login requirements, authentication systems, or technical access controls, on either surface.
User-Agents
Our autonomous crawler identifies itself as:
User-initiated API fetches identify themselves as:
A Disallow rule for AIWebIndex governs the crawler. Per RFC 9309, user-initiated fetches are outside robots.txt scope; to stop those too, use the email opt-out below.
Verifying it's really us
Anyone can put our name in a User-Agent string. To confirm a request genuinely came from Lyrenth:
- We publish our crawler and fetcher IP ranges in machine-readable form at https://lyrenth.com/bot/ip-ranges.json, updated whenever our infrastructure changes.
- Requests from outside those ranges claiming our User-Agent are impostors; we'd appreciate a report at hello@lyrenth.com.
This also makes firewall decisions precise: allowlist or blocklist our published ranges rather than guessing at Hetzner address space.
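A check against the published ranges might look like the sketch below. The JSON layout shown (a top-level "prefixes" list of CIDR strings) is an assumption made for illustration; inspect the actual file before relying on it. The sample prefixes are reserved documentation ranges, not our real addresses, and both helper functions are hypothetical.

```python
import ipaddress
import json
from typing import List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Assumed payload shape, using documentation-only prefixes (RFC 5737 / RFC 3849).
SAMPLE_RANGES_JSON = '{"prefixes": ["192.0.2.0/24", "2001:db8::/32"]}'


def load_networks(payload: str) -> List[Network]:
    """Parse the (assumed) JSON payload into network objects."""
    data = json.loads(payload)
    return [ipaddress.ip_network(prefix) for prefix in data["prefixes"]]


def is_published_address(remote_ip: str, networks: List[Network]) -> bool:
    """True when remote_ip falls inside one of the published ranges."""
    address = ipaddress.ip_address(remote_ip)
    return any(
        address.version == network.version and address in network
        for network in networks
    )


networks = load_networks(SAMPLE_RANGES_JSON)
print(is_published_address("192.0.2.77", networks))    # prints True
print(is_published_address("198.51.100.1", networks))  # prints False
```

In practice you would download the file periodically, cache it, and run this check only on requests whose User-Agent claims to be ours.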
Data we extract
We may extract: public HTML, visible page content, OpenGraph metadata, JSON-LD structured data, schema.org metadata, and publicly available semantic markup.
We do not intentionally extract: passwords, authenticated content, private user data, or content behind login walls.
Opt-out options
Different mechanisms do different things. Here is exactly what each one achieves:
robots.txt
Effect: stops all future autonomous crawling of your domain. Prospective only: content already in the index remains until you also request removal (below). Does not govern user-initiated API fetches.
AI/TDM rights reservation
A TDM Reservation Protocol meta tag (tdm-reservation / tdmrep) or a robots-meta noai / notrain directive on your pages.
Effect: excludes your domain from corpus licensing entirely, worldwide, in addition to whatever crawling restrictions you declare. This is the strongest machine-readable signal we honor.
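To check whether a page of yours carries one of these signals, a small scanner along the following lines can help. It is a sketch under stated assumptions: it looks for a tdm-reservation meta tag with content "1" (the form used by the TDM Reservation Protocol) and for the robots-meta tokens listed in this policy. It does not claim to mirror how Lyrenth parses pages, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# robots-meta tokens named in this policy
AI_OPT_OUT_TOKENS = {"noai", "noimageai", "notrain", "noml"}


class ReservationScanner(HTMLParser):
    """Hypothetical scanner: records whether a page reserves AI/TDM rights."""

    def __init__(self) -> None:
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        name = attributes.get("name", "").strip().lower()
        content = attributes.get("content", "").strip().lower()
        if name == "tdm-reservation" and content == "1":
            self.reserved = True
        elif name == "robots":
            tokens = {token.strip() for token in content.split(",")}
            if tokens & AI_OPT_OUT_TOKENS:
                self.reserved = True


def reserves_ai_rights(html: str) -> bool:
    scanner = ReservationScanner()
    scanner.feed(html)
    return scanner.reserved


page = '<html><head><meta name="robots" content="noindex, noai"></head></html>'
print(reserves_ai_rights(page))  # prints True
```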
Email request
Domain owners may request a domain-wide opt-out at hello@lyrenth.com.
Effect: the strongest option overall. We stop future crawling, suppress your existing content from the API, search, and future corpus deliveries, and add your domain to the refusal list that applies to user-initiated API fetches as well. Removal is prospective: it cannot recall datasets already delivered to a licensee or affect models already trained. See our Privacy policy for the full removal mechanics.
Firewall blocking
Block our published IP ranges (/bot/ip-ranges.json).
Effect: prevents fetches at the network level, both surfaces. Blunt but absolute.
Verified owner controls
Verified site owners may manage per-path exclusions through dashboard controls where available.
Effect: path-level control over what the index serves and what enters the Verified Index corpus, without removing the whole domain.
Abuse reporting
If you believe our crawler is ignoring robots.txt, causing excessive traffic, hammering your infrastructure, or behaving unexpectedly, contact hello@lyrenth.com. Include log excerpts with timestamps and source IPs if you can; that lets us distinguish our traffic from impostors immediately. We actively monitor abuse reports and aim to respond within 24 hours.
Operational reference
robots.txt: allow our crawler explicitly
(Your rules for other crawlers are unaffected; only add a User-agent: * block if you intend to set policy for all bots.)
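If you want to confirm how a robots.txt group for our crawler evaluates before deploying it, Python's standard urllib.robotparser gives a quick approximation. The two robots.txt bodies below are illustrative samples built around the AIWebIndex token, and crawler_may_fetch is a hypothetical helper; treat the result as a sanity check, since this parser applies rules in file order rather than by the longest-match rule of RFC 9309.

```python
from urllib.robotparser import RobotFileParser

# Illustrative sample: a group that explicitly allows the crawler.
ALLOW_CRAWLER = """\
User-agent: AIWebIndex
Allow: /
"""

# Illustrative sample: a group that opts the whole site out of crawling.
BLOCK_CRAWLER = """\
User-agent: AIWebIndex
Disallow: /
"""


def crawler_may_fetch(robots_txt: str, url: str, token: str = "AIWebIndex") -> bool:
    """Evaluate one URL against a robots.txt body for the given token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


print(crawler_may_fetch(ALLOW_CRAWLER, "https://example.com/blog/post"))  # prints True
print(crawler_may_fetch(BLOCK_CRAWLER, "https://example.com/blog/post"))  # prints False
```

Neither sample contains a User-agent: * group, so other bots are not matched by these rules.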
Verification methods
Domain owners may verify ownership through DNS TXT verification, meta tag verification, or dashboard verification workflows.