July 5, 2026 · infrastructure · agents

Why your scraper gets empty HTML from JavaScript sites (and what rendering actually takes)

Fetch a React or SPA site and you get a hollow div, not content. Here is why JavaScript sites return empty HTML, and what real rendering takes at index scale.

You write the obvious thing. You curl a URL, or you fire an HTTP request from your scraper, and you expect the page back. Instead you get a shell: a <head> full of meta tags, a <body> with one empty <div>, a <noscript> line telling you to enable JavaScript, and none of the words a human sees when they open that same URL in a browser. The content is simply not in the response.

This is the wall almost everyone hits the first time they try to read the modern web programmatically. It is not a bug in your code; it is how a large slice of the web is built now. This post is about why it happens, and what it actually takes to get the real content out, which turns out to be a much bigger project than "fetch the URL."

The failure, on a real site

Let me not hand-wave this. Here is a genuine, well-known single-page app. Excalidraw is a popular open-source whiteboard tool. Open it in a browser and you get a full drawing app with menus, prompts, and a canvas. Fetch it with curl and here is the entire <body> you get back, trimmed only to collapse the inline bootstrap script:

<body>
  <noscript>You need to enable JavaScript to run this app.</noscript>
  <header><h1 class="visually-hidden">Excalidraw</h1></header>
  <div id="root"></div>
  <script>/* app bootstrap: hydration, CDN redirect */</script>
</body>

That is it. The whole raw document is about 6.9 KB, and the <body> is roughly a kilobyte of that, most of which is script. There is a <div id="root"> and it is empty. The page is politely telling you the truth in that <noscript> line: without JavaScript, there is no app here. Your scraper is JavaScript-disabled by definition, because a raw HTTP fetch does not run scripts. So your extractor, your readability library, your regex over the markup, whatever came next, has nothing to work with.
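To make "nothing to work with" concrete, here is a minimal sketch of what a script-less extractor sees in that body. It uses Python's standard html.parser; the VisibleText class and visible_words helper are names invented for this illustration, not part of any library, and skipping <noscript> is a choice made here because that line is a fallback notice rather than page content.

```python
from html.parser import HTMLParser

# The trimmed <body> shown above, as a flat fetch returns it.
RAW_BODY = """<body>
  <noscript>You need to enable JavaScript to run this app.</noscript>
  <header><h1 class="visually-hidden">Excalidraw</h1></header>
  <div id="root"></div>
  <script>/* app bootstrap: hydration, CDN redirect */</script>
</body>"""


class VisibleText(HTMLParser):
    """Collects words outside script, style, noscript, and template tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped tag
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.words.extend(data.split())


def visible_words(html):
    """Return the list of content words a flat, script-less reader can see."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words


print(visible_words(RAW_BODY))  # ['Excalidraw']
```

One word, and it is the visually hidden heading. Everything a human sees in the app lives behind the empty root node.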

Now read the same URL through the Lyrenth reader, which renders the page the way a browser would before extracting. Here is the genuine response header and the top of the body:

# Excalidraw — Collaborative whiteboarding made easy
> Excalidraw is a virtual collaborative whiteboard tool that lets you
> easily sketch diagrams that have a hand-drawn feel to them.
Source: https://excalidraw.com
status 200 · render rendered · 46 words · ~76 tokens · 99% smaller than raw HTML

The title and description are populated. The render field says rendered, meaning the page was executed in a headless browser before the content was pulled, not fetched flat. You get an AIDocument, one clean shape with a Markdown body and structure, instead of a hollow div. (Method: both outputs captured live in July 2026. The raw HTML is the actual curl response for https://excalidraw.com; the rendered line is the economics header the Lyrenth reader returned for the same URL.)

The whiteboard is a deliberately extreme case: almost the entire page is client-rendered, so the contrast is stark. But the same mechanism sits under a huge amount of ordinary content: docs sites, dashboards, marketing pages, product catalogs. When the body arrives empty, this is why.

Why the page needs JavaScript at all

To fix the failure you have to understand it, and there is not one cause. There are several, and they stack.

Client-side rendering. In the classic server-rendered model, the server assembles the full HTML and sends it down complete. In the client-side model, the server sends a near-empty shell plus a JavaScript bundle, and the browser builds the DOM after the bundle loads and runs. That is the <div id="root"></div> you saw. The content exists, but it is assembled on the client, from data the bundle fetches after it boots. A flat HTTP fetch stops before any of that happens.

Hydration and lazy data. Even pages that do send some server-rendered markup often fill the important parts in afterward. The shell paints, the app "hydrates," makes one or more API calls for the actual data, and then renders that data into the DOM. If you read the HTML the moment it arrives, you catch the page mid-boot: skeletons, spinners, and placeholder boxes where the content will be. Reading too early looks like success, because you get a 200 and some HTML. It is just the wrong HTML.

Shadow DOM and web components. Some sites build their UI out of custom elements that keep their content inside a shadow root. When you serialize that page to HTML the normal way, the shadow trees do not come along, so the markup you capture is structurally present but textually hollow. The tags are there; the words are not. A naive extractor sees valid HTML with almost no readable text and gives up or, worse, returns confident nonsense.
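A toy model makes the "tags are there; the words are not" effect easy to see. This is a sketch under loud assumptions: the Element class below is an invented in-memory stand-in, not the browser's DOM API, and it ignores slot distribution entirely. serialize_light mimics a serialization that skips shadow roots; flatten_text walks into them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element with an optional shadow root."""
    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)
    shadow_root: Optional[List[Union["Element", str]]] = None


def serialize_light(node):
    """Serialize light-DOM children only; shadow roots are left out."""
    if isinstance(node, str):
        return node
    inner = "".join(serialize_light(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


def flatten_text(node):
    """Collect text from shadow roots as well as light-DOM children."""
    if isinstance(node, str):
        return [node] if node.strip() else []
    found = []
    for child in node.shadow_root or []:
        found.extend(flatten_text(child))
    for child in node.children:
        found.extend(flatten_text(child))
    return found


# A custom element whose only text lives inside its shadow root.
page = Element("article", [
    Element("price-card", [], shadow_root=[Element("span", ["$19 per month"])]),
])

print(serialize_light(page))  # <article><price-card></price-card></article>
print(flatten_text(page))     # ['$19 per month']
```

The serialized markup is valid and structurally complete, and it contains no words at all. Only the walk that descends into the shadow root recovers the text.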

Consent managers and overlays. Even when the content does render, something is often sitting on top of it. A cookie or consent banner injects itself late, frequently from a third-party script, and it can cover the page or hold rendering behind an interaction. Extract at that moment and your "article" is a privacy notice and two buttons. Consent overlays are close to universal on commercial sites now, and they are engineered to be hard to ignore, which is exactly what makes them hard to strip programmatically.

Any one of these turns a simple fetch into an empty or wrong page. In the wild they combine: a client-rendered site with a consent overlay that lazy-loads its main content after hydration and also uses a few web components. That is a completely normal page in 2026.

What real rendering takes

The fix, conceptually, is easy to state: run the page in a real browser, wait until it is actually done, clean up what the browser leaves, then extract. Each of those clauses hides a genuine engineering problem, and we run all of them at index scale, so here is the honest version of what it costs.

You need a headless browser, not a fetcher. To get the content a browser produces, you have to be a browser: load the shell, run the bundle, execute the fetches the app makes, let it paint. That means driving a real rendering engine, not sending an HTTP request. A headless browser is heavy. It uses far more CPU and memory than a fetch, it can crash or hang, and running a fleet of them reliably is a standing operational commitment. This is the single biggest reason "just scrape it" is harder than it looks: the moment you need JavaScript, your per-page cost jumps by a large multiple, and you own a browser-farm reliability problem you did not have before.

You have to decide when a page even needs it. Rendering everything is wasteful, because plenty of pages are still server-rendered and a plain fetch is complete and cheap. So the first real decision is per-page: does this URL need a browser, or is the flat HTML already the whole story? Render too eagerly and you burn browser time on pages that never needed it; render too rarely and you ship hollow pages. The signal is not always obvious from the first bytes, so this is a category of problem, not a one-line heuristic.
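For a feel of what a first-pass signal might look like, here is a deliberately crude sketch. Everything in it is an assumption made for illustration: the mount-node ids (root, app), the 50-word threshold, the two-of-three vote, and the function names. It is a starting point for the category of problem described above, not a complete answer to it.

```python
import re

# Assumed signals: an empty mount node, a <noscript> JavaScript notice,
# and very little visible text once scripts and styles are removed.
MOUNT_RE = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>', re.IGNORECASE
)
NOSCRIPT_JS_RE = re.compile(
    r"<noscript>[^<]*javascript[^<]*</noscript>", re.IGNORECASE
)
STRIP_BLOCKS_RE = re.compile(
    r"<(script|style|noscript)\b.*?</\1>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")


def render_signals(html):
    """Compute cheap signals from flat HTML without running any scripts."""
    text = TAG_RE.sub(" ", STRIP_BLOCKS_RE.sub(" ", html))
    return {
        "empty_mount": bool(MOUNT_RE.search(html)),
        "noscript_hint": bool(NOSCRIPT_JS_RE.search(html)),
        "word_count": len(text.split()),
    }


def needs_browser(html, min_words=50):
    """Vote: render in a browser when at least two of three signals fire."""
    signals = render_signals(html)
    votes = (
        int(signals["empty_mount"])
        + int(signals["noscript_hint"])
        + int(signals["word_count"] < min_words)
    )
    return votes >= 2


shell = (
    "<body><noscript>You need to enable JavaScript to run this app.</noscript>"
    '<header><h1 class="visually-hidden">Excalidraw</h1></header>'
    '<div id="root"></div><script>/* bootstrap */</script></body>'
)
article = "<body><article><h1>Title</h1><p>" + "word " * 60 + "</p></article></body>"

print(needs_browser(shell))    # True
print(needs_browser(article))  # False
```

A heuristic like this will misjudge server-rendered pages that hydrate later and short pages that are legitimately short, which is exactly why the decision deserves more than one line.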

Settle timing is the hard part. Once you commit to rendering, you have to decide when the page is done. Fire the extract too early and you capture spinners and skeletons. Wait a fixed long time on every page and you have throttled your whole system to the speed of the slowest page, which at scale is ruinous. The right behavior is adaptive: watch for the signals that the DOM has stopped changing and the content-bearing elements have appeared, and settle as soon as they have, not on a fixed timer. Different sites reach "done" through different paths, so a settle policy that works on one class of page will read another too early. This looks trivial in a demo of one URL and becomes the whole game across many.
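The adaptive idea can be sketched as a polling loop that returns as soon as the page has been quiet for a few polls and the content is present, with a timeout as the backstop. This is a minimal sketch, not a production settle policy: snapshot and is_ready are hypothetical hooks you would wire to a real browser, the default numbers are arbitrary, and fake_page exists only to simulate a page for the example.

```python
import time


def wait_until_settled(snapshot, is_ready, *, quiet_polls=3, interval=0.1,
                       timeout=10.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until the page is quiet and ready. Returns (settled, last_snapshot).

    snapshot: callable returning a comparable fingerprint of the current DOM.
    is_ready: callable that takes a fingerprint and returns True once the
              content-bearing elements are present.
    """
    deadline = clock() + timeout
    previous = snapshot()
    stable = 0  # consecutive polls with an unchanged fingerprint
    while True:
        if stable >= quiet_polls and is_ready(previous):
            return True, previous
        if clock() >= deadline:
            return False, previous
        sleep(interval)
        current = snapshot()
        stable = stable + 1 if current == previous else 0
        previous = current


def fake_page(frames):
    """Simulated page: returns each frame once, then repeats the last one."""
    state = {"i": 0}

    def snapshot():
        frame = frames[min(state["i"], len(frames) - 1)]
        state["i"] += 1
        return frame

    return snapshot


# Simulated boot sequence; sleeping is stubbed out so the example runs instantly.
settled, dom = wait_until_settled(
    fake_page(["spinner", "spinner", "half", "full"]),
    is_ready=lambda frame: frame == "full",
    sleep=lambda seconds: None,
)
print(settled, dom)  # True full
```

Note that the two spinner frames are stable but not ready, so the loop keeps waiting. Requiring both conditions is what stops a quiet skeleton screen from being mistaken for a finished page.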

You have to strip what the browser leaves behind. A rendered page includes everything the browser drew, and that is more than the content. The consent overlay is now in the DOM, along with the nav, the footer, the newsletter modal, and the sticky share bar. Rendering got you the content; it also got you every piece of chrome around it. So after the render you still have to strip overlays and boilerplate to recover the actual document, and it is adversarial work, because consent tools and modals are designed to be sticky. Web components add another layer: you have to flatten shadow trees so their text is actually present in what you extract, or you are back to hollow markup even after a successful render.
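Here is a small sketch of the stripping step over a toy tree. The Node class is an invented stand-in for a rendered DOM, and the tag list and substring hints are assumptions chosen for the example. Real consent tools do not label themselves this helpfully, and substring matching will produce false positives (a class like "shared-layout" contains "share").

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Node:
    """Hypothetical stand-in for an element in the rendered DOM."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Node", str]] = field(default_factory=list)


CHROME_TAGS = {"nav", "footer", "aside"}
CHROME_HINTS = ("cookie", "consent", "newsletter", "share", "modal")


def is_chrome(node):
    """Guess whether a subtree is page chrome rather than content."""
    if node.tag in CHROME_TAGS or node.attrs.get("role") == "dialog":
        return True
    label = (node.attrs.get("id", "") + " " + node.attrs.get("class", "")).lower()
    return any(hint in label for hint in CHROME_HINTS)


def strip_chrome(node):
    """Return a copy of the tree with chrome subtrees removed."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
        elif not is_chrome(child):
            kept.append(strip_chrome(child))
    return Node(node.tag, dict(node.attrs), kept)


def text_of(node):
    """Join the text of a subtree with single spaces."""
    if isinstance(node, str):
        return node.strip()
    parts = [text_of(child) for child in node.children]
    return " ".join(part for part in parts if part)


rendered = Node("body", {}, [
    Node("nav", {}, ["Home Pricing Docs"]),
    Node("div", {"id": "cookie-consent"}, ["We value your privacy", Node("button", {}, ["Accept"])]),
    Node("main", {}, [Node("article", {}, [
        Node("h1", {}, ["Settle timing"]),
        Node("p", {}, ["Wait for the DOM to go quiet."]),
    ])]),
    Node("div", {"class": "sticky share-bar"}, ["Share this post"]),
    Node("footer", {}, ["Copyright"]),
])

print(text_of(strip_chrome(rendered)))  # Settle timing Wait for the DOM to go quiet.
```

Without the stripping pass, the same tree reads as navigation, a privacy notice, the article, a share prompt, and a copyright line, in that order.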

None of these is exotic on its own. The cost is doing all of them, correctly, on every page, at volume, and keeping the browser fleet healthy while you do. That is the difference between a script that reads one page and infrastructure that reads the web.

How Lyrenth handles it so you do not have to

The point of the AIDocument is that all of the above happens on our side, once, and you receive the result. When you read a URL through Lyrenth, the reader decides whether the page needs a browser, renders it when it does, waits for the content to settle rather than for a fixed timer, strips the consent overlays and boilerplate, flattens shadow trees so web-component content is real text, and hands back a clean Markdown body with structure. That is what the render: rendered field in the Excalidraw response meant: this page needed a browser, we ran one, and here is the content a flat fetch could never have given you.

You do not maintain a browser farm, tune settle timers per site, or chase the newest consent-banner script. You send a URL and get an AIDocument back, the same clean shape whether the source was static HTML or a fully client-rendered app. And because reads resolve through a shared index, the render is amortized: the expensive part happens once for a URL, and subsequent reads come back from the cache rather than paying for another headless render.
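The amortization argument is easy to see in miniature. The sketch below is a plain in-memory cache keyed by URL. It is not a description of how the Lyrenth index is implemented; ReadCache and fake_render are hypothetical names, and a real cache would also need expiry and URL normalization.

```python
class ReadCache:
    """Pay for the expensive render once per URL, then serve the stored result."""

    def __init__(self, render):
        self._render = render  # the expensive step, e.g. a headless render
        self._store = {}
        self.renders = 0       # how many times the expensive step actually ran

    def read(self, url):
        if url not in self._store:
            self.renders += 1
            self._store[url] = self._render(url)
        return self._store[url]


def fake_render(url):
    """Stand-in for a headless render; returns a placeholder document."""
    return f"# rendered document for {url}"


cache = ReadCache(fake_render)
for _ in range(1000):
    cache.read("https://excalidraw.com")

print(cache.renders)  # 1
```

A thousand reads, one render. The per-read cost of the browser shrinks toward zero as the same URL is read more often.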

For the full picture of the shape you get back, including the fields and the stable contract, see "what an AIDocument is." And if the reason you are reading pages is to feed them to a model, the token side of this (why raw HTML wrecks a context window and how the clean shape fixes it) is covered in "how to feed web pages to an LLM without blowing the context window."

The takeaway

An empty <div id="root"> is not a failure of your scraper. It is a modern web page telling you it is built to be run, not fetched. Getting real content out of it means being a browser, deciding when to be one, knowing when the page is done, and cleaning up everything the browser drags in with the content. That is a real infrastructure project, and not the one you set out to build.

The whole idea of the AIDocument is that you get to skip it. Point the reader at a JavaScript-heavy URL and see the rendered marker come back with real content behind it. The free tier is 2,000 AIDocuments a month, no card, so you can test it against the exact SPA that has been handing your scraper a hollow div.