Do AI Models Actually Read Your Website's Full HTML?
Most don't — and what they miss affects how accurately they understand your content.
There's a common assumption about how AI interacts with the web: that when an AI visits your page, it reads it the same way a browser does. It doesn't. What an AI system actually receives when it fetches a webpage is often very different from what you see in your browser — and that gap has real consequences for how well AI understands your content.
This piece looks at how AI systems ingest web content, where they lose information in the process, and why some of the most popular fixes only partially solve the problem.
How AI Systems Actually Read Web Pages
Most AI systems that answer questions based on web content use a process called Retrieval-Augmented Generation (RAG). The basic flow: a question comes in, the system fetches relevant web pages, converts them to plain text, cuts that text into smaller chunks, and feeds the most relevant chunks to the AI as context.
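That flow can be sketched in a few lines of Python. This is a toy version, not any particular framework's implementation: regex tag-stripping stands in for a real extractor, and word overlap stands in for embedding-based retrieval.

```python
import re

def html_to_text(html: str) -> str:
    # Crude tag stripping: this is the lossy conversion step described above
    return re.sub(r"<[^>]+>", " ", html)

def chunk(text: str, size: int) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(chunk_text: str, question: str) -> int:
    # Toy relevance score: shared-word count (real systems use embeddings)
    return len(set(chunk_text.lower().split()) & set(question.lower().split()))

html = "<h1>Pricing</h1><p>The Pro plan costs $30 per month.</p>"
question = "How much does the Pro plan cost?"

chunks = chunk(html_to_text(html), size=50)
best = max(chunks, key=lambda c: score(c, question))
# `best` is what would be handed to the model as context
```

Every real pipeline adds sophistication at each stage, but the shape is the same: fetch, flatten, cut, rank, feed.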
The conversion step is where things start to go wrong. Popular tools like LangChain and Apify's RAG Web Browser strip HTML down to raw text by default. The logic sounds reasonable — AI models are trained on text, not markup, so why keep the tags? But HTML isn't just a wrapper for text. It carries meaning.
A 2025 research paper called HtmlRAG, published by researchers at Renmin University and Tencent, tested this directly. They found that AI models given cleaned HTML consistently outperformed models given plain text on question-answering tasks. The reason: tags like <h1>, <table>, and <a> encode relationships between pieces of content. Strip them out, and you don't get a cleaner version of the page — you get a flatter, less meaningful one.
A table of product specs means something different from the marketing paragraph above it. A code block has a specific relationship to the explanation around it. When those structural signals disappear, the AI is left guessing at context that used to be explicit.
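A tiny illustration of that flattening, using the kind of naive tag stripping many text extractors effectively perform:

```python
import re

table_html = (
    "<table>"
    "<tr><th>Model</th><th>RAM</th></tr>"
    "<tr><td>A1</td><td>8 GB</td></tr>"
    "<tr><td>A2</td><td>16 GB</td></tr>"
    "</table>"
)

# Strip tags, collapse whitespace: the rows and columns vanish
flat = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", table_html)).strip()
print(flat)  # Model RAM A1 8 GB A2 16 GB
```

A human reading the original table knows 16 GB belongs to A2. The flattened string carries no such guarantee; the model has to infer the pairing from word order alone.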
JavaScript Makes It Worse
A bigger problem is that most AI crawlers can't run JavaScript at all — and a huge portion of the modern web depends on JavaScript to display its actual content.
Think of any site built with React, or a product catalog that loads dynamically, or reviews that appear after the page loads. The HTML the server sends on first request is often just an empty shell. The real content only appears after JavaScript runs. For most AI crawlers, that content is invisible.
A 2025 study tracking over half a billion requests from OpenAI's crawler found zero evidence of JavaScript execution. Vercel's analysis found the same pattern for Anthropic's and Perplexity's crawlers. They download JavaScript files — they just don't run them.
Google is the exception. Its AI crawler (used for Gemini) runs a full headless browser that actually executes JavaScript, sharing the same rendering infrastructure as Googlebot. Even so, it caches JavaScript resources for up to 30 days and deprioritizes some pages, so even Google's view can be stale.
For everyone else: if your content loads via JavaScript, AI systems largely can't see it. A simple test — disable JavaScript in your browser and reload your page. That's roughly what most AI crawlers see. This is exactly why server-side rendering has made such a strong comeback.
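The same check can be approximated in code. This is only a heuristic, and real app shells vary, but a near-empty root container plus bundled scripts is a common tell:

```python
def looks_client_rendered(server_html: str) -> bool:
    # Heuristic only: an empty root container plus script tags suggests
    # the real content appears only after JavaScript runs.
    lowered = server_html.lower()
    empty_root = (
        '<div id="root"></div>' in lowered or '<div id="app"></div>' in lowered
    )
    return empty_root and "<script" in lowered

spa_shell = (
    '<html><body><div id="root"></div>'
    '<script src="/bundle.js"></script></body></html>'
)
static_page = "<html><body><h1>Pricing</h1><p>$30/month</p></body></html>"
```

If a fetch of your own page with a plain HTTP client returns something like `spa_shell`, most AI crawlers are seeing exactly that: a shell.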
Cloudflare's Markdown Fix Helps, But Not Completely
In early 2026, Cloudflare announced Markdown for Agents, a feature that converts HTML to Markdown before delivering it to AI systems. It works automatically at the network edge — no changes needed on your site.
The token savings are significant. One test showed a page dropping from ~16,000 tokens as HTML to ~3,150 as Markdown — an 80% reduction. That directly cuts AI API costs and speeds things up. We break down exactly where those costs accumulate in our analysis of RAG pipeline token waste.
But Markdown conversion is still lossy. It drops custom data attributes, ARIA labels, and the semantic tags that help distinguish, say, navigation links from body content. Structural nuance that existed in the HTML doesn't survive the trip to Markdown.
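A toy comparison makes both points at once: the savings are real, and so is the loss. The 4-characters-per-token ratio below is a rough rule of thumb, not any particular tokenizer, and the two page versions are hypothetical.

```python
def rough_tokens(text: str) -> int:
    # Rough proxy: roughly 4 characters per token for English text
    return max(1, len(text) // 4)

html_version = (
    '<nav aria-label="Main">'
    '<a href="/pricing" data-track="nav-pricing">Pricing</a></nav>'
    '<h2 id="plans">Plans</h2><p>The Pro plan costs $30 per month.</p>'
)
md_version = "## Plans\n\nThe Pro plan costs $30 per month.\n\n[Pricing](/pricing)"

# The Markdown is far cheaper to process, but aria-label, data-track, and
# the navigation-vs-body distinction did not survive the conversion.
print(rough_tokens(html_version), rough_tokens(md_version))
```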
More fundamentally, it doesn't address the JavaScript problem at all. If a page's content loads client-side, converting the server's HTML response to Markdown just produces a very efficient empty document. Google's John Mueller raised similar concerns about whether Markdown actually serves AI systems better than well-structured HTML. The feature is a genuine improvement for the right use cases — but it's an optimization, not a rethink.
What Gets Lost When You Strip Too Much
Weaviate and Pinecone have both published research on how cutting documents into chunks affects AI retrieval quality. The tradeoff is straightforward: smaller chunks are easier to retrieve but lose surrounding context; larger chunks preserve context but become noisy.
The problem gets worse when the structure that tied everything together has already been stripped. A comparison table in plain text might just look like a list of numbers — with column headers somewhere else in a different chunk. A FAQ section becomes a wall of paragraphs with no clear signal of what's a question and what's an answer. The AI sees text that was once organized, without any of the organization.
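The severing is easy to reproduce. In this sketch, a small chunk size splits a spec sheet so that a model name and its price land in different chunks:

```python
def chunk_words(text: str, size: int) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

doc = (
    "Model A1 ships with 8 GB of RAM and costs 499 dollars. "
    "Model A2 ships with 16 GB of RAM and costs 799 dollars. "
    "Both models include a three year warranty."
)

small = chunk_words(doc, 6)   # precise to retrieve, but context is severed
large = chunk_words(doc, 40)  # context survives, but the chunk is noisier

# In the small chunks, "799 dollars" is separated from "Model A2"
a2_chunk = next(c for c in small if "A2" in c)
print(a2_chunk)           # Model A2 ships with 16 GB
print("799" in a2_chunk)  # False
```

Retrieve the chunk containing "A2" and you get a model with no price; retrieve the chunk containing "799" and you get a price with no model.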
Anthropic tried to address this with contextual retrieval in 2024 — a method that uses an AI model to generate a short context summary for each chunk before storing it. It helps, but it's expensive and treats the symptom rather than the cause. Our piece on context windows vs. retrieval quality covers this tradeoff in depth.
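The shape of that idea can be sketched with a template standing in for the model-generated summary (in the real method, an LLM writes a chunk-specific prefix; the title and section names here are hypothetical):

```python
def contextualize(chunk: str, doc_title: str, section: str) -> str:
    # A fixed template stands in for the LLM-written prefix; in the real
    # method, a model generates a summary tailored to each chunk.
    return f"[From '{doc_title}', section '{section}'] {chunk}"

stored = contextualize(
    "costs 799 dollars and ships with 16 GB of RAM",
    "Model A2 spec sheet",
    "Pricing",
)
print(stored)
```

The prefix restores some of the context the chunking destroyed, at the cost of an extra model call per chunk at indexing time.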
The HtmlRAG paper is blunt about the failure modes: tables get scrambled, code gets separated from its explanations, links that would clarify meaning get dropped. The AI ends up reconstructing what a document meant from fragments, which leads to confident wrong answers or vague non-answers.
The Real Fix Is Infrastructure, Not Formatting
The underlying issue isn't which text format AI systems receive. It's that the entire pipeline — from fetching a page to feeding it to a model — was built to extract text, not to preserve meaning.
What a proper AI delivery layer needs to do is different: render JavaScript reliably so dynamic content is actually visible; keep semantic structure intact so relationships between content elements survive; and deliver all of it in a form AI agents can reason over accurately.
Projects like MachineContext approach this as an infrastructure problem rather than a formatting one. The question isn't how to make pages smaller for AI — it's what information needs to survive the pipeline for AI to get things right. An AI agent that misreads a price, confuses product variants, or misses a key specification doesn't just give a worse answer. In agentic workflows, it makes wrong decisions downstream.
Will This Just Fix Itself as AI Improves?
Newer AI models with much larger context windows can theoretically process full pages without chunking, which removes one layer of the problem. And there's real commercial pressure on AI companies to improve how they access web content.
But the trajectory is slower than the problem. JavaScript-heavy sites aren't going away. AI agents are being deployed in real business workflows right now, with web access quality that hasn't caught up. And better models don't automatically fix bad inputs — they tend to produce more confident wrong answers when the retrieved content is incomplete or scrambled.
The web spent thirty years building delivery infrastructure for browsers. AI needs its own equivalent — not as an afterthought bolted onto existing pipelines, but as a deliberate layer designed around what AI systems actually need to reason accurately about what's on the page.
Built for this problem
Control exactly what AI reads on your site
MachineContext serves clean, structured content to AI bots — JavaScript rendered, properly formatted, always accurate — while keeping your site unchanged for humans.