
The Race to Build Web Standards for AI: robots.txt, llms.txt, ai.txt

Everyone agrees the web needs rules for AI access. Nobody agrees what those rules should look like.

The web has had a protocol for managing automated access since 1994. It's called robots.txt. It's a simple text file in the root directory of a domain that tells crawlers which paths they shouldn't visit. The entire system works because crawlers choose to respect it. There's no technical enforcement. It's a social contract. Crawler developers decided that respecting publishers' wishes was worth the coordination cost. So they built the feature in.

It worked because both sides wanted it to work. Publishers wanted control over which parts of their site got indexed. Search engines wanted publishers to trust them. The incentives aligned.

In 2025-2026, the web has a new problem: it has too many AI systems visiting it, each with different capabilities and intentions, and no clear framework for managing what they can and can't access. The old social contract is breaking down.

Three Competing Approaches Have Emerged

The simplest response has been to extend robots.txt. Publishers add entries for OpenAI's crawler (GPTBot), Anthropic's crawler (ClaudeBot), and Perplexity's crawler (PerplexityBot) to their disallow list. This works, technically. The crawlers, so far, respect it. But it's a blunt instrument. robots.txt can name user agents, not purposes. Distinguishing training crawls from live retrieval only works where a company publishes separate tokens for each, as OpenAI does with GPTBot for training and ChatGPT-User for live browsing, and publishers have to discover and track every token themselves. Many publishers want to allow live querying while blocking training. robots.txt was never designed to express that intent directly.
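As a sketch, the common pattern of disallowing the major AI crawlers while leaving traditional search untouched looks like this (user-agent tokens come from each company's published crawler documentation; verify the current tokens before deploying):

```
# robots.txt: block known AI crawlers, allow everyone else
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# All other crawlers (Googlebot, Bingbot, etc.) remain unrestricted
User-agent: *
Allow: /
```

Note that this is purely advisory: nothing in the protocol prevents a crawler from fetching the disallowed paths anyway.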

The second approach is llms.txt. Proposed by researcher Jeremy Howard, llms.txt is a curated map of your most important content for AI. In practice, adoption is nearly nonexistent. Most AI companies haven't integrated llms.txt support. It remains a proposal with good intentions and minimal uptake.
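The proposal itself is simple: a markdown file served at /llms.txt with a title, a one-line summary, and annotated links to your most important pages. A minimal sketch following the proposal's format (the site and paths here are invented for illustration):

```
# Example Corp

> Example Corp builds billing APIs for subscription businesses.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): set up your first charge
- [API reference](https://example.com/docs/api.md): endpoints and error codes

## Optional

- [Blog](https://example.com/blog): product announcements and changelogs
```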

The third is ai.txt. Proposed in 2025, ai.txt is more ambitious than either robots.txt extensions or llms.txt. It's a domain-specific language that would let publishers specify granular permissions. You could allow summarization but disallow training. You could permit access to one section of your site but not another. You could ask for attribution or compensation. The idea is to move beyond binary allow/disallow into a richer policy language that captures the real variety of publisher needs. But ai.txt is even further from widespread adoption than llms.txt.
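Because ai.txt has no settled syntax, any concrete example is speculative. The sketch below is purely illustrative of the kind of policy language the proposals describe, not a real specification, and the directive names are hypothetical:

```
# ai.txt (illustrative only; no standardized syntax exists yet)
User-agent: *
Allow-use: summarization
Disallow-use: training
Allow: /blog/
Disallow: /premium/
Attribution: required
```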

All of these standards are being discussed in detail across the web, but the adoption challenge is identical: they only work if AI companies respect them. robots.txt has thirty years of ecosystem pressure. Search engines compete partly on trustworthiness, and respecting robots.txt is table stakes. llms.txt and ai.txt have none of that pressure. An AI company that ignores them faces no market consequence because no user will ever know what was in that file.

What the Major Platforms Actually Do

Google has been clear: it does not support llms.txt. It continues to use its own crawling and indexing framework based on internal signals and publisher feedback. Google's approach to AI is to integrate it into its own products rather than respect external signaling.

This is the core tension. Google controls so much search volume that if Google ignores a standard, that standard becomes less useful. But Google also has different incentives than smaller AI companies. OpenAI and Perplexity might choose to respect robots.txt extensions to build trust. Google already has enormous trust (or at least acceptance) from publishers, so it can afford to ignore these signals.

The Role of Sitemaps

Some SEO researchers argue that the actual technical baseline that matters hasn't changed. Well-maintained sitemaps combined with fast, server-rendered content remain the most reliable signal for both traditional and AI-based indexing. A sitemap tells search systems what pages exist. Server-rendered content means those pages are readable without JavaScript execution. These technical fundamentals predate all three standards frameworks. They're not new rules. They're foundational plumbing that AI systems still need.
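The baseline they point to is mundane: an XML sitemap enumerating canonical URLs with last-modified dates, per the sitemaps.org protocol. A minimal example (URLs and dates are placeholders):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/pricing</loc>
    <lastmod>2026-01-10</lastmod>
  </url>
</urlset>
```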

This hints at the deeper gap: all of these standards assume that once an AI system knows where your content is, it can actually read it. That assumption is fragile. A standards file pointing to a JavaScript-rendered page that AI can't render is just a better map to a locked room.

The Missing Layer

Paywalls, robots.txt, llms.txt, ai.txt: none of these address the most common form of invisibility to AI systems, which is content that's technically accessible but practically unreadable. A page rendered entirely in JavaScript is invisible to most AI crawlers. A table scrambled by poor HTML parsing is unreadable even if the crawler can fetch it. HTML with weak semantic structure is harder for LLMs to understand even when they do retrieve it. A page full of boilerplate, navigation, and advertising is expensive to embed and retrieve, and likely to land in the middle of a context window, where attention decays.

You can publish a robots.txt. You can create an llms.txt file. You can set up pay-per-crawl. But if the underlying content is poorly structured, none of it matters. The standards help with the permission layer. The rendering and semantic clarity layer is separate.
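One way to approximate what a non-rendering crawler sees is to parse the raw server response and keep only the visible text, skipping scripts and styles. A minimal sketch using just the Python standard library (the two HTML samples are invented for illustration):

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects the text a non-JS crawler would see, ignoring script/style."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def crawler_visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)

# A server-rendered page: the content is in the HTML itself.
ssr = "<html><body><article>Our pricing starts at $10.</article></body></html>"

# A client-rendered page: the content only exists after JavaScript runs.
csr = "<html><body><div id='root'></div><script>render()</script></body></html>"

print(crawler_visible_text(ssr))  # the article text survives
print(crawler_visible_text(csr))  # empty: nothing for a non-JS crawler
```

Running a check like this against your own pages is a quick way to see whether the permission layer is even pointing at something readable.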

The Bigger Picture

The standards race is worth watching because the web eventually converges on norms. It happened with robots.txt across three decades. It happened with SSL/TLS certificates. It happened with sitemaps, with structured data markup, with mobile-friendly design. Standards that serve the interests of both publishers and index builders eventually become universal.

The current landscape is fragmented and mostly non-binding. robots.txt extensions are working because the incentives align. llms.txt is stillborn. ai.txt is aspirational. But this will consolidate. Within a few years, one or more of these approaches will likely become standard, and major AI companies will support it. The question is whether the standards capture the real needs of publishers or whether they remain window dressing while the actual control points, meaning technical rendering, semantic structure, and paywall enforcement, remain where they are.

The practical move in the meantime remains straightforward: make your content actually readable, not just theoretically accessible. Render on the server. Structure semantically. Clean the markup. Eliminate boilerplate. These aren't novel recommendations. But they matter more now because they're the foundation that all the standards-level work depends on.

The standards will emerge. But until they're widely adopted and enforced, the real visibility advantage goes to content that's intentionally structured to be legible to both human readers and automated systems. That's not a prediction about future standards. It's a description of what works today.
