
Paywalls vs. AI Crawlers: What's Actually Happening


AI systems are accessing, reconstructing, and sometimes bypassing gated content — and publishers are just starting to respond.

Publishers have invested decades in paywalls. The New York Times, The Wall Street Journal, The Financial Times, CNBC — they built subscription models because they needed recurring revenue to fund quality journalism. Paywalls work by restricting access. You pay, you read. You don't pay, you hit a meter and access stops. It's a simple control: access is gate-kept by payment.

AI crawlers don't have a payment relationship with publishers. They're not subscribers. So what happens when they visit a paywalled site?

The answer is complicated and involves three different mechanisms, each with different business consequences.

Three Ways AI Systems Access Paywalled Content

Traditional web crawlers like Googlebot and bingbot respect the robots.txt file. It's a simple text protocol from 1994 that says "don't visit these pages." Crawlers choose to respect it because the alternative, being blocked by publishers, is worth avoiding. This convention has held for thirty years.
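
The protocol itself is almost trivially simple. Here is a sketch of a robots.txt that admits a traditional search crawler while disallowing AI crawlers; GPTBot (OpenAI) and CCBot (Common Crawl) are real, documented user agent tokens, but the policy shown is purely illustrative:

```
# robots.txt — illustrative policy, not a recommendation
User-agent: Googlebot
Allow: /

User-agent: GPTBot    # OpenAI's training crawler
Disallow: /

User-agent: CCBot     # Common Crawl's crawler
Disallow: /
```

Note that nothing enforces this file. It is a request, and it only constrains crawlers that identify themselves and choose to comply.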

But AI systems aren't playing by thirty-year-old rules. When Googlebot hits a robots.txt that disallows crawling, it stops. When an AI browsing agent like OpenAI's Atlas approaches the same page, it might look identical to a regular Chrome browser, complete with JavaScript execution, cookie handling, and user agent strings that are indistinguishable from a person reading on their laptop. The technology for AI to bypass bot detection has matured to the point where detection itself is nearly impossible. The paywall sees a user, not a bot. It lets them through.
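
To see why this defeats detection, consider the naive check most paywalls start from: inspect the User-Agent header for known crawler markers. This is a minimal sketch, not a real detection system (production systems layer on IP reputation, behavioral signals, and fingerprinting), and the marker list is illustrative:

```python
# Naive user-agent paywall check, sketched to show why UA inspection fails.
KNOWN_BOT_MARKERS = ("Googlebot", "bingbot", "GPTBot", "CCBot")

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the UA string contains a known crawler marker."""
    return any(marker in user_agent for marker in KNOWN_BOT_MARKERS)

# A crawler that declares itself is caught:
assert looks_like_bot("Mozilla/5.0 (compatible; GPTBot/1.0)")

# An AI browsing agent presenting a stock Chrome UA sails through:
chrome_ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
             "AppleWebKit/537.36 (KHTML, like Gecko) "
             "Chrome/120.0.0.0 Safari/537.36")
assert not looks_like_bot(chrome_ua)
```

The check only works against crawlers that volunteer their identity. An agent that executes JavaScript, handles cookies, and sends a browser UA string is, from the server's perspective, a person.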

The second mechanism is reconstruction. Even without direct access to paywalled articles, AI systems don't need the original content. They can reconstruct it from secondhand sources. Social media discussion, comment threads, academic citations, news aggregators, and forums all discuss paywalled articles. People quote them. Researchers cite them. From these fragments, an AI system trained on a broad web corpus can often reconstruct enough of the original article to answer questions about it. Research found that AI systems successfully reconstructed about 50% of paywalled content from major publications using this method.

The third mechanism is historical access. Common Crawl is a non-profit that archives the public web, crawling billions of pages and making them available for research and AI training. It has become the foundational dataset for AI model training. In late 2025, Common Crawl faced criticism for "quietly funneling paywalled articles to AI developers". Articles that were once publicly accessible, crawled years ago before they were gated, remained in the archive. AI companies used those archived versions for training, accessing content that's now paywalled.
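
That archive is queryable today. Common Crawl exposes a public CDX index that records when and where each URL was captured. The sketch below only builds the query URL; the crawl ID shown is one of the real published crawl labels, but current IDs should be taken from index.commoncrawl.org rather than from this example:

```python
# Sketch: constructing a query against Common Crawl's public CDX index.
from urllib.parse import urlencode

def cdx_query_url(page_url: str, crawl_id: str = "CC-MAIN-2023-50") -> str:
    """Build the CDX index lookup URL for a given page."""
    params = urlencode({"url": page_url, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

print(cdx_query_url("example.com/some-article"))
# A GET on this URL returns one JSON record per capture, including the
# capture timestamp and the offset of the archived page body in a WARC file.
```

The point for publishers: a page crawled in 2018, before it went behind a meter, stays retrievable from the archive regardless of what the live site does now.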

All three mechanisms work at scale. Together, they mean that paywalls are porous to AI.

The Business Impact

Publishers are feeling the effects. The New York Times saw referral traffic drop 27.4% in Q2 2025. CNBC lost 10-20% of search-driven traffic. Even more striking, Google AI Overviews have reduced referral traffic by up to 70% for some publishers. When Google answers the user's question directly in its search interface, why does the user need to click through to the publisher's site?

These aren't rounding errors. For a publication like CNBC, search-driven traffic is a significant portion of monthly visitors. A 10-20% drop is a material impact to ad inventory and to subscriber acquisition.

The tension is real: paywalled content is valuable exactly because it's exclusive. Once AI systems have access to it, the exclusivity evaporates. The content becomes available through ChatGPT or Perplexity or any other system that crawled it. The paywall model breaks down.

The Emerging Responses

Some publishers are fighting back. Cloudflare, which sits in front of a massive share of the web's traffic, introduced what it calls "pay per crawl" in mid-2025: a system where content owners set pricing for AI crawler access, essentially charging AI companies for the privilege of indexing their content. This gives publishers a lever: either block AI crawlers entirely, or charge them. The approach has been adopted by major publishers including Condé Nast, The Associated Press, The Atlantic, and Gannett.
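
The mechanics build on a long-dormant piece of HTTP: status code 402, Payment Required. A minimal sketch of the idea, where crawler names and the payment ledger are purely illustrative and nothing here reflects Cloudflare's actual API:

```python
# Hedged sketch of the pay-per-crawl idea: AI crawlers without a payment
# arrangement get HTTP 402; everyone else gets the page.
AI_CRAWLERS = {"GPTBot", "CCBot", "ClaudeBot"}   # illustrative set
PAID_CRAWLERS = {"GPTBot"}                       # crawlers with an agreement

def respond(user_agent_token: str) -> int:
    """Return the HTTP status code for a crawler identified by its UA token."""
    if user_agent_token in AI_CRAWLERS and user_agent_token not in PAID_CRAWLERS:
        return 402   # Payment Required: crawl blocked until the operator pays
    return 200       # humans, search crawlers, and paid-up AI crawlers proceed

assert respond("GPTBot") == 200      # has an agreement
assert respond("CCBot") == 402       # AI crawler, no agreement
assert respond("Googlebot") == 200   # traditional search crawler
```

Of course, this gate has the same blind spot as any user-agent check: it only prices access for crawlers that identify themselves.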

The upside is clear: publishers get compensated for content that AI systems use. The downside is equally clear: AI systems can choose to pay, block the publisher's content entirely, or use reconstruction tactics to access it anyway without paying. It's a negotiation, and most publishers don't have the leverage that a major newspaper with unique, breaking reporting has.

Some AI companies are striking direct deals with publishers. OpenAI has reached licensing agreements with several news organizations. These agreements are still early and uneven. Some publishers are well-positioned to negotiate favorable terms. Others lack the scale or exclusivity to have much leverage. The big picture is still forming.

The unlicensed consumption continues in parallel. Reconstruction still works. Older archived versions still train models. The problem doesn't have a clean solution yet.

What This Means for Visibility

For publishers and media companies, the paywall question is existential. But for other kinds of businesses and creators, the paywall battle has a different implication. It shapes the knowledge landscape that AI systems have access to.

As more premium content goes dark to AI crawlers — either blocked via robots.txt, paywalled entirely, or protected by legal action — AI systems trained on public data will have increasingly uneven knowledge. Breaking news from a major publication might not be visible to ChatGPT if that publication has blocked the OpenAI crawler. Expert analysis from a paywalled newsletter won't train future models if it's protected.

This creates an opportunity for creators and companies willing to publish openly. Content you publish in AI-readable form — in structured markdown, with clear semantics, accessible to crawlers — becomes relatively more visible by default. This doesn't mean everything should be free. But for businesses built on visibility, there's a strategic advantage to the content you do publish being maximally accessible and readable by AI systems.

It also means the web is fragmenting. Public content, paywalled content, AI-crawlable content, AI-blocked content. Different AI systems have access to different subsets of the internet depending on their crawling choices and licensing agreements. This uneven access is exactly why AI search operates on very different signals than traditional search.

Publishers are right to protect their content. But as they do, they're also reshaping what the public internet looks like from an AI perspective. The question for everyone else is what that fragmentation means for visibility, discoverability, and the systems that consume published information going forward.

Built for this problem

Control exactly what AI reads on your site

MachineContext serves clean, structured content to AI bots — JavaScript rendered, properly formatted, always accurate — while keeping your site unchanged for humans.

Get started →