7 min read

Why AI Agents Keep Getting Stuck on the Web

Multi-step web tasks sound simple. For AI agents, they're a compounding failure problem.

The vision is compelling. You ask an AI agent to "book me a flight from San Francisco to New York on March 15th, find the cheapest option, and show me hotels within walking distance of Central Park." The agent browses kayak.com, searches for flights, extracts prices, sorts by cost, clicks the cheapest result, then navigates to hotels.com and repeats the process. Thirty seconds later, it presents you with a summary: "Here are your options."

Or you ask an AI agent to "review our three competitors' pricing pages and tell me how our product stacks up." The agent visits competitor websites, extracts pricing information, compares it to yours, and delivers a competitive analysis report. No human involved.

This is the promise of agentic AI. These systems can understand natural language instructions, break them into steps, execute those steps autonomously, and handle the unpredictability of a live website that changes mid-task.

The reality is messier. Even with powerful underlying models, complex multi-step web tasks fail at surprisingly high rates.

The Failure Rates

Research across multiple agentic AI platforms reveals consistent patterns of task incompletion. The mean completion rate for complex multi-step web tasks was 75.3% across all platforms tested. Claude Computer Use, one of the strongest agentic systems available, achieved 86% completion on test tasks. OpenAI's Code Interpreter came in at 81%. Older systems like AutoGPT landed lower, in the 73–74% range.

These aren't catastrophic failures. 75% success means three out of four times, the agent completes the task. But in a production context where the stakes are real — actually booking a flight, submitting a customer service ticket, or querying a database — a 25% failure rate is unacceptable. You can't deploy a system that fails one in four times without a robust fallback mechanism.

The second problem is compounding failure in multi-step pipelines. In agentic systems built from chained steps, each step has a success probability, and the overall success probability is the product of all the individual probabilities.

Do the math. If you have a 10-step process where each step has a 98% success rate — which sounds excellent for an individual step — the overall success rate is 0.98^10, which equals roughly 82%. At 20 steps, it's 67%. At 30 steps, it's 54%. Each step that seems fine in isolation contributes to a cascading failure probability that rapidly becomes unacceptable. Research on this problem shows that in multi-agent systems, even tiny per-step failures compound into hard stops or silent errors that are extremely difficult to diagnose and fix.
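The arithmetic above is easy to check. A minimal sketch, assuming independent steps (the `pipeline_success` helper is hypothetical, not from any framework):

```python
def pipeline_success(per_step: float, steps: int) -> float:
    """Overall success rate of a chain of independent steps:
    the product of each step's individual success probability."""
    return per_step ** steps

# Reproduce the figures in the text: 10, 20, and 30 steps at 98% each.
for steps in (10, 20, 30):
    print(f"{steps} steps at 98% each -> {pipeline_success(0.98, steps):.0%} overall")
```

Even at 99% per step, a 50-step pipeline completes only about 61% of the time, which is why improving per-step reliability alone can't solve the problem.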

Why Web Tasks Are Hard

Web tasks are particularly vulnerable to this problem because the web is dynamic, variable, and designed for humans, not machines.

A form that validates asynchronously might fail to register that a field has been filled incorrectly. An AI agent submits the form. The form silently fails. The agent moves on, thinking the step succeeded. An e-commerce site might load product availability via JavaScript. The agent can't see the JavaScript, so it can't verify whether an item is in stock before proceeding. A date picker is built in React and requires specific click patterns to open. The agent clicks the input field and expects a calendar to appear. Instead, nothing happens. The step fails.

Some failures are detectable. If a form submission returns an error message, the agent can read it and retry. But many failures are silent. The page loads but displays stale data. The API was called but returned results from a cache. The session timed out but the page still renders. The agent has to navigate a landscape where apparent success and actual success are often different things.
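One mitigation is to treat every action as unverified until the page itself shows evidence of success. A minimal sketch of that pattern, assuming hypothetical `submit` and `read_page` callables supplied by the agent framework:

```python
def act_and_verify(submit, read_page, success_marker: str, retries: int = 3) -> bool:
    """Perform an action, then re-read the page and look for explicit
    evidence of success rather than trusting the action's apparent result."""
    for _ in range(retries):
        submit()                    # may "succeed" silently even when it failed
        page = read_page()          # re-read the actual page state
        if success_marker in page:  # explicit confirmation, not assumed success
            return True
    return False                    # surface the failure instead of moving on
```

The design choice is that a `False` return is an explicit, diagnosable failure, instead of the silent "moves on thinking the step succeeded" behavior described above.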

The inability to render JavaScript is the fundamental problem. Studies specifically examining agent performance on JavaScript-heavy sites show that agents without JavaScript execution capability fail dramatically more often than agents with it. An agent that can render JavaScript can see dynamically loaded prices, availability flags, form validation states, and modal dialogs. An agent that reads only the raw HTML sees skeleton screens and empty containers.

Claude Computer Use is effective partly because it actually renders the page as a browser would, taking screenshots and interacting with the rendered viewport. This gives it a true view of the page state. Most other agentic systems don't have this capability. They make HTTP requests, parse HTML, and guess at the page state based on the DOM structure. Guessing is what leads to failures.

The Read-Before-Act Problem

The most common failure point across research on agentic systems is the retrieval step — the moment when the agent reads the current state of the page and builds an internal model of what's happening. If this step is wrong, everything that follows is built on a false premise.

An agent reads a product page and extracts "$49" as the price. But that's yesterday's price, cached in a meta tag. The current price is "$39" and appears in a React component. The agent built its internal model based on stale information. Now it compares prices incorrectly. It makes purchasing decisions on incorrect data. The failure originated in the read, not the act.

This is especially common on modern web applications where content is loaded and updated asynchronously. The agent's first read of the page might capture the initial skeleton. A second later, the actual content loads. But the agent doesn't wait for that. It proceeds based on what it saw in the first read.
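A common defence against reading the skeleton is to poll until the page stops changing before trusting a read. A rough sketch, assuming a hypothetical `read_page` callable that returns the currently rendered content:

```python
import time

def wait_for_stable_read(read_page, interval: float = 0.5, max_wait: float = 10.0):
    """Re-read the page until two consecutive reads match, on the
    assumption that unchanged content means async loading has settled."""
    deadline = time.monotonic() + max_wait
    last = read_page()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = read_page()
        if current == last:  # stable across an interval: likely fully loaded
            return current
        last = current
    return last              # best effort: return the latest read on timeout
```

This still isn't bulletproof (a skeleton screen can be "stable" too), but it avoids the specific failure above: acting on the very first read.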

For agentic systems to work reliably, they need two things: a view of the page that matches what humans see (which typically requires JavaScript rendering), and a validation loop that checks whether actions had their intended effects. Most current implementations have at best one of these. That's why failure rates remain stubbornly high.

The Current State of Deployment

Agentic AI in production is still heavily restricted. Most deployments use agents in a validation framework: an agent prepares a task, then a human reviews the result before it takes effect. This is safe but defeats the purpose of autonomy. The agent is faster than a human would be starting from scratch, but a human still has to verify everything.

Fully autonomous agents — where an agent makes decisions and takes actions without human oversight — are deployed in limited contexts: internal tools, low-stakes tasks, or specialized domains where the success rate is already very high because the problem is well-structured.

As underlying infrastructure improves — as JavaScript rendering becomes standard, as agent frameworks add better validation, as failure recovery becomes more sophisticated — autonomous deployment will expand. But we're not there yet. A system that succeeds 75% of the time is not ready for unsupervised real-world use.

The Web Ingestion Layer Matters

All of this traces back to the core problem: how is the web being read? An agent that can't see JavaScript-rendered content is an agent that's working with incomplete information. The quality of what the agent reads directly determines the quality of what the agent does — whether it's making purchasing decisions, extracting prices, or filling out forms.

For organizations deploying agents, this means the sites those agents need to interact with have to be readable. Server-side rendered, with important content visible in the initial HTML. Forms that validate and provide clear feedback. The different capabilities of each AI crawler mean there's no shortcut here.

For website owners, it means that if you want AI agents to interact with your site (whether that's your own internal agents or external agents representing users), you need to think about readability for non-browser clients. Clear structure, explicit content, standard patterns for forms and validation.

The trajectory is clear. Agent adoption in production will grow significantly over the next few years. But it will grow into environments where the underlying web infrastructure supports it. Sites that are hard for agents to read will either become invisible to agentic applications or will see agent failures that cascade into bad user experiences.

The web was built for humans and search engines. It's now being read by AI agents. That's a new constraint, and it's pushing toward the same solution we've seen in every previous evolution: clearer structure, better semantic clarity, server-side rendering of content that matters.

Built for this problem

Control exactly what AI reads on your site

MachineContext serves clean, structured content to AI bots — JavaScript rendered, properly formatted, always accurate — while keeping your site unchanged for humans.

Get started →