A Bigger Context Window Doesn't Fix Bad Retrieval
The "just give the model more text" approach to RAG has real limits — and the research shows where they are.
The context window keeps growing. GPT-4 operates at 128,000 tokens. Gemini reaches 1 million. Claude can handle 200,000. With windows this large, a common thought follows: why retrieve at all? Why not just load the entire document, the entire codebase, the entire knowledge base into context and let the model read it?
This is seductive because it sidesteps the hard problem. Retrieval is hard. You have to decide what's relevant, build a ranking system, manage vector embeddings, monitor quality. It's much simpler to just hand the model a big stack of documents and let it figure it out. As context windows get larger, this temptation grows stronger.
The problem is that larger context windows don't actually solve the fundamental issue. They just disguise it.
The Lost in the Middle Problem
In 2023, a Stanford-led team published "Lost in the Middle," a foundational paper on how language models use long contexts. They tested modern LLMs, including frontier models, with relevant information placed at different positions within long inputs. The finding was consistent and striking: performance can degrade by 30% or more when relevant information appears in the middle of the context compared to when it appears at the start or end.
This isn't a small quirk. It's a systematic property of how attention behaves in modern large language models. Many of them use Rotary Position Embedding (RoPE), which introduces a long-term decay effect: tokens that are far apart in the sequence attend to each other more weakly. The practical result is that information at the edges of the context, the beginning and the end, effectively receives more attention weight than information in the middle. It's not that the model can't understand middle content. It's that attention is biased toward the edges.
The mechanism matters because it means that bigger windows are a gamble. If your relevant information lands at the beginning or end of what you put in context, great — performance is fine. But if it lands in the middle, which is what happens when you naively stuff documents into a window without retrieval, you're paying for massive latency while getting degraded performance.
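This positional sensitivity is easy to probe directly. The sketch below builds a long context with one relevant sentence, the "needle," placed at a chosen relative depth; in a real test you would send each context plus a question about the needle to the model and score recall (the helper name and filler text here are illustrative, not from any specific benchmark):

```python
def build_probe_context(needle: str, filler_sentences: list[str], depth: float) -> str:
    """Place `needle` at relative position `depth` (0.0 = start, 1.0 = end)
    inside a haystack of filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    idx = round(depth * len(filler_sentences))
    sentences = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(sentences)

# Sweep the needle across depths and report where it lands.
filler = [f"Background sentence number {i}." for i in range(1000)]
needle = "The access code for the archive is 7241."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    ctx = build_probe_context(needle, filler, depth)
    frac = ctx.index(needle) / len(ctx)  # needle position as a fraction of chars
    print(f"depth={depth:.2f} -> needle at {frac:.2f} of context")
```

Plotting recall against depth is what produces the characteristic U-shaped curve: strong at the edges, weak in the middle.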
The Benchmarking Data
The decay effect isn't theoretical. Real benchmarks show where the cliff is. Testing on recent models found that Llama-3.1-405B starts degrading after 32,000 tokens in context. GPT-4-0125-preview starts degrading after 64,000 tokens. These are the frontier models with the largest advertised context windows. And their practical limits are well below their advertised maximums.
The same benchmarks tested what happens when you increase the number of retrieved passages in a RAG system. The finding was counterintuitive: performance initially improves, then declines. Adding more passages gives the model more opportunities to find the right answer. But past a certain point, adding more passages adds noise faster than it adds signal. The degradation is actually more pronounced with high-quality retrievers. You'd think a better retriever would mean more retrieved passages are useful, but the opposite is true. A high-quality retriever is already returning the most relevant passages first. Appending more passages just pushes the good information toward the middle of a longer context, where attention decays.
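The up-then-down curve suggests a simple guardrail: cap how many passages you include, and stop early when relevance scores fall off. A minimal sketch (the scores and cutoff values are illustrative assumptions, not from any benchmark):

```python
def select_passages(ranked, max_k=8, min_score=0.35, max_drop=0.5):
    """Take (text, score) pairs in ranked order, stopping at max_k passages,
    at an absolute score floor, or when a score drops too far below the best."""
    selected = []
    best = ranked[0][1] if ranked else 0.0
    for text, score in ranked:
        if len(selected) >= max_k:
            break
        if score < min_score or (best > 0 and score / best < max_drop):
            break  # further passages add noise faster than signal
        selected.append(text)
    return selected

ranked = [("p1", 0.91), ("p2", 0.84), ("p3", 0.52), ("p4", 0.31), ("p5", 0.12)]
print(select_passages(ranked))  # → ['p1', 'p2', 'p3']
```

The point is not these particular thresholds but the shape of the policy: take fewer, better passages rather than filling the window.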
The Speed Factor
There's another cost to the "just send more text" approach: latency. Retrieval-augmented generation systems average about one second per query. A system that loads entire documents into context without retrieval averages around 45 seconds. The difference isn't marginal; it's a 45x slowdown.
This matters in production. A chatbot that takes 45 seconds to respond is unusable for most applications; by the time it answers, the user has abandoned the chat and moved on. A system that returns in one second stays in the interaction loop. Speed isn't just a nice-to-have. It's part of the UX and part of what makes certain approaches practical.
The Retrieval Quality Lever
The implication of all this research points to a simple conclusion: for most real applications, retrieval quality matters more than context size. Getting the right 2,000 tokens into context beats getting 50,000 tokens of mixed relevance.
This is where the problem compounds. If your retrieval system is pulling from poorly structured source material — HTML that wasn't cleaned, tables that were scrambled during parsing, semantic relationships that were lost — then no context window compensates for that. A big window is just a bigger room filled with the same degraded input.
This brings us back to the pipeline. The token cost of a badly built RAG pipeline showed how much waste happens when source material isn't properly structured. That waste isn't just about cost. It's about quality. If your source documents are noisy, your retrieved documents will be noisy. If your retrieved documents are noisy, they'll appear in the middle of context where attention decays. You lose on speed, you lose on cost, and you lose on accuracy.
The Right Strategy
Context windows are genuinely useful and getting bigger is genuinely good. But the bottleneck for most applications isn't the size of the window. It's what goes into the window. A system optimized for retrieval quality — with clean source material, intelligent chunking, and precise ranking — will outperform a system that optimizes for window size while ignoring retrieval.
The right approach combines both. Use the large context window, but use it strategically. Retrieve precisely, place the most relevant passages at the start and end of the context (the edges, where attention is strongest), and trust that you don't need to fill the entire window. The model will use what it needs and ignore what it doesn't, but it will use it more effectively when the signal-to-noise ratio is high.
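One way to exploit the edge bias, similar in spirit to the "long-context reorder" transformers found in some RAG frameworks, is to interleave ranked passages so the strongest ones land at the start and end while the weakest sink to the middle. A minimal sketch:

```python
def reorder_for_edges(ranked_passages):
    """Given passages ranked best-first, reorder them so the best ones sit
    at the start and end of the assembled context, weakest in the middle."""
    front, back = [], []
    for i, passage in enumerate(ranked_passages):
        if i % 2 == 0:
            front.append(passage)   # ranks 1, 3, 5, ... fill from the start
        else:
            back.append(passage)    # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]

print(reorder_for_edges(["r1", "r2", "r3", "r4", "r5"]))
# → ['r1', 'r3', 'r5', 'r4', 'r2']
```

The top-ranked passage opens the context and the second-ranked passage closes it, so the two most attention-favored positions hold the two most relevant passages.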
This is harder than just loading documents. It requires intentionality about what goes into the system and how it's structured. But it's the difference between a system that works well and a system that throws compute at the problem and hopes for the best.
The Implication for Web Content
If you're building a product that crawls or consumes web content, this matters to your architecture. Web pages are often poorly structured for consumption by AI systems — JavaScript-heavy, boilerplate-laden, semantically scrambled. If you ingest that material as-is and load it into an LLM's context, you're at the mercy of the middle-of-context decay problem. The fix isn't context size. It's cleaning the source material before retrieval happens — converting HTML to structured Markdown, extracting the semantic components of pages, removing boilerplate, and storing only what's actually informative.
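As a minimal illustration of that cleaning step, here is a sketch using Python's standard-library html.parser. The class name and tag list are illustrative; production extractors handle far more cases, and nothing parser-based can see JavaScript-rendered content:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}

class BoilerplateStripper(HTMLParser):
    """Collect visible text while skipping scripts, styles, and common
    boilerplate containers (nav, footer, ...)."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<nav>Home | About</nav><article><h1>Title</h1>"
        "<p>The useful part.</p></article><script>track()</script>")
print(extract_text(page))  # → "Title The useful part."
```

Even this toy version shows the payoff: the navigation links and tracking script never reach the index, so they can never crowd the middle of a context.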
Built for this problem
Control exactly what AI reads on your site
MachineContext serves clean, structured content to AI bots — JavaScript rendered, properly formatted, always accurate — while keeping your site unchanged for humans.