The Token Cost of a Badly Built RAG Pipeline
Inefficient retrieval isn't just slow — it's expensive. Here's where the waste happens.
Every API call to an LLM costs money. The bill arrives per token. A token is roughly four characters, so a single HTML page can mean thousands of tokens, each one adding to your operational cost. Most AI teams understand this intellectually. But many still haven't thought through how their retrieval system either multiplies or divides that cost.
A retrieval-augmented generation (RAG) pipeline does one simple thing: it fetches relevant documents, compresses them, and sends them to an LLM along with a user question. The LLM then synthesizes an answer based on what it retrieved. The cost of that operation isn't just the LLM inference. It's the entire path: fetching, parsing, embedding, storing vectors, retrieving, and finally prompting the model. And almost every step has waste built in.
The Four Biggest Sources of Waste
Start with what gets fetched. A typical RAG pipeline that scrapes the web to fill a retrieval system doesn't fetch articles. It fetches raw HTML. That means every <div> of navigation markup, every footer with thirty links, every ad tag, every tracking pixel, every bit of boilerplate that makes the page work in a browser gets stored and later retrieved. A single HTML page easily runs 15,000 to 20,000 tokens — as we cover in detail in our breakdown of what AI actually reads. The same content as clean Markdown might be 3,000 tokens. That's an 80% reduction in token count before the LLM ever sees it.
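As a back-of-the-envelope illustration of that reduction, here is a sketch using the rough four-characters-per-token heuristic. The character counts are hypothetical stand-ins, not measurements:

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count with the ~4 characters/token rule of thumb."""
    return len(text) // 4

# Hypothetical sizes for the same article in two formats.
raw_html = "x" * 60_000        # nav markup, footers, ad tags included
clean_markdown = "x" * 12_000  # semantic content only

html_tokens = estimate_tokens(raw_html)      # ~15,000 tokens
md_tokens = estimate_tokens(clean_markdown)  # ~3,000 tokens
reduction = 1 - md_tokens / html_tokens

print(f"HTML: ~{html_tokens:,} tokens; Markdown: ~{md_tokens:,} tokens")
print(f"Reduction: {reduction:.0%}")
```

The exact ratio varies page by page, but the shape of the math doesn't: every character of boilerplate you store is a character you later pay to embed, retrieve, and prompt with.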
Cloudflare's work on Markdown for Agents made this concrete. They showed that converting web pages to semantic Markdown — removing the structural noise while preserving the actual content structure — cuts token counts dramatically. The implication is straightforward: every pipeline that stores raw HTML is leaving money on the table.
The second source of waste is chunking. When a document goes into a vector database, it's typically split into overlapping chunks — sometimes 512 tokens, sometimes 1024. A poorly designed chunking strategy creates too many chunks. Those chunks get embedded (which costs money), stored (which costs storage), and later retrieved (more costs). If your chunking logic doesn't account for semantic boundaries, you end up with chunks that are redundant or that split conceptual units across multiple fragments. The retriever then drags back all of them, adding noise without signal.
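One way to respect semantic boundaries is to accumulate whole paragraphs until a token budget is reached, rather than cutting at a fixed offset. A minimal sketch, where the 512-token budget and the 4-characters-per-token estimate are assumptions; a real pipeline would use its embedding model's actual tokenizer:

```python
def chunk_by_paragraph(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into chunks at paragraph boundaries, up to a token budget."""
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        para_tokens = len(para) // 4  # rough token estimate
        # Start a new chunk rather than splitting a paragraph in half.
        if current and current_len + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "First paragraph." + "\n\n" + "Second paragraph." * 200
print(len(chunk_by_paragraph(doc)))
```

A paragraph larger than the budget still becomes its own oversized chunk here; handling that case (e.g. falling back to sentence splits) is where real chunkers earn their keep.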
Third is retrieval count. Many pipelines default to retrieving the top-K results for some arbitrary K, often 10, sometimes 20. The assumption is that more context is always better. It isn't. Research from 2024 shows that increasing the number of retrieved passages initially improves performance, then causes it to decline. The researchers found the degradation is actually more pronounced with high-quality retrievers. You're paying to retrieve, and to include, chunks that actively harm the quality of the answer.
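An alternative to a fixed top-K is a relevance threshold with a hard cap, so low-scoring chunks never reach the prompt at all. A sketch with made-up similarity scores and an assumed 0.75 cutoff; a real system would tune the threshold against an evaluation set:

```python
def retrieve(scored_chunks, threshold: float = 0.75, hard_cap: int = 5):
    """Keep only chunks above a relevance threshold, capped at hard_cap."""
    ranked = sorted(scored_chunks, key=lambda c: c[1], reverse=True)
    kept = [(chunk, score) for chunk, score in ranked if score >= threshold]
    return kept[:hard_cap]

candidates = [("chunk-a", 0.91), ("chunk-b", 0.88), ("chunk-c", 0.52),
              ("chunk-d", 0.79), ("chunk-e", 0.31)]
print(retrieve(candidates))
# Only the candidates that clear the bar reach the LLM, not all five.
```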
The fourth is redundant conversion. Every RAG team solves the HTML-to-text problem from scratch. Someone writes a script to strip tags, handle Unicode, and convert tables to Markdown. A different team does the same thing with slightly different rules. There's no standardization, no shared component. The work is duplicated across thousands of codebases, and the inconsistency means some pipelines handle edge cases better than others. That unevenness cascades: clean input gives the embedding model a cleaner signal, the retriever makes better decisions, and the LLM gets better context.
The Numbers Actually Add Up
Here's the real calculation. A RAG system might reduce prompt sizes by 70% compared to stuffing raw documents into context. That's enormous. But only if the retrieval system itself is efficient.
In production, the cost of a typical RAG system breaks down roughly like this, according to industry analyses of operational costs: embedding generation accounts for 40-60% of the total, vector storage for 20-35%, and LLM inference for 15-25%. Embedding dominates because you embed every chunk in your entire corpus to build the vector index, and then embed each incoming query at retrieval time.
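The arithmetic behind that dominance is easy to sketch. All three numbers below are hypothetical (the embedding price, corpus size, and chunk size); plug in your own:

```python
# Back-of-the-envelope cost of embedding a corpus for the vector index.
price_per_million_tokens = 0.10  # assumed embedding price, USD
num_chunks = 500_000             # assumed corpus size in chunks
tokens_per_chunk = 512           # assumed average chunk size

total_tokens = num_chunks * tokens_per_chunk
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"Embedding {num_chunks:,} chunks of {tokens_per_chunk} tokens "
      f"costs about ${cost:,.2f}")
# Cutting chunk noise by 80% cuts this line item by 80% too,
# and the index must be rebuilt every time the corpus changes.
```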
If your chunks are noisy, you're paying to embed noise. If you're retrieving too many chunks, you're paying to embed more than you need. If your HTML hasn't been cleaned, you're embedding structural markup that contains no information relevant to any question a user might ask.
But cost isn't the only lever. Many teams have reported that effective caching strategies — storing the embeddings and results of common queries — reduce monthly token costs by 42% on average. Batch processing requests instead of hitting the API one at a time cuts costs another 30-40%. Those are multiplicative improvements on top of pipeline efficiency.
The Hidden Cost: Accuracy
There's a subtler cost, harder to measure but no less real. A badly optimized pipeline doesn't just waste money. It produces worse answers.
When your retrieval system sends 15,000 tokens of noisy HTML to the LLM when 2,000 tokens of clean content would have sufficed, you're not just paying more. You're scattering the model's attention. The LLM has to sift through structural noise to find the signal. Its fixed attention budget is divided across more noise.
This isn't speculation. Stanford-led researchers published work in TACL 2024 on the "Lost in the Middle" problem. They tested modern LLMs with relevant information placed at different positions within long contexts. The result: performance degrades by 30% or more when the relevant information ends up in the middle of a long context window. We explore this phenomenon in depth in A Bigger Context Window Doesn't Fix Bad Retrieval.
This means that adding chunks to your retrieval doesn't just increase cost linearly. It also increases the likelihood that the relevant chunk ends up buried in the middle, hurting quality. You're paying more and getting worse answers.
The Fix Isn't Just Compression
The answer isn't simply to minify everything. It's intentionality. Strip the right things. Keep the semantic structure. Preserve tables and lists in a format the model can reason about. Chunk at semantic boundaries, not at arbitrary token counts. Retrieve only what you need, not everything that might be vaguely related.
Cloudflare's Markdown for Agents work points to one path: semantic conversion at the source. If your content pipeline converts HTML to structured Markdown using heuristics that preserve meaning, you start with a cleaner foundation.
But that's not enough on its own. First, think about caching: common queries have common answers, and if you're running the same retrieval twice, you're doing twice the work. Second, be intentional about chunk size and overlap. Third, monitor what you're retrieving and what the LLM is actually using. Most teams don't. They assume that if they retrieved it, the model used it. Often it didn't.
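Monitoring whether retrieved chunks actually surface in answers can start as crude lexical overlap. This is only a heuristic sketch (the 0.2 threshold is an arbitrary assumption, and proper attribution is much harder), but even this flags obvious dead weight in the retrieval set:

```python
def usage_report(chunks: dict[str, str], answer: str,
                 threshold: float = 0.2) -> dict[str, bool]:
    """Flag each retrieved chunk as likely-used based on word overlap."""
    answer_words = set(answer.lower().split())
    report = {}
    for chunk_id, text in chunks.items():
        words = set(text.lower().split())
        overlap = len(words & answer_words) / max(len(words), 1)
        report[chunk_id] = overlap >= threshold  # True = likely used
    return report

chunks = {"c1": "tokens cost money per api call",
          "c2": "unrelated navigation footer markup"}
print(usage_report(chunks, "Each API call costs money, billed per token."))
```

A chunk that never clears the bar across many queries is a chunk you're paying to embed, store, and retrieve for nothing.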
The Deeper Pattern
Pipeline efficiency and pipeline accuracy aren't separate concerns. They're the same concern seen from different angles. A cleaner input is both cheaper and more likely to produce a correct answer. The web itself is a major source of noisy inputs — most AI crawlers can only see part of what's on a page, which means your retrieved content may already be missing the most important information before chunking even starts.
The teams shipping the best RAG systems understand this. They don't treat the pipeline as a box that ingests documents and outputs results. They treat it as a system where every choice — what to fetch, how to parse it, how to chunk it, how to embed it, how many results to retrieve, how to present it to the model — cascades through to the bottom line and to the quality of the answers the system produces.
Thinking about token cost, in other words, doesn't mean apologizing for being cost-conscious. It means being precise about what goes into the system and why.
Built for this problem
Control exactly what AI reads on your site
MachineContext serves clean, structured content to AI bots — JavaScript rendered, properly formatted, always accurate — while keeping your site unchanged for humans.
Get started →