
Why We Abandoned RAG: Six Fundamental Problems

RAG isn’t a bad idea. We took it seriously, built it out, and it did work in certain scenarios.

Last year, we spent months building a complete RAG pipeline: a three-stage processing flow (Extract, Chunk, Embed) and three retrieval strategies (Vector, BM25, Hybrid + Reranking). We worked through every component carefully, from text extraction to reranking models. It was technically elegant.

But we eventually had to face an honest truth: it wasn’t good enough.

This post isn’t a critique of RAG. It’s an honest account of the specific problems we ran into, and how our thinking evolved.

Problem 1: The Embedding Model Dilemma

For a local desktop application, choosing an embedding model is a problem with no good answer.

Small models (< 500M parameters) can run on-device, but their semantic understanding is inconsistent — recall drops noticeably when handling specialized documents, cross-language queries, or long texts. Large models (1B+) produce better quality, but their memory and compute demands are too heavy for an ordinary laptop. Running them in the background puts unacceptable pressure on system resources.

A desktop app has no server to fall back on. You’re forced to compromise between “runnable” and “capable.” Choose one, and the other suffers. This dilemma doesn’t exist for server-side applications — but for local-first apps, there’s no clean way out.

Problem 2: Insensitivity to Domain Vocabulary

Semantic vector search has a fundamental blind spot: it handles specialized terminology poorly.

The reason isn’t complicated. Embedding models are trained on general corpora, where code function names, medical abbreviations, legal clauses, and product model numbers appear infrequently. These terms end up in obscure, unstable positions in the embedding space.

What does this look like in practice? A user who searches “RLHF” won’t necessarily surface documents containing “Reinforcement Learning from Human Feedback.” Search “LTV” and you might miss a report discussing “customer lifetime value.” Search a specific product model number and vector search simply can’t latch onto its precise meaning.

This isn’t a configuration problem that parameter tuning can fix. The standard industry workaround is domain-specific embedding fine-tuning — but that only works for narrow B2B verticals.

Embedding excels at fuzzy semantic matching. Its weakness is exact vocabulary matching. Real users need both.

Problem 3: The Cost of Reranking

Low recall and poor precision are two classic problems in RAG pipelines. The industry’s standard fix for precision is to add a reranking model as a final step.

We implemented this too — and found that the problem wasn’t solved, just relocated.

Reranking models are heavier and slower than embedding models. Adding one significantly increased the latency of the entire retrieval chain, which is especially painful in a local application. More critically, rerankers are also trained on general corpora and share the same domain vocabulary blind spot. They can only reorder the candidates you’ve already retrieved — they can’t recover documents that were never recalled in the first place.

The result: a slower pipeline, a more complex architecture, and the root problem still unresolved. After adding the reranker, ranking quality improved only marginally, while BM25’s contribution was nearly buried under the added complexity.
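The “relocated, not solved” point can be made concrete with a toy sketch. The `retrieve` and `rerank` functions below are deliberately simplistic stand-ins, not our actual components; the point is structural: a reranker only permutes the candidates it is handed.

```python
# Toy sketch: a reranker only reorders the candidate list it receives;
# it cannot recover documents missed at retrieval time.
# `retrieve` and `rerank` are simplified stand-ins, not real models.

CORPUS = {
    "doc-1": "quarterly revenue summary report",
    "doc-2": "office relocation summary memo",
    "doc-3": "customer lifetime value analysis",  # the relevant document
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive lexical retrieval: rank documents by word overlap with the query."""
    def overlap(doc_id: str) -> int:
        return len(set(query.lower().split()) & set(CORPUS[doc_id].split()))
    return sorted(CORPUS, key=overlap, reverse=True)[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Reorder candidates with a (dummy) scoring function; never adds documents."""
    return sorted(candidates, key=lambda d: len(CORPUS[d]))

# "LTV" shares no token with "customer lifetime value", so doc-3 is never
# recalled, and no reranker, however strong, can bring it back.
candidates = retrieve("LTV summary report")
reranked = rerank("LTV summary report", candidates)
```

However sophisticated the scoring inside `rerank`, its output is a permutation of its input; the document that was never recalled stays invisible.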

Problem 4: Fragmented Context

Chunking is the inescapable problem at the heart of RAG.

Once a document is cut into fixed-size segments, each segment loses connection with what came before and after. The AI receives a passage extracted from the middle of a report, with no knowledge of which section it belongs to, what the preceding paragraph discussed, or whether a conclusion follows.

The worst case: a critical paragraph straddles the boundary of two chunks. Both chunks partially match, but neither is complete. The AI receives two fragments that each touch the answer but neither contains the full picture — and produces a response that sounds plausible but isn’t quite right.
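This failure mode is easy to reproduce. In the sketch below (window and overlap sizes are illustrative, not production values), the key sentence straddles a chunk boundary, so no single chunk contains it whole:

```python
# Toy sketch of fixed-window chunking with overlap. The key sentence
# straddles a chunk boundary, so neither chunk contains it in full.
# Sizes are illustrative only.

def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows, each window starting
    `size - overlap` characters after the previous one."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = (
    "Background material fills the opening of the report. "
    "The termination clause requires ninety days written notice. "
    "Further boilerplate follows the key sentence."
)

chunks = chunk(doc, size=80, overlap=10)
key = "The termination clause requires ninety days written notice."

# The sentence exists in the document, yet no chunk contains it whole:
found = any(key in c for c in chunks)  # False
```

Both partial chunks will still match a query about termination, which is exactly how the “plausible but not quite right” answers arise.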

There are many patching strategies for this: larger chunk overlaps, parent chunk retrieval, Small-to-Big indexing… Each patch improves one dimension while introducing new costs — more tokens, more pipeline complexity, harder-to-debug behavior, and less generalizability.

We stacked these patches together and ended up with a system that was complicated, fragile, and still not good enough.

Problem 5: Every Document Type Needs Special Treatment

Generic chunking strategies perform wildly differently across document types — something we didn’t fully anticipate.

Research papers follow an Abstract + Body + References structure. Books have chapter hierarchies and running headers. Contracts have numbered clauses and cross-references. Code documentation has API listings and example code. In a spreadsheet, the meaningful “content” is the column names and data types, not the cell values.

Fixed-window chunking doesn’t understand any of this structure. Split points often land in the middle of a semantic unit — separating a heading from its body, cutting a clause number from its clause text, splitting a table header from its data rows.

Each document type really demands its own parsing and chunking logic. But writing specialized parsers and chunking strategies for every type is enormously labor-intensive to build and to maintain — and even then, the result is only “somewhat better than the generic approach,” still fragmented.

Problem 6: A Poor Experience for AI Agents

Taken individually, each of the above problems is tolerable. But when RAG is actually connected to an AI Agent, all five problems compound, and the result is genuinely bad.

A real scenario: the AI is helping a user analyze a contract. It calls search() to retrieve relevant clauses and gets back 10 chunks. A few chunks are partially relevant, but the information is incomplete. The AI can’t determine how to proceed, so it adjusts the search query and tries again. Another 10 chunks — still not enough. Another query, another search.

Each search is a black box: the AI doesn’t know which keywords will surface what it needs, doesn’t know whether the document even contains the information, and has no sense of how close it is to an answer. This inefficiency isn’t a failure of the Agent’s capability — it’s that the tool’s design doesn’t support rational decision-making.

RAG is optimized for “user asks a question” scenarios. It was never designed for “Agent autonomously explores” scenarios.

The Industry Is Shifting

These problems aren’t unique to us. There are clear trends emerging in how the industry is responding.

Microsoft’s GraphRAG introduces knowledge graphs to address context fragmentation — storing related entities and relationships explicitly rather than relying on fragment reassembly.

PageIndex abandons fixed-size chunking in favor of page-level indexing, preserving the document’s natural boundaries.

Agentic RAG attempts to let the AI self-direct its retrieval strategy rather than following a fixed pipeline. The direction is right, but layering Agent logic on top of a RAG architecture doubles the complexity.

The most radical departure comes from Claude Code and Manus. They abandoned RAG entirely and returned to the most primitive approach: Glob + Grep + Read. Find files, search for keywords, read content. No vector database, no embedding model, no chunking pipeline. The results are actually better.

This clarified something for us: RAG’s design assumption is that “LLMs aren’t smart enough, so we need to pre-process information for them.” That was reasonable in the GPT-3.5 era. But today’s LLMs are capable of autonomously using tools to complete multi-step retrieval tasks. They don’t need pre-fragmented content — they need clues: where are the files, what’s the structure, and then they can decide what to read and how much.

Our Solution: Outline Index

Glob + Grep + Read works beautifully for codebases, but not for user documents. In a codebase, the path src/services/auth.ts already tells you this is the authentication service. But Q4-2024-Summary(revised)(FINAL).docx — the path tells you almost nothing. And PDFs and Word documents are binary formats; grep can’t read them at all.

So our question became: can we build an equivalent “table of contents index” for documents, letting the AI navigate files progressively using a search → outline → read pattern?

We call this approach Outline Index.

The core idea in one sentence: instead of pre-fragmenting information for the AI, give it a map.

For each document, we build a structured “profile” containing the document’s metadata (title, author, keywords, summary) and structural outline (section headings, hierarchy, line number ranges). The AI accesses documents through three layers:

  • search: find relevant documents, returning a file list with metadata — approximately 50 tokens per file
  • outline: view the structural map of a document — approximately 200–500 tokens per file
  • read: precisely retrieve the full text of a specified section — loaded on demand
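As a sketch, the three layers might look like the following over a tiny in-memory corpus. Every name, data shape, and figure here is illustrative, not the actual Linkly AI implementation:

```python
# Illustrative sketch of the search -> outline -> read pattern over a
# tiny in-memory corpus. All names and structures are hypothetical.

DOCS = {
    "contract.txt": {
        "summary": "Service agreement with payment and termination terms",
        "lines": [
            "1. Payment Terms",
            "Invoices are due within thirty days.",
            "2. Termination",
            "Either party may terminate with ninety days notice.",
        ],
        "outline": [
            {"heading": "1. Payment Terms", "range": (1, 2)},
            {"heading": "2. Termination", "range": (3, 4)},
        ],
    },
}

def search(query: str) -> list[dict]:
    """Layer 1: return matching files plus light metadata."""
    q = set(query.lower().split())
    return [
        {"file": name, "summary": doc["summary"]}
        for name, doc in DOCS.items()
        if q & set(doc["summary"].lower().split())
    ]

def outline(name: str) -> list[dict]:
    """Layer 2: return the document's structural map."""
    return DOCS[name]["outline"]

def read(name: str, line_range: tuple[int, int]) -> str:
    """Layer 3: load one section's full text, on demand."""
    start, end = line_range
    return "\n".join(DOCS[name]["lines"][start - 1:end])

hits = search("termination terms")
sections = outline(hits[0]["file"])
text = read(hits[0]["file"], sections[1]["range"])
```

The key property is that each layer is cheap, and the AI chooses at every step whether to descend further or move on to another document.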

This mirrors exactly how humans read: find the book, check the table of contents, flip to the relevant chapter for a close read. Throughout this process, the AI has full context — it knows where it is in the document, can decide to “read a bit more,” and can compare across multiple documents.

Compared to traditional RAG: in the same scenario, Outline Index consumes approximately 800–3,400 tokens while giving the AI precise information with complete context. Traditional RAG returns 10 pre-cut fragments consuming 4,000–6,000 tokens, while the AI knows nothing about the document’s structure.

A useful side effect: the embedding target shifts from raw text chunks to the Outline Index itself. Each document requires only one vector. 10,000 documents ≈ 10,000 vectors ≈ 30 MB of storage, with much faster retrieval.
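The arithmetic behind that estimate, assuming 768-dimensional float32 vectors (the dimensionality and precision are our illustrative assumptions; the exact figure depends on the embedding model):

```python
# Back-of-the-envelope storage estimate for one vector per document.
# 768 dimensions and float32 are assumptions for illustration.
num_docs = 10_000
dims = 768
bytes_per_float = 4  # float32

total_bytes = num_docs * dims * bytes_per_float
megabytes = total_bytes / 1_000_000  # ~30.7 MB
```

Contrast that with chunk-level embedding, where a single long document can easily produce hundreds of vectors on its own.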

For the domain vocabulary problem, BM25 full-text search fills the gap. Dual-path retrieval (BM25 for exact matching + vector search for semantic understanding) fused via RRF eliminates the need for a reranking model entirely.
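Reciprocal Rank Fusion is simple enough to show in full. The version below is a generic textbook implementation with the conventional constant k = 60, not our exact production code:

```python
# Generic Reciprocal Rank Fusion: merge ranked lists by summing
# 1 / (k + rank) per document across lists. k = 60 is the conventional
# constant; this is a textbook sketch, not Linkly AI's exact code.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc-a", "doc-b", "doc-c"]    # exact-match path
vector_hits = ["doc-b", "doc-d", "doc-a"]  # semantic path

fused = rrf([bm25_hits, vector_hits])
# Documents ranked well by both paths rise to the top.
```

Because RRF operates on ranks rather than raw scores, the two retrieval paths need no score calibration against each other, which is what lets it replace a learned reranking model here.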

Outline Index is the core technology behind Linkly AI. If you’re interested in the implementation details, you can read the technical article here: Outlines Index: A Progressive Disclosure Method for Exposing Large Document Collections to AI Agents.

If you want to experience it in action, download Linkly AI and linkly-ai-cli, then connect them to an AI client. The real-world results are far better than RAG.


From the Linkly AI team.