What is RAG, and when does a business need it?


James Finnegan, Co-founder · written from the build

RAG, retrieval-augmented generation, is the most useful AI architecture for a business that has its own data and wants AI to actually use it. In one sentence: RAG is a way of giving an AI model access to your specific information at the moment it answers a question, without having to retrain the model.

If your team has documents, records, transcripts, contracts, customer history, internal wikis, regulatory filings, or any other body of organization-specific knowledge that you want AI to draw on, the answer is almost always RAG. This article explains what RAG actually is, how it works, the alternatives, and the three questions that decide whether your business needs one.

What RAG actually is

Think of a large language model (Claude, GPT, Gemini) as a smart generalist. It has read a wide slice of the public internet during training, so it can answer most general questions. But it has never seen your data. It does not know what your company sold last quarter, what your client said in the meeting last Tuesday, or what your compliance policy says about retention.

RAG is the layer that fixes that. It works like this: when someone asks the model a question, the system first retrieves the most relevant pieces of your specific data (three paragraphs from the policy, the line from the meeting transcript, the row from the database). Then it augments the model's prompt with those pieces, attached as context. Then the model generates an answer that is grounded in your actual information instead of its general training.

The acronym, in order: Retrieval (find the right data), Augmented (attach it as context), Generation (the model writes the answer).
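The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the word-overlap ranking stands in for real semantic search, `echo_model` is a hypothetical placeholder for an actual model call, and every function name here is an assumption made for the example.

```python
import re
from typing import Callable


def words(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Retrieval: rank documents by words shared with the question, keep the top k."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]


def augment(question: str, context: list[str]) -> str:
    """Augmented: attach the retrieved passages to the prompt as context."""
    bullet_list = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{bullet_list}\n\nQuestion: {question}"


def generate(prompt: str, model: Callable[[str], str]) -> str:
    """Generation: hand the augmented prompt to whatever model you use."""
    return model(prompt)


def answer(question: str, documents: list[str], model: Callable[[str], str]) -> str:
    return generate(augment(question, retrieve(question, documents)), model)


# Hypothetical sample data and a placeholder "model" that just echoes its prompt.
documents = [
    "Retention policy: customer records are kept for seven years.",
    "Q3 sales rose in the northeast region.",
    "The office closes at 6pm on Fridays.",
]


def echo_model(prompt: str) -> str:
    return prompt


print(answer("How long are customer records kept?", documents, echo_model))
```

Printing the echoed prompt shows exactly what the model would receive: the retention policy line sits in the context, ahead of the question.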

A useful analogy: imagine giving a sharp consultant a question and a fat folder of relevant background documents. The consultant is the model. The folder is your data. RAG is the librarian who pulled the right documents and put them in the consultant's hands before the consultant started to think.

How RAG actually works, step by step

Four steps. Walk them in order.

Step one: index your data. Before any question is asked, the system breaks your documents into chunks and stores them in a way that makes them searchable by meaning, not just by keyword. The technical name for this is "embedding." Each chunk gets converted into a vector representation that captures what it is about. (More on embeddings, including the part where this is genuinely interesting, in a future explainer.)
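Here is a rough sketch of the indexing step. The chunk size, the overlap, and the record layout are assumptions chosen for illustration, and `toy_embed` is only a word-count stand-in: a real system would call a learned embedding model at that point.

```python
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows that overlap, so ideas at a boundary are not cut in half."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    all_words = text.split()
    step = max_words - overlap
    chunks = []
    start = 0
    while start < len(all_words):
        chunks.append(" ".join(all_words[start:start + max_words]))
        if start + max_words >= len(all_words):
            break  # this window already reached the end of the text
        start += step
    return chunks


def toy_embed(chunk: str) -> Counter:
    """Placeholder embedding: word counts. A real embedding model returns a dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def build_index(documents: dict[str, str]) -> list[dict]:
    """Turn each document into chunk records carrying text, vector, and source info."""
    index = []
    for doc_id, text in documents.items():
        for chunk_no, chunk in enumerate(chunk_text(text)):
            index.append(
                {"doc_id": doc_id, "chunk_no": chunk_no, "text": chunk, "vector": toy_embed(chunk)}
            )
    return index
```

Keeping `doc_id` and `chunk_no` on every record is a small design choice that pays off later: it lets an answer point back to the document it came from.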

Step two: a user asks a question. The system takes the question, embeds it the same way, and finds the chunks of your data that are most semantically similar. It returns the top three, five, or ten, depending on how the system is tuned.
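"Most semantically similar" usually comes down to comparing vectors. A common measure is cosine similarity, sketched below. The three-number vectors are hand-made stand-ins (real embeddings have hundreds or thousands of dimensions), and the function names are assumptions for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 means same direction, close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3):
    """Score every stored chunk against the query and return the k best (score, text) pairs."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made vectors standing in for real embeddings.
index = [
    ("retention policy", [0.9, 0.1, 0.0]),
    ("sales report", [0.1, 0.9, 0.0]),
    ("meeting notes", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.0]  # pretend this is the embedded question

for score, text in top_k(query, index, k=2):
    print(f"{score:.2f}  {text}")
```

Scanning every chunk like this is fine for a small index. Larger systems typically hand the search to a vector database that uses approximate nearest-neighbor lookup instead.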

Step three: those chunks get attached to the prompt. The model never sees your full database. It only sees the question plus the retrieved chunks. This is the security and cost win: only a handful of relevant chunks go to the model on each request, and you don't have to stuff the entire policy library into every prompt.
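A sketch of the attachment step, assuming the chunks arrive already sorted best-first. The character budget, the numbered-source format, and the instruction wording are all illustrative choices, not a standard.

```python
def build_prompt(question: str, chunks: list[str], max_context_chars: int = 2000) -> str:
    """Attach retrieved chunks to the question, stopping once the context budget is spent."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        block = f"[{number}] {chunk}"
        if used + len(block) > max_context_chars:
            break  # chunks are best-first, so the ones dropped are the least relevant
        parts.append(block)
        used += len(block)
    context = "\n".join(parts)
    return (
        "Use only the numbered context below to answer. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "How long are customer records kept?",
    ["Customer records are kept for seven years.", "Backups are purged after 90 days."],
)
print(prompt)
```

Numbering the chunks gives the model something to cite, which makes it easier to check an answer against its source.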

Step four: the model answers. Grounded in the retrieved chunks, the model produces a response that reflects your data, not its training. If the retrieved chunks aren't enough to answer the question, a well-built RAG system says so honestly. (When the answer comes back wrong despite the right chunks, that is a hallucination problem. Different fix, different article.)
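One simple way to get the "says so honestly" behavior is to decline before the model is ever called when nothing retrieved scores high enough. The sketch below assumes similarity scores between 0 and 1; the 0.3 cutoff is an arbitrary illustrative value that would need tuning on real questions, and `model` is any callable that takes a prompt and returns text.

```python
from typing import Callable

NO_ANSWER = "I could not find that in the documents I have access to."


def answer_or_decline(
    question: str,
    scored_chunks: list[tuple[float, str]],
    model: Callable[[str], str],
    min_score: float = 0.3,  # assumed threshold; tune it against real queries
) -> str:
    """Answer only when at least one retrieved chunk clears the relevance threshold."""
    relevant = [text for score, text in scored_chunks if score >= min_score]
    if not relevant:
        return NO_ANSWER  # skip the model call entirely
    context = "\n".join(relevant)
    prompt = (
        "Answer from the context only. If it does not contain the answer, "
        f"reply exactly: {NO_ANSWER}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return model(prompt)
```

The threshold catches the case where retrieval found nothing useful. The instruction inside the prompt is a second line of defense for the case where chunks scored well but still do not contain the answer.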

Why RAG instead of just prompting or fine-tuning

There are three ways to make an AI model behave the way your business needs. RAG is one. The other two are prompting and fine-tuning. They are not interchangeable.

Prompting means writing better instructions in the request. Useful for shaping tone, format, and approach. Useless for giving the model knowledge it does not have. If your question depends on a document the model has never seen, no amount of prompting changes the answer.

Fine-tuning means retraining the model on examples of your specific work. Useful when you want the model to mimic a particular style, classify in your specific way, or behave consistently across thousands of similar tasks. Expensive (typically tens of thousands of dollars to do well), slow (weeks of iteration), and brittle (every time you change the data, you re-tune).

RAG gives the model access to your data at the moment it needs it, without retraining. The data can change daily (new contracts, new policies, new transcripts) and the model picks up the change immediately. No retraining. No long iteration cycle.
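The "picks up the change immediately" point is easy to see in a toy version. In this sketch the `KnowledgeBase` class is hypothetical and uses plain word overlap instead of embeddings, but the property it shows carries over: updating the index is a data write, not a training run.

```python
import re


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class KnowledgeBase:
    """Toy searchable store: documents are retrievable as soon as they are written."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        """Add a new document, or replace the old version stored under the same id."""
        self._docs[doc_id] = text

    def search(self, question: str, k: int = 1) -> list[str]:
        """Return up to k documents that share at least one word with the question."""
        q = _words(question)
        scored = [(len(q & _words(text)), text) for text in self._docs.values()]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


kb = KnowledgeBase()
question = "How many remote work days are allowed?"

before = kb.search(question)  # nothing indexed yet
kb.upsert("remote-policy", "Remote work is allowed two days per week.")
after_first = kb.search(question)  # the new policy is retrievable right away
kb.upsert("remote-policy", "Remote work is allowed three days per week.")
after_update = kb.search(question)  # the revised policy replaces the old one

print(before, after_first, after_update, sep="\n")
```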

For most business workflows, RAG is the right answer. Prompting handles the style; RAG handles the substance. Fine-tuning becomes worth considering only when neither of the first two gets close enough to the behavior you need.

Three questions that decide whether you need RAG

A business does not need RAG every time it touches AI. Three questions filter the cases where RAG is the right move.

1. Does the AI need to draw on data that is specific to your business? If yes, you need RAG (or fine-tuning, but usually RAG). If no (the work is general-purpose writing, summarizing public content, or pure reasoning over what the user provided in the prompt), RAG adds cost and complexity without buying anything.

2. Does that data change often? If your knowledge base updates weekly, monthly, or in real time, RAG is built for that. New documents go into the index and become available immediately. Fine-tuning struggles here because every update means another training cycle.

3. Is the data too big to fit into a single prompt? Even with modern context windows running into the hundreds of thousands or millions of tokens, most business knowledge bases are larger than a single prompt can hold. RAG retrieves only the relevant pieces, which keeps cost down and accuracy up. If your data is small (say, a single 50-page policy document), you can sometimes skip RAG and stuff the whole thing into the prompt every time.

If your answer is yes to one of these, RAG is the conversation to have. If your answer is yes to two or three, it is the architecture you need.
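Question three can be checked with back-of-envelope arithmetic, and the scoring rule above is simple enough to write down. In this sketch the 500 words per page and 1.3 tokens per word are rough assumptions (real figures vary with layout and tokenizer), and the function names and the reserve for the question and answer are made up for illustration.

```python
def estimate_tokens(pages: int, words_per_page: int = 500, tokens_per_word: float = 1.3) -> int:
    """Very rough token estimate for a document; both ratios are assumptions."""
    return round(pages * words_per_page * tokens_per_word)


def fits_in_prompt(pages: int, context_window_tokens: int, reserve_tokens: int = 4000) -> bool:
    """True if the whole document fits, leaving room for the question and the answer."""
    return estimate_tokens(pages) <= context_window_tokens - reserve_tokens


def rag_recommendation(
    needs_business_data: bool, data_changes_often: bool, too_big_for_prompt: bool
) -> str:
    """Apply the rule from the text: one yes starts the conversation, two or three settles it."""
    yes_count = sum([needs_business_data, data_changes_often, too_big_for_prompt])
    if yes_count == 0:
        return "RAG adds cost and complexity without buying anything"
    if yes_count == 1:
        return "RAG is the conversation to have"
    return "RAG is the architecture you need"


print(estimate_tokens(50))           # a 50-page policy under these assumptions
print(fits_in_prompt(50, 200_000))   # against an assumed 200,000-token window
print(rag_recommendation(True, True, False))
```

Under these assumptions a 50-page policy comes to roughly 32,500 tokens, which is why a single small document can sometimes go straight into the prompt. A knowledge base of thousands of pages cannot.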

What most people get wrong about RAG

Three common misconceptions worth naming.

RAG is not a database. It is a layer that rides over one. You still need a place to put your data: the chunks, the embeddings, the metadata. The vector database is the storage. RAG is the retrieval pattern that uses it.
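To make the storage side concrete, here is a sketch of what one stored record might hold: the chunk, its embedding, and its metadata. The field names and the `InMemoryVectorStore` class are hypothetical, and a real vector database adds persistence and fast similarity search on top of this shape.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    chunk_id: str
    text: str                      # the chunk itself
    embedding: list[float]         # its vector representation
    metadata: dict[str, str] = field(default_factory=dict)  # source, owner, date, ...


class InMemoryVectorStore:
    """Storage only: it keeps records and filters them. Retrieval logic lives elsewhere."""

    def __init__(self) -> None:
        self._records: dict[str, ChunkRecord] = {}

    def put(self, record: ChunkRecord) -> None:
        self._records[record.chunk_id] = record

    def find(self, **filters: str) -> list[ChunkRecord]:
        """Return records whose metadata matches every given key/value pair."""
        return [
            record
            for record in self._records.values()
            if all(record.metadata.get(key) == value for key, value in filters.items())
        ]


store = InMemoryVectorStore()
store.put(ChunkRecord("hr-1", "Remote work is allowed two days per week.", [0.9, 0.1], {"dept": "hr"}))
store.put(ChunkRecord("fin-1", "Expense reports are due monthly.", [0.2, 0.8], {"dept": "finance"}))

print([record.chunk_id for record in store.find(dept="hr")])
```

Metadata filtering like `find(dept="hr")` is one common way to narrow the candidates before any similarity scoring happens.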

RAG is not magic. The quality of the answer depends almost entirely on the quality of the retrieval. If the system retrieves the wrong three paragraphs, the model writes a confident but wrong answer. The most important engineering in any RAG system is not the model choice; it is the retrieval logic.

A bigger model does not fix a bad RAG. Switching from a mid-tier model to a frontier model improves a RAG system at the margin. Fixing the retrieval (better chunking, better embeddings, better filtering, better re-ranking) improves it by an order of magnitude. The model is the cheap fix. The retrieval layer is where the work lives.
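Tuning retrieval starts with measuring it. One simple measure is hit rate at k: for a set of test questions where you already know which chunk holds the answer, how often does that chunk appear in the top k results? The sketch below assumes you have built such a labeled set, and the chunk ids and questions are made up.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of questions whose known-correct chunk id appears in the top k retrieved ids."""
    if not expected:
        return 0.0
    hits = sum(
        1 for question, gold_id in expected.items() if gold_id in results.get(question, [])[:k]
    )
    return hits / len(expected)


# Hypothetical retrieval output: ranked chunk ids per test question.
results = {
    "q1": ["c1", "c2", "c3"],
    "q2": ["c9", "c4", "c5"],
    "q3": ["c7", "c8", "c6"],
}
# Hand-labeled ground truth: the chunk that actually answers each question.
expected = {"q1": "c1", "q2": "c4", "q3": "c6"}

for k in (1, 2, 3):
    print(k, round(hit_rate_at_k(results, expected, k), 2))
```

Rerunning the same test set after each change to chunking, embeddings, filtering, or re-ranking shows whether the change helped, independent of which model writes the final answer.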

The one thing to remember

If you take one idea from this: RAG is not a switch you turn on. It is a retrieval system you build and tune, and the quality of what it retrieves sets the ceiling on every answer it can produce.

That is the shape of most production AI worth running. The model is rarely the hard part. The system around it (what it can see, what it remembers, what it retrieves at the moment it answers) is the work, and it is the part that decides whether the thing is actually useful.

— James


Straterai Field Notes
