Multi-Hop RAG at Scale: Inside Teilor's Opal Knowledge Base
Most production RAG systems break the moment a user asks a real question. Demos use queries like "What is our refund policy?" — a single, atomic lookup. Real users ask "If a customer bought a 1.2 carat solitaire last March on the installment plan, what is their remaining balance, and what is our exchange policy if they want to upsize?"
That question is not one retrieval. It is a chain: identify the customer, find the order, look up payment history, look up the exchange policy, synthesize an answer with citations. Single-shot RAG cannot do it. Multi-hop RAG can, and getting it right in production took us six months of iteration on Teilor's Opal Knowledge Base — the AI assistant we built for Romania's premier luxury jewelry brand.
This article is the architecture, the trade-offs, and the things that broke.
The setup
Teilor's employees needed a single AI assistant that could:
- Answer policy questions from hundreds of internal documents (PDFs, Word files, presentations, spreadsheets)
- Pull live data from their Opal ERP — products, customers, orders, invoices, sales opportunities
- Hold conversational context across follow-up questions
- Cite every claim with a source
- Work in Romanian and English, often within the same conversation
Three retrievals from three different systems, blended into one streamed answer. With citations. With per-user authentication so each employee sees only what they are allowed to see. In production. With acceptable latency.
The architecture, in one breath
The system is three services in three languages, each chosen for its strengths:
- Go backend — orchestration, RAG pipeline, query classification, streaming, persistence, background jobs
- Python FastAPI microservice — cross-encoder reranking with BAAI/bge-reranker-v2-m3
- React 19 frontend — chat UI, real-time messaging, PWA shell
The data plane:
- Qdrant for vector search over chunked document embeddings (bge-m3, multilingual)
- PostgreSQL for conversation history, user data, job queue (via River)
- Opal ERP REST API for live CRM data, accessed with per-user bearer credentials
- Anthropic Claude for response generation, streamed via Server-Sent Events
The reason this stack works: each system answers a different kind of question, and the orchestrator decides which to invoke.
Query classification — the unsung hero
Single-shot RAG sends every question through the same pipeline. Multi-hop RAG starts by deciding what kind of question this is. Teilor's classifier sorts incoming queries into four buckets:
- Document lookup. "What is our return policy for engagement rings?" → vector search over documents, no CRM call.
- CRM lookup. "Show me orders from customer 4821." → CRM plugin call, no document search.
- Hybrid. "What is the exchange policy for the order I'm looking at?" → both, blended.
- Conversational. "Thanks, that's helpful." → no retrieval, direct LLM reply.
The classifier is itself a small LLM call, with a prompt that includes recent conversation context. It is fast (under 400ms), and it determines the entire downstream pipeline. Get it wrong and the user gets either irrelevant document chunks or a hallucinated CRM lookup. Get it right and the rest of the system has clean inputs.
We tried hand-rolled heuristics first (keyword matching, regex). They failed badly on multilingual queries. The LLM classifier won outright and we never looked back.
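In sketch form, the classifier is a single constrained LLM call. The Python below is an illustration, not the shipped Go code: the prompt wording, the model alias, and the `classify` helper are all assumptions.

```python
# A minimal sketch of the classifier call, assuming the Anthropic Python SDK.
import anthropic

LABELS = {"document", "crm", "hybrid", "conversational"}

SYSTEM_PROMPT = (
    "You route queries for an internal assistant. Given the recent conversation "
    "and the latest user message, reply with exactly one word: "
    "document, crm, hybrid, or conversational."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def classify(recent_turns: list[str], user_message: str) -> str:
    context = "\n".join(recent_turns[-6:])  # keep the context window small for speed
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # a small, fast model; the exact choice is an assumption
        max_tokens=5,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Conversation:\n{context}\n\nQuery: {user_message}",
        }],
    )
    label = resp.content[0].text.strip().lower()
    return label if label in LABELS else "document"  # fall back to the cheapest safe path
```

The fallback on an unexpected label matters more than it looks: a misparse should degrade to a plain document lookup, not an error in the user's face.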
Document retrieval — the parts that matter
The document side is the most "textbook RAG" part of the stack, but several details were not obvious in advance.
Docling for parsing, not pdf-extract. We started with naive PDF text extraction. Tables shredded, footnotes inlined as if they were body text, slide decks rendered as gibberish. Docling — IBM's open-source document understanding library — extracts structure: tables stay as tables, slide decks parse per slide, headings become headings. The retrieval quality improvement was not subtle.
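For reference, Docling's basic conversion path is only a few lines; the file name here is a placeholder, and the real ingestion job wraps this with chunking and metadata extraction.

```python
# Parse a document with Docling, preserving tables, headings, and slide structure.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("returns_policy.pdf")  # placeholder path; also accepts URLs
markdown = result.document.export_to_markdown()   # structured markdown, ready for chunking
```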
bge-m3 for embeddings, run via Ollama. Multilingual was non-negotiable; bge-m3 handles 100+ languages well and is small enough to run on commodity GPUs. Self-hosted via Ollama keeps document content on Teilor's infrastructure.
Chunking by structure, not by token count. The default "1,000 tokens with 200-token overlap" produced chunks that crossed section boundaries and confused retrieval. Chunking by document structure (section → subsection → paragraph) made retrieval substantially more precise.
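A minimal version of structure-aware chunking, assuming the parsed document is already markdown (as Docling can emit): split on headings first, and only fall back to paragraph packing inside oversized sections. The real pipeline tracks more levels and carries section metadata, so treat this as the shape of the idea.

```python
# Structure-aware chunking sketch: never cross a heading boundary, and only pack
# paragraphs together inside a section that is too large on its own.
import re


def chunk_by_structure(markdown: str, max_chars: int = 2000) -> list[str]:
    chunks: list[str] = []
    # Split before each heading line so every section keeps its own heading.
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: fall back to paragraph boundaries within it.
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append(buf.strip())
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(buf.strip())
    return chunks
```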
Reranking is not optional. Top-10 vector search results are noisy. We retrieve top-50, then rerank with the bge-reranker-v2-m3 cross-encoder (running in the Python microservice), keeping the top 5-8 for the LLM. Latency cost: ~300ms. Quality gain: the difference between "useful answer" and "embarrassing wrong answer."
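One way to run that reranking step, roughly as the Python microservice does it, is the sentence-transformers CrossEncoder wrapper; the `rerank` helper and the keep count below are illustrative.

```python
# Rerank vector-search candidates with a cross-encoder, keeping only the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)  # loaded once at startup


def rerank(query: str, candidates: list[str], keep: int = 8) -> list[str]:
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:keep]]
```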
CRM retrieval — the harder problem
CRM access turned out to be where most of the engineering went. Three reasons:
Authorization is per-user. Each employee's view of the CRM is different. The AI assistant cannot run with a service account — it must forward the user's own credentials to the Opal ERP, so what the user can see in the assistant exactly matches what they can see in the ERP. We built a per-request bearer token forwarding layer that propagates the user's session token through the LLM tool call into the upstream REST request.
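The forwarding layer itself is Go; this is a hedged Python sketch of the idea, in which the base URL, path handling, and `call_opal` helper are placeholders rather than the real integration.

```python
# Forward the user's own session token to the ERP; never substitute a service account.
import httpx

OPAL_BASE_URL = "https://opal.example.com/api"  # placeholder, not the real endpoint


def call_opal(path: str, params: dict, user_token: str) -> dict:
    resp = httpx.get(
        f"{OPAL_BASE_URL}/{path}",
        params=params,
        headers={"Authorization": f"Bearer {user_token}"},
        timeout=10.0,
    )
    resp.raise_for_status()  # surface auth failures loudly instead of letting the model guess
    return resp.json()
```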
Tool design matters more than model intelligence. We exposed five Opal endpoints to Claude as tools — findProducts, findCustomers, findOrders, findInvoices, findOpportunities. Each tool has a tight schema (specific filter parameters, paginated responses, max result limits). The first version exposed a single generic queryOpal(endpoint, params) tool, and the model immediately started constructing wrong endpoint URLs. Tightening the tool surface to aggregate-level operations fixed it.
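For a sense of what "tight schema" means in practice, here is one tool definition in Anthropic's tool format, written as a Python dict; the specific filter fields, enum values, and limits are illustrative rather than Opal's real parameters.

```python
# One tool definition in Anthropic's tool format. The filters shown are illustrative.
FIND_ORDERS_TOOL = {
    "name": "findOrders",
    "description": "Look up orders in the Opal ERP for a specific customer. "
                   "Returns at most `limit` orders, newest first.",
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "Opal customer ID"},
            "status": {"type": "string", "enum": ["open", "paid", "cancelled"]},
            "created_after": {"type": "string", "description": "ISO 8601 date filter"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 20},
        },
        "required": ["customer_id"],
    },
}
```

The enum and the hard cap on `limit` do real work here: they take whole classes of malformed calls off the table before the model can make them.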
Result formatting feeds back into hallucination risk. When a CRM tool returns 50 fields per result, the model gets distracted and hallucinates fields. We learned to project results down to the 5-10 fields the user actually asked about, before they ever reach the LLM. Less context, more accuracy.
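The projection step is deliberately dumb, something like the sketch below; the field names are illustrative.

```python
# Project raw ERP records down to the few fields the answer actually needs.
ORDER_FIELDS = ("order_id", "customer_name", "created_at", "status", "total", "currency")


def project(records: list[dict], fields: tuple[str, ...] = ORDER_FIELDS) -> list[dict]:
    return [{k: r[k] for k in fields if k in r} for r in records]
```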
Multi-hop orchestration — where it gets interesting
For hybrid queries ("policy + live data"), the orchestration runs:
- Classifier decides "hybrid."
- Query is decomposed into a document part and a CRM part by a planning LLM call.
- Document retrieval and CRM retrieval run in parallel.
- Both result sets are passed to Claude with a synthesis prompt: "Answer the user's question using these documents and this CRM data. Cite each claim with the source. Do not invent fields not present in the data."
- The answer streams back to the user with citation markers inline.
Two patterns we tried and abandoned:
- Sequential planning ("first retrieve docs, then decide if CRM is needed"). Too slow; doubles the latency for hybrid queries.
- Self-asking ("model generates follow-up retrievals as needed"). Hallucinated retrievals frequently; the model would invent a findReturnPolicy tool that did not exist.
Parallel retrieval with up-front decomposition was the right trade-off for our use case.
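Stripped of streaming and error handling, the hybrid path reduces to a few lines. This Python sketch stands in for the Go orchestrator; the decompose/retrieve/synthesize callables are hypothetical stage functions, injected here only to keep the sketch self-contained.

```python
# Hybrid-path orchestration sketch: decompose once, retrieve in parallel, synthesize once.
import asyncio
from typing import Awaitable, Callable


async def answer_hybrid(
    query: str,
    user_token: str,
    decompose: Callable[[str], Awaitable[tuple[str, str]]],
    retrieve_documents: Callable[[str], Awaitable[list[str]]],
    retrieve_crm: Callable[[str, str], Awaitable[list[dict]]],
    synthesize: Callable[[str, list[str], list[dict]], Awaitable[str]],
) -> str:
    doc_query, crm_query = await decompose(query)          # one planning LLM call
    doc_chunks, crm_rows = await asyncio.gather(            # both retrievals at once
        retrieve_documents(doc_query),
        retrieve_crm(crm_query, user_token),
    )
    return await synthesize(query, doc_chunks, crm_rows)    # grounded answer with citations
```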
Citations — the trust layer
Every claim in every answer is grounded in either a document chunk or a CRM record. The frontend renders inline citation markers ([1], [2]) that expand into the actual source. Employees can verify the answer in two clicks, and they do — internal trust in the assistant came from the citations as much as from the answer quality.
Implementation: the LLM is prompted to emit citations in a specific format. The backend parses citations out of the streamed response and attaches them to the user-facing message. Sources that the model claims to cite but cannot be linked back to retrieved chunks are flagged in logs (hallucinated citations happen; you need to know).
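The check for unlinkable citations is simple; a sketch, assuming numeric [n] markers and a map from citation index to retrieved source:

```python
# Flag citation markers that do not map back to any retrieved source.
import re

CITATION = re.compile(r"\[(\d+)\]")


def unlinked_citations(answer: str, sources: dict[int, str]) -> list[int]:
    cited = {int(n) for n in CITATION.findall(answer)}
    return sorted(cited - sources.keys())  # candidates for the hallucinated-citation log
```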
What we measure
The metrics that matter in production are not the academic RAG metrics:
- Citation accuracy — fraction of claims that match the cited source. Manual sample, weekly. Target: >95%.
- Time to first token — under 1.5s for single-hop, under 3s for hybrid.
- Tool call success rate — fraction of CRM tool calls that return a non-empty result. Low rate means classifier or query construction is broken.
- Hallucinated citation rate — citations that point to retrieved chunks but make claims not supported by them. Tracked, alerted on.
- User-flagged "wrong answer" rate — the only metric that ultimately matters.
We do not chase a single benchmark number. We track all five and act when any of them degrades.
What broke (and how we fixed it)
A few war stories:
- The Romanian tokenizer issue. Early versions chunked Romanian documents using a tokenizer that miscounted diacritics, producing chunks of wildly varying length. Retrieval quality cratered. Fixed by switching to a tokenizer with proper Unicode handling.
- The CRM credential expiry cascade. When a user's session token expired mid-conversation, every CRM tool call started failing silently. The model would helpfully "infer" answers. Fixed with proactive token refresh + explicit tool-failure surfacing in the prompt.
- The reranker timeout under load. The Python reranker microservice was single-threaded by default and started timing out when 10+ users queried at once. Fixed by adding a request queue with proper concurrency limits (sketched below).
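The concurrency fix amounts to an admission gate in front of the model. A minimal FastAPI sketch, assuming the request shape and the semaphore size rather than reproducing our service:

```python
# Bound concurrent reranking so a burst of requests queues instead of timing out.
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import CrossEncoder


class RerankRequest(BaseModel):
    query: str
    passages: list[str]


reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
rerank_slots = asyncio.Semaphore(4)  # 4 is illustrative; tune to the hardware
app = FastAPI()


@app.post("/rerank")
async def rerank(req: RerankRequest) -> dict:
    async with rerank_slots:  # excess requests wait here instead of piling onto the model
        scores = await asyncio.to_thread(
            reranker.predict, [(req.query, p) for p in req.passages]
        )
    return {"scores": [float(s) for s in scores]}
```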
None of these were predicted in the design phase. All of them showed up within the first two weeks of real-user traffic.
What we would do differently
If we started this build today (May 2026):
- Skip the homemade query planner. We would use a framework like LangGraph or DSPy for the orchestration layer rather than rolling our own. The hand-rolled version works, but the maintenance cost is real.
- Invest earlier in evaluation infrastructure. We measured by hand for too long. A small homegrown eval harness with 100 reference queries and graded answers would have paid for itself by month two.
- Plan for tool authorization from day one. Per-user credential forwarding was retrofitted. It should have been a load-bearing design constraint from the first sprint.
The bigger picture
Multi-hop RAG is not a feature. It is the difference between "demo-grade AI assistant" and "production AI assistant for a business." The added complexity — query classification, parallel retrieval, citation grounding, tool authorization, reranking — is what turns "neat" into "trustworthy."
The Teilor team uses Opal KB every day. The most common feedback is not "this is impressive AI" — it is "I found the answer in 10 seconds instead of 20 minutes." That is what production RAG looks like when it works. Boring, fast, cited, and quietly indispensable.
If you are building a RAG system in 2026 and seeing impressive demos that fall apart on real questions, the gap is almost certainly multi-hop orchestration. The patterns above are not novel; they are just what it takes to get there.