Making LLMs Much Smarter: Understanding Multi-Turn RAG Systems
Running a few evaluations on Retrieval-Augmented Generation (RAG) applications reminded me of the early days of deep learning, when subtle changes in data or architecture could have outsized effects on final performance.
RAG is evolving in a similar way: a single-turn query might be handled beautifully by a large language model, but introduce multiple turns and clear gaps in performance suddenly appear.
Why Multi-Turn Conversations Are So Challenging
In many existing QA and IR benchmarks, the scenario is straightforward: we have a single question and a static set of documents.
The real world, though, is anything but static.
Human conversation winds and twists, with people following tangents or referencing past remarks.
Standard benchmarks often miss these complexities — especially the dynamic context updates that real users expect.
Some multi-turn benchmarks do exist, but many ignore the retrieval component.
From a software engineering perspective, that’s like testing only the “dialogue orchestration” submodule and leaving out the entire “retrieval microservice” that supplies the contextual knowledge.
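To make that analogy concrete, here is a minimal sketch of a multi-turn RAG loop with retrieval kept inside the loop. The retrieve and generate functions, the toy word-overlap scoring, and the example corpus are illustrative stand-ins I am assuming for this post rather than part of MTRAG or any particular library; the point is only that retrieval is re-run with fresh conversational context on every turn, which is exactly the piece that single-turn benchmarks never exercise.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One user/assistant exchange, kept so later turns can reference it."""
    user: str
    assistant: str
    retrieved: list[str] = field(default_factory=list)


def retrieve(query: str, history: list[Turn], corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: scores passages by word overlap with the query plus the
    previous user turn, standing in for a real dense or sparse retriever."""
    context_words = set(query.lower().split())
    if history:  # fold in the last turn so follow-ups and references resolve
        context_words |= set(history[-1].user.lower().split())
    scored = sorted(corpus, key=lambda p: -len(context_words & set(p.lower().split())))
    return scored[:k]


def generate(query: str, passages: list[str], history: list[Turn]) -> str:
    """Stub generator: a real system would prompt an LLM with the running
    conversation plus the freshly retrieved passages."""
    return f"(answer to '{query}' grounded in {len(passages)} passages)"


def run_conversation(user_turns: list[str], corpus: list[str]) -> list[Turn]:
    history: list[Turn] = []
    for query in user_turns:
        passages = retrieve(query, history, corpus)  # retrieval happens on every turn
        answer = generate(query, passages, history)
        history.append(Turn(query, answer, passages))
    return history


if __name__ == "__main__":
    corpus = [
        "MTRAG is a multi-turn RAG benchmark from IBM Research.",
        "Retrieval-augmented generation grounds answers in retrieved passages.",
        "Follow-up questions often refer back to earlier turns.",
    ]
    for turn in run_conversation(["What is MTRAG?", "Which company released it?"], corpus):
        print(turn.user, "->", turn.assistant)
```

An evaluation that only scores the generate step, with passages handed to it for free, tells you nothing about how the retrieve step degrades as the conversation drifts.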
That’s where benchmarks such as MTRAG come in handy.
MTRAG from IBM Research is the first end-to-end human-generated multi-turn RAG…