
Making LLMs Much Smarter: Understanding Multi-Turn RAG Systems

Agent Issue
8 min read · Jan 18, 2025


Running a few evaluations on Retrieval-Augmented Generation (RAG) applications reminded me of the early days of deep learning, when subtle changes in data or architecture could have outsized effects on final performance.

RAG is evolving in a similar way: a large language model might handle a single-turn query beautifully, but introduce multiple turns and clear gaps in performance suddenly appear.

Why Multi-Turn Conversations Are So Challenging

In many existing question-answering (QA) and information-retrieval (IR) benchmarks, the scenario is straightforward: a single question and a static set of documents.

The real world, though, is anything but static.

Human conversation winds and twists, with people following tangents or referencing past remarks.

Standard benchmarks often miss these complexities — especially the dynamic context updates that real users expect.
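To make "dynamic context updates" concrete, here is a minimal sketch of conversational query rewriting, one common way multi-turn RAG systems resolve references to past remarks before retrieving. Everything here is hypothetical: the `llm` stub stands in for whatever completion API your stack uses, and `rewrite_query` and the Acme example are illustrative, not any library's actual interface.

```python
# Minimal sketch of conversational query rewriting (hypothetical names
# throughout; `llm` stands in for any chat-completion API).

def llm(prompt: str) -> str:
    # Stub so the sketch runs; a real system would call a model here.
    return "What is the pricing of the Acme vector database?"

def rewrite_query(history: list[tuple[str, str]], follow_up: str) -> str:
    """Rewrite a context-dependent follow-up into a standalone query."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = (
        "Rewrite the last user question so it makes sense on its own.\n"
        f"{transcript}\nuser: {follow_up}\nStandalone question:"
    )
    return llm(prompt).strip()

history = [
    ("user", "Tell me about the Acme vector database."),
    ("assistant", "Acme is a managed store for embeddings..."),
]
print(rewrite_query(history, "What about its pricing?"))
# -> "What is the pricing of the Acme vector database?"
```

Without a step like this, the retriever only ever sees "What about its pricing?" and has no way to know what "its" refers to, which is exactly the gap single-turn benchmarks never surface.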

Some multi-turn benchmarks do exist, but many ignore the retrieval component.

From a software engineering perspective, that’s like testing only the “dialogue orchestration” submodule and leaving out the entire “retrieval microservice” that supplies the contextual knowledge.
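As a rough sketch of what testing the whole pipeline means, the loop below retrieves fresh context on every turn, so evaluating it exercises the retrieval path alongside the dialogue logic. The toy corpus, the lexical scorer, and the `llm` stub are all assumptions for illustration, not any particular framework's API.

```python
# End-to-end sketch: retrieval runs on every turn, so an evaluation of
# this loop covers the retrieval component as well as the dialogue
# orchestration. All names here are illustrative stand-ins.

CORPUS = [
    "Acme offers a free tier and a usage-based paid plan.",
    "Acme supports approximate nearest-neighbor search over embeddings.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Toy lexical scorer: rank documents by token overlap with the query.
    terms = set(query.lower().split())
    return sorted(CORPUS, key=lambda d: -len(terms & set(d.lower().split())))[:k]

def llm(prompt: str) -> str:
    return "Acme has a free tier plus a usage-based paid plan."  # stub

def answer_turn(history: list[tuple[str, str]], user_msg: str) -> str:
    # Naive context carryover: fold recent turns into the retrieval query.
    # A real system would rewrite the query as sketched earlier.
    recent = " ".join(text for _, text in history[-2:])
    context = "\n".join(retrieve(f"{recent} {user_msg}"))
    reply = llm(f"Context:\n{context}\n\nQuestion: {user_msg}\nAnswer:")
    history += [("user", user_msg), ("assistant", reply)]
    return reply

history = [("user", "Tell me about Acme."), ("assistant", "Acme stores embeddings.")]
print(answer_turn(history, "What about its pricing?"))
```

A benchmark that skips the `retrieve` call is only grading the last step of this loop, which is the gap the analogy above points at.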

That’s where benchmarks such as MTRAG come in handy.

MTRAG, from IBM Research, is the first end-to-end, human-generated multi-turn RAG benchmark.
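
I have not reproduced MTRAG's exact data schema here, but an evaluation harness for any multi-turn RAG benchmark tends to look like the sketch below: replay each conversation turn by turn, generate an answer, and score it against the reference. The `conversations` layout and the token-level F1 scorer are generic assumptions for illustration, not MTRAG's actual format or metrics suite.

```python
# Generic multi-turn evaluation sketch (an assumed conversation layout,
# not MTRAG's actual schema). Token-level F1 is a common reference-based
# score for generated answers.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(conversations, answer_turn) -> float:
    # `answer_turn(history, question) -> str` is the system under test,
    # e.g. the retrieval-inclusive loop sketched earlier.
    scores = []
    for conv in conversations:
        history: list[tuple[str, str]] = []
        for turn in conv["turns"]:
            prediction = answer_turn(history, turn["question"])
            scores.append(token_f1(prediction, turn["reference"]))
    return sum(scores) / len(scores)
```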
