Lab
A structured evaluation framework for Retrieval-Augmented Generation pipelines — measuring faithfulness, answer relevance, and context groundedness against a golden dataset.
Most RAG systems get evaluated manually, inconsistently, or not at all before they go into production. Engineers run a few test questions, the answers look reasonable, and the system ships. The first time a stakeholder gets a confidently wrong answer sourced from outdated documentation, trust in the entire system collapses.
The real problem is not the retrieval or the generation — it is the absence of a repeatable, automated evaluation loop that can gate deployments the same way unit tests gate code.
A self-contained evaluation harness that runs against any RAG pipeline output and produces a structured scorecard across three dimensions:
Faithfulness — does the generated answer make claims that are actually supported by the retrieved context? Measured by decomposing the answer into atomic claims and verifying each one against the source chunks (see the sketch after this list).
Answer relevance — does the response actually address the question that was asked, or does it answer an adjacent question that the retrieved context happened to support?
Context groundedness — are the retrieved chunks relevant to the question, or is the retrieval stage surfacing noise that forces the LLM to work around the context?
Each dimension produces a score from 0–1 with a trace of which claims or chunks passed or failed, making the result actionable rather than just a number.
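To make the three dimensions and the trace concrete, here is a minimal Python sketch of what the scorecard shape and the faithfulness scoring path could look like, assuming the LLM calls for claim decomposition and verification are passed in as plain callables. All names here (Scorecard, DimensionScore, ClaimResult, score_faithfulness) are illustrative, not the harness's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ClaimResult:
    claim: str
    supported: bool
    evidence: str = ""  # excerpt of the chunk the verdict was based on, empty if unsupported

@dataclass
class DimensionScore:
    score: float                          # 0.0 - 1.0
    trace: List[ClaimResult] = field(default_factory=list)

@dataclass
class Scorecard:
    faithfulness: DimensionScore
    answer_relevance: DimensionScore
    context_groundedness: DimensionScore

def score_faithfulness(
    answer: str,
    context_chunks: List[str],
    decompose: Callable[[str], List[str]],            # LLM call: answer -> atomic claims
    verify: Callable[[str, List[str]], ClaimResult],  # LLM call: claim + chunks -> verdict
) -> DimensionScore:
    """Faithfulness = fraction of atomic claims supported by the retrieved context."""
    claims = decompose(answer)
    results = [verify(claim, context_chunks) for claim in claims]
    supported = sum(1 for result in results if result.supported)
    score = supported / len(results) if results else 0.0
    return DimensionScore(score=score, trace=results)
```

Keeping the per-claim trace on the score object, rather than reporting only the number, is what lets a failing run point directly at the unsupported sentences in the answer.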
The harness is built as a standalone evaluation service, not embedded in the RAG pipeline itself. This was a deliberate choice — evaluation logic that lives inside the pipeline gets compromised by the same failure modes it is trying to detect.
The golden dataset is version-controlled alongside the source documents it was derived from. When source documents change, the affected golden question–answer pairs are flagged for review. This keeps the evaluation set from drifting silently.
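One way the flagging can work, sketched under the assumption that each golden record stores a content hash of the source documents it was derived from. The record layout and the flag_stale_pairs helper below are hypothetical, shown only to make the mechanism concrete.

```python
import hashlib
import json
from pathlib import Path
from typing import List

def content_hash(path: Path) -> str:
    """Stable hash of a source document's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def flag_stale_pairs(golden_path: Path, docs_root: Path) -> List[dict]:
    """Return golden QA pairs whose source documents changed since the pair was authored.

    Assumes each golden record looks like:
      {"question": ..., "answer": ...,
       "sources": [{"path": "guides/setup.md", "sha256": "<hash at authoring time>"}]}
    """
    stale = []
    for record in json.loads(golden_path.read_text()):
        for source in record["sources"]:
            if content_hash(docs_root / source["path"]) != source["sha256"]:
                stale.append(record)
                break  # one changed source is enough to flag the pair for review
    return stale
```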
An LLM-as-judge pattern is used for faithfulness scoring, with a smaller, cheaper model doing the decomposition and a larger model handling edge-case adjudication. The cost difference justifies the two-stage approach at production evaluation volume.
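A minimal sketch of the two-stage routing, assuming the cheaper judge returns a confidence alongside its verdict. The judge signatures, judge_claim, and the escalation threshold are illustrative, not the service's actual API.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signatures: the cheap model returns (verdict, confidence in [0, 1]),
# the larger model returns only a verdict and is called as rarely as possible.
CheapJudge = Callable[[str, List[str]], Tuple[bool, float]]
StrongJudge = Callable[[str, List[str]], bool]

def judge_claim(
    claim: str,
    context_chunks: List[str],
    cheap_judge: CheapJudge,
    strong_judge: StrongJudge,
    escalation_threshold: float = 0.8,  # placeholder; tune against a labelled sample
) -> bool:
    """Two-stage verdict: clear cases are settled by the cheap model, and only
    low-confidence cases are escalated to the larger model for adjudication."""
    supported, confidence = cheap_judge(claim, context_chunks)
    if confidence >= escalation_threshold:
        return supported
    return strong_judge(claim, context_chunks)
```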
Evaluation is a product problem, not just a technical one. The hardest part was not the scoring logic — it was defining what "correct" means for questions where multiple valid answers exist. Getting domain stakeholders to validate the golden dataset took longer than building the harness.
Context groundedness is the canary. In every pipeline I evaluated, context groundedness degraded first when the underlying documents changed. It is the fastest signal that something upstream is wrong.
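A simple way to operationalise the canary is a regression check of the latest context-groundedness score against a trailing baseline of nightly runs; the groundedness_regressed function, window, and tolerance below are illustrative placeholders.

```python
from statistics import mean
from typing import List

def groundedness_regressed(
    history: List[float],     # nightly context-groundedness scores, oldest first
    latest: float,
    window: int = 7,          # placeholder baseline window
    tolerance: float = 0.05,  # placeholder acceptable drop
) -> bool:
    """Flag the run when the latest score drops more than `tolerance`
    below the trailing-window baseline."""
    if len(history) < window:
        return False  # not enough history to establish a baseline yet
    baseline = mean(history[-window:])
    return (baseline - latest) > tolerance
```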
Don't conflate evaluation metrics with business outcomes. A high faithfulness score means the model is not hallucinating beyond what the context contains — but the context itself might be wrong. Evaluation should be layered with source quality checks.
Active. The harness is deployed as an internal evaluation service running on a nightly schedule against a staging RAG deployment, with results logged to a dashboard. The next iteration will add chunk-level attribution so failures can be traced back to specific document sections.
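To connect this back to the gating idea in the problem statement, here is a minimal Python sketch of how nightly scores could block promotion when they fall below thresholds; the THRESHOLDS values and the gate function are placeholders, not the deployed service's actual interface.

```python
import sys

# Placeholder thresholds; in practice these are tuned per pipeline and golden dataset.
THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevance": 0.85,
    "context_groundedness": 0.80,
}

def gate(scores: dict) -> int:
    """Return a non-zero exit code when any dimension misses its threshold,
    so the evaluation run can block promotion the way failing tests block a merge."""
    exit_code = 0
    for name, floor in THRESHOLDS.items():
        score = scores.get(name, 0.0)
        if score < floor:
            print(f"FAIL {name}: {score:.2f} < {floor:.2f}", file=sys.stderr)
            exit_code = 1
    return exit_code

if __name__ == "__main__":
    # The scores dict would normally come from the harness's scorecard output.
    sys.exit(gate({"faithfulness": 0.93, "answer_relevance": 0.88, "context_groundedness": 0.76}))
```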
Labs are proof-of-concept. If this pattern applies to a problem you have, let's talk about what a production version would look like.