Yuan's Blog

Picking Evaluation Metrics for a RAG Agent — Notes from the Trenches

I recently built a RAG agent on FinanceBench (150 Q&A items over 10-K / 10-Q filings). Picking eval metrics turned out to be much trickier than I expected. I started out thinking eval was just "run an LLM judge at the end." After getting burned a few times I realized that each stage needs its own metrics, and that the wrong metric has you optimizing in the wrong direction.

Here's how it played out, roughly in the order I made the decisions.


Stage 1: Before writing any code, decide what to measure

Before touching the codebase I split RAG eval into three layers:

  • Retrieval layer — Hit Rate, MRR, Precision@K, Recall@K, NDCG. No LLM call, free, runs in seconds. Use this every time you change a chunk size or swap an embedder.
  • Context layer — Context Precision, Context Recall. Are the retrieved chunks actually useful? Needs LLM, expensive.
  • Generation layer — Faithfulness, Answer Relevancy, Answer Correctness. How good is the answer? Faithfulness matters most — it directly measures hallucination.

The point of the split is cost vs. frequency:

  • Daily tuning → free retrieval metrics
  • Release gating → RAGAS core
  • Final reporting → full pass

Without this split, every small tweak runs the full suite and burns the API budget in days.
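
One way to make the split concrete is a small lookup from workflow trigger to metric set. This is a minimal sketch, not code from this project: the tier keys and the `metrics_for` / `needs_llm` helpers are hypothetical names, and reading "RAGAS core" as "the LLM-based context and generation metrics" is an assumption.

```python
# Metric groups, named after the three layers above.
RETRIEVAL = ["hit_rate", "mrr", "precision_at_k", "recall_at_k", "ndcg"]
CONTEXT = ["context_precision", "context_recall"]
GENERATION = ["faithfulness", "answer_relevancy", "answer_correctness"]

# Hypothetical tier table: which metrics run at which point in the workflow.
EVAL_TIERS = {
    "daily_tuning": RETRIEVAL,                         # no LLM calls
    "release_gate": CONTEXT + GENERATION,              # assumption: "RAGAS core"
    "final_report": RETRIEVAL + CONTEXT + GENERATION,  # full pass
}


def metrics_for(trigger: str) -> list:
    """Return the metric names to run for a given workflow trigger."""
    return EVAL_TIERS[trigger]


def needs_llm(trigger: str) -> bool:
    """True if any metric in the tier requires an LLM call."""
    return any(m not in RETRIEVAL for m in metrics_for(trigger))
```

With a table like this, a chunk-size experiment can call `metrics_for("daily_tuning")` and never touch the API budget.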


Stage 2: Picked 6 of RAGAS's 8 main metrics

After digging into RAGAS, I kept these 6:

  1. Faithfulness — every answer claim is supported by context (hallucination detector)
  2. Answer Correctness — composite of factual decomposition + semantic similarity
  3. Context Recall — how much of ground truth made it into context (retrieval gaps)
  4. Context Precision — are relevant docs ranked at the top (signal-to-noise)
  5. Context Entity Recall — did key entities from the truth show up in context
  6. Answer Similarity — pure embedding cosine, 0 LLM call

The two I dropped immediately:

  • Answer Relevancy — it reverse-generates N hypothetical questions from the answer and cosines them against the original. My questions are very narrow ("FY2018 CapEx for 3M?") — even a wrong answer keeps the relevant keywords, so the regenerated questions stay close to the original. Score sits near 1.0 for every row, no signal.
  • Noise Sensitivity — tests whether noisy context confuses the model. But when I lowered the retrieval threshold from default to 0.45 (more chunks = more noise), accuracy went from 50% to 80%. Noise wasn't my bottleneck, so this metric would just burn LLM calls.
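
To see why Answer Relevancy carries no signal here, it helps to look at the shape of the computation. The sketch below is an illustration under loud assumptions: a toy bag-of-words cosine stands in for a real embedding model (so absolute values will differ from RAGAS), and the regenerated questions are hypothetical examples of what an LLM might produce from a wrong answer.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevancy(question: str, regenerated: list) -> float:
    """Mean similarity between the original question and the questions
    regenerated from the answer. The answer's value never appears here."""
    q = bow(question)
    return sum(cosine(q, bow(r)) for r in regenerated) / len(regenerated)


question = "What was the FY2018 CapEx for 3M?"
# Hypothetical questions regenerated from a WRONG answer such as
# "3M's FY2018 CapEx was $9,999 million." The wrong number is gone.
from_wrong_answer = [
    "What was 3M's FY2018 CapEx?",
    "How much CapEx did 3M report for FY2018?",
]
score = answer_relevancy(question, from_wrong_answer)
```

The score depends only on how close the regenerated questions are to the original. For narrow questions, a wrong answer regenerates the same questions as a right one, so the metric cannot tell them apart.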

I thought these 6 would carry me through. They didn't.


Stage 3: After actually running them — cut 6 down to 2

Each RAGAS metric is 1-5 LLM calls per item, and a full pass on 150 items chews through both time and budget. But the bigger problem: several metrics weren't trustworthy on my data. One by one:

Dropped Answer Similarity. FinanceBench truths are short — e.g. "$1577.00" — but my agent's answers are long markdown with calculation steps. Embedding similarity systematically underscores this "short truth, long answer" setup — the score mostly reflects length difference, not correctness. A misleading metric is worse than no metric.

Dropped Context Entity Recall. Two reasons: (1) NER does poorly on numbers and abbreviations in financial tables, so substring matching produces false negatives on format differences ("$1,577M" vs "1577 million"); (2) it overlaps heavily with Context Recall — dropping it loses little.

Dropped Answer Correctness. RAGAS's default is 0.75 × factual decomposition + 0.25 × embedding similarity, and the 25% embedding part has the same problem as Answer Similarity. I tried reweighting to [1.0, 0.0] and added a calibration prompt to stop it from penalizing verbose answers. Even after that, it still scored numerical answers erratically ($302.6M vs $303M is 0.13% off but was graded 0). I gave up on this line and wrote my own domain-specific judge (Stage 5).
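
The weighting is simple arithmetic, sketched here. The function name and the component scores in the example are hypothetical; only the 0.75 / 0.25 default and the [1.0, 0.0] reweighting come from the text above.

```python
def answer_correctness(factual: float, semantic: float,
                       weights: tuple = (0.75, 0.25)) -> float:
    """Weighted average of the factual score and the embedding similarity."""
    w_fact, w_sem = weights
    return (w_fact * factual + w_sem * semantic) / (w_fact + w_sem)


# Hypothetical case: every fact is right, but the long markdown answer
# embeds far from the short truth string.
default_score = answer_correctness(factual=1.0, semantic=0.4)
reweighted = answer_correctness(factual=1.0, semantic=0.4, weights=(1.0, 0.0))

# Reweighting cannot rescue a case where the factual part itself grades
# a near-exact number as wrong.
still_zero = answer_correctness(factual=0.0, semantic=0.9, weights=(1.0, 0.0))
```

Setting the weights to (1.0, 0.0) removes the length penalty, but the score is then only as good as the factual decomposition, which is where the numerical grading went wrong.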

Dropped Context Precision. It overlaps with Context Recall in multi-page-gold settings, and the retrieval layer already covers ranking quality via Page Hit@K and MRR. Computing it again here is redundant.

Only 2 survived:

  • Faithfulness — catches hallucination. Nothing else covers this: it specifically watches for claims not in context. Especially important in finance — an agent inventing a number is worse than one admitting it doesn't know.
  • Context Recall — catches retrieval gaps. Pairs with retrieval-layer Hit@K: Hit@K tells you "is the page in there"; Context Recall tells you "did the key info from that page actually make it into the chunk" (chunking can split it out).
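
Both survivors reduce to a ratio over per-statement LLM verdicts. A minimal sketch, assuming the LLM calls have already produced one boolean per statement; the function names and the example verdicts are illustrative only.

```python
def faithfulness(claim_supported: list) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall(truth_stmt_in_context: list) -> float:
    """Fraction of ground-truth statements attributable to the context."""
    if not truth_stmt_in_context:
        return 0.0
    return sum(truth_stmt_in_context) / len(truth_stmt_in_context)


# Hypothetical verdicts: the answer makes 4 claims and one number is invented.
f = faithfulness([True, True, True, False])
# The truth has 2 statements; chunking split one of them out of the context.
r = context_recall([True, False])
```

A low `f` with a high `r` points at the generator inventing things. A low `r` points back at retrieval or chunking.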

Combined with the self-built LLM judge and numeric prefilter from Stage 5, these four pieces carry the whole eval.

Don't run metrics for "completeness." Broken metrics will mislead you; overlapping metrics just cost more. Every metric should answer "what signal do I lose if I cut it?" If you can't answer, cut it.


Stage 4: Upgraded retrieval-layer evaluation

For the first three stages I was working on the RAGAS side. My retrieval-layer metric was embarrassingly toy at first — just "does the answer string appear anywhere in the returned context." Crude substring match.

Upgraded to the IR standard trio:

  • Page Hit@K — does top-K contain at least one correct page
  • Page Recall@K — what fraction of relevant pages are retrieved
  • MRR — mean reciprocal rank of the gold pages

Why all three? Each answers a different question. Hit@K → "did we find it"; Recall@K → "for multi-page gold (23% of my dataset), did we get all of them"; MRR → "once found, what rank."
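
For reference, all three fit in a few lines. This is a sketch rather than the project's code. It assumes each query yields a ranked list of page ids plus a set of gold page ids, and it scores MRR on the first gold page found, which is one common convention for multi-page gold.

```python
def hit_at_k(ranked: list, gold: set, k: int) -> float:
    """1.0 if at least one gold page appears in the top k, else 0.0."""
    return 1.0 if any(p in gold for p in ranked[:k]) else 0.0


def recall_at_k(ranked: list, gold: set, k: int) -> float:
    """Fraction of gold pages that appear in the top k."""
    if not gold:
        return 0.0
    return len(gold & set(ranked[:k])) / len(gold)


def reciprocal_rank(ranked: list, gold: set) -> float:
    """1 / rank of the first gold page, or 0.0 if none was retrieved."""
    for rank, page in enumerate(ranked, start=1):
        if page in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list, k: int = 5) -> dict:
    """Average the three metrics over (ranked_pages, gold_pages) pairs."""
    n = len(runs)
    return {
        f"hit@{k}": sum(hit_at_k(r, g, k) for r, g in runs) / n,
        f"recall@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }


# Made-up page ids, purely for illustration.
runs = [
    ([12, 40, 7, 3, 9], {40, 41}),  # multi-page gold, one of two found at rank 2
    ([5, 6, 8, 2, 1], {8}),         # single gold page found at rank 3
    ([1, 2, 3, 4, 5], {99}),        # miss
]
scores = evaluate(runs, k=5)
```

In the first query Hit@5 is 1.0 while Recall@5 is only 0.5, which is the multi-page case that Hit@K alone hides.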

Concrete example. I iterated the retrieval stack in three rounds, adding one component each time:

Stage                                                          | Hit@5 | MRR
Baseline (voyage embed + BM25 hybrid)                          | 84.4% | 0.685
+ company filter (detect company in query, add payload filter) | 97.8% | 0.767
+ voyage rerank-2 (cross-encoder second-pass)                  | 100%  | 0.863

Looking only at MRR, you'd underrate the filter (0.685 → 0.767 looks like a modest bump). Looking only at Hit@5, you'd underrate rerank (97.8% → 100% looks like noise). Together they show the full picture: the filter fixed "candidate pool polluted by other companies" (Hit@5 up 13.4 points), and rerank fixed "found it but ranked wrong" (MRR up 0.096). Two different problems, each surfaced by a different metric.

Final test-105 numbers: Hit@5 = 98.1%, MRR = 0.821. With those, I could confidently blame end-to-end failures on the generation side instead of revisiting retrieval.


Stage 5: Built my own end-to-end judge

After a few end-to-end rounds, RAGAS's Answer Correctness was clearly unreliable on financial numerical questions:

  • "$302.6M" vs truth "$303M" (0.13% off) → scored 0
  • "Yes, 1.45x" vs truth "Yes, 1.5x" (right direction, 3.3% off) → unstable
  • Long markdown with calculation steps → occasionally penalized for verbosity

Root cause: a general-purpose judge applies one scoring rule to everything; but in finance, "direct extraction" vs. "derived calculation" need very different tolerances, and text vs. numeric questions need different rules.

So I wrote a single-call GPT-4.1 judge that first classifies the truth into 6 categories, then applies category-specific tolerance:

Truth category                     | Tolerance
pure_number (direct extraction)    | Only format variations (commas, $, unit conversion); no rounding allowed
number_with_context (derived calc) | ±2% → 1.0; ±2-10% same direction → 0.5; > ±10% → 0.0
pure_text / yesno_text             | ≥80% coverage → 1.0; 50-80% → 0.5; <50% → 0.0
yesno_* (any yes/no)               | HARD RULE: wrong direction → 0.0

The key design is classify first, then apply category-specific tolerance — way more accurate than one global threshold. A domain-specific dataset deserves a domain-specific judge.
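
The rubric itself lives in the judge prompt, but writing it out as plain code makes the thresholds easy to check. This is a sketch under assumptions: it covers only the categories named in the table, it takes the classification, the text coverage and the yes/no direction as given inputs (in the real judge the LLM produces them), and all function names are hypothetical.

```python
import math


def score_derived_number(pred: float, truth: float) -> float:
    """number_with_context: within 2% scores 1.0, within 10% scores 0.5."""
    if truth == 0:
        return 1.0 if pred == 0 else 0.0
    rel_err = abs(pred - truth) / abs(truth)
    if rel_err <= 0.02:
        return 1.0
    # Within 10% the sign cannot flip; other readings of "same direction"
    # are not modeled here.
    return 0.5 if rel_err <= 0.10 else 0.0


def score_text_coverage(coverage: float) -> float:
    """pure_text / yesno_text: thresholds at 80% and 50% coverage."""
    if coverage >= 0.8:
        return 1.0
    return 0.5 if coverage >= 0.5 else 0.0


def apply_rubric(category: str, *, pred=None, truth=None,
                 coverage=None, direction_matches=True) -> float:
    """Dispatch on truth category, then apply that category's tolerance."""
    if category.startswith("yesno_") and not direction_matches:
        return 0.0  # hard rule: wrong yes/no direction
    if category == "pure_number":
        # Assumes both values are already normalized for commas, $ and units.
        return 1.0 if math.isclose(pred, truth, rel_tol=1e-9) else 0.0
    if category == "number_with_context":
        return score_derived_number(pred, truth)
    if category in ("pure_text", "yesno_text"):
        return score_text_coverage(coverage)
    raise ValueError(f"category not covered by this sketch: {category}")
```

The same pair of numbers gets a different grade depending on category: 302.6 vs 303 scores 1.0 as a derived calculation and 0.0 as a direct extraction.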

Final correctness scores: 0.922 on the 45-item dev set, 0.871 on the 105-item test set. Shippable.

One bonus trick: numerical questions go through a deterministic prefilter first (±2% → 1.0 directly), only falling back to the LLM judge on a miss. Saves money and improves consistency — same question always gets the same score, no temperature drift between runs.
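
A prefilter along these lines can be quite small. The sketch below is an assumption about how one might build it, not the code from this project: `parse_amount` and `numeric_prefilter` are hypothetical names, the unit table is deliberately short, and it expects the agent's short final-answer string rather than the full markdown (otherwise the first number found could be a year such as FY2018).

```python
import re
from typing import Optional

_UNITS = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6,
          "b": 1e9, "billion": 1e9}
_AMOUNT = re.compile(
    r"\$?\s*(-?\d[\d,]*\.?\d*)\s*(thousand|million|billion|[kmb])?\b",
    re.IGNORECASE,
)


def parse_amount(text: str) -> Optional[float]:
    """Parse the first number in text, applying an optional unit suffix."""
    match = _AMOUNT.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    return value * _UNITS.get(unit, 1.0)


def numeric_prefilter(pred_text: str, truth_text: str,
                      tol: float = 0.02) -> Optional[float]:
    """Return 1.0 when both sides parse and agree within tol.
    Return None on a miss so the caller falls back to the LLM judge."""
    pred, truth = parse_amount(pred_text), parse_amount(truth_text)
    if pred is None or truth is None or truth == 0:
        return None
    if abs(pred - truth) / abs(truth) <= tol:
        return 1.0
    return None
```

Returning `None` instead of 0.0 on a miss keeps the prefilter from overruling the judge: it only ever fast-tracks clear passes, and everything else still gets the category-aware treatment.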


Takeaways

  1. End-to-end first. Nail "is the final answer correct" before optimizing component-level metrics. Otherwise you'll tune Faithfulness for days and end-to-end won't budge.

  2. Decouple by layer. Different metrics for retrieval vs. generation. When something breaks, you immediately know which layer's at fault.

  3. Few but sharp. RAGAS's 8 main metrics → 2 (Faithfulness + Context Recall). The cuts were either no-signal (Answer Relevancy, Noise Sensitivity), broken in a "short truth, long answer" setup (Answer Similarity, Answer Correctness), or redundant (Context Entity Recall, Context Precision).

  4. Cost-aware. Daily iteration on free metrics (Hit@K / MRR), only burn LLM judges at gating points.

The biggest realization: metrics aren't about coverage, they're about giving you a decision at each stage. Late-stage metrics used early (LLM judge to pick an embedder) → slow and expensive. Early-stage metrics used late (Hit@K to grade an agent) → blind to the real problem.