Picking Evaluation Metrics for a RAG Agent — Notes from the Trenches
I recently built a RAG agent on FinanceBench (150 Q&A items over 10-K / 10-Q filings). Picking eval metrics turned out way trickier than I expected. I started out thinking eval was just "run an LLM judge at the end." After getting burned a few times I realized that each stage needs its own metrics, and the wrong choice pushes you to optimize in the wrong direction.
Here's how it played out, roughly in the order I made the decisions.
Stage 1: Before writing any code, decide what to measure
Before touching the codebase I split RAG eval into three layers:
- Retrieval layer — Hit Rate, MRR, Precision@K, Recall@K, NDCG. No LLM call, free, runs in seconds. Use this every time you change a chunk size or swap an embedder.
- Context layer — Context Precision, Context Recall. Are the retrieved chunks actually useful? Needs LLM, expensive.
- Generation layer — Faithfulness, Answer Relevancy, Answer Correctness. How good is the answer? Faithfulness matters most — it directly measures hallucination.
The point of the split is cost vs. frequency:
- Daily tuning → free retrieval metrics
- Release gating → RAGAS core
- Final reporting → full pass
Without this split, every small tweak runs the full suite and burns the API budget in days.
Stage 2: Picked 6 of RAGAS's 8 main metrics
After going deep into RAGAS, these are the six I kept:
- Faithfulness — every answer claim is supported by context (hallucination detector)
- Answer Correctness — composite of factual decomposition + semantic similarity
- Context Recall — how much of ground truth made it into context (retrieval gaps)
- Context Precision — are relevant docs ranked at the top (signal-to-noise)
- Context Entity Recall — did key entities from the truth show up in context
- Answer Similarity — pure embedding cosine, 0 LLM call
The two I dropped immediately:
- Answer Relevancy — it reverse-generates N hypothetical questions from the answer and cosines them against the original. My questions are very narrow ("FY2018 CapEx for 3M?") — even a wrong answer keeps the relevant keywords, so the regenerated questions stay close to the original. Score sits near 1.0 for every row, no signal.
- Noise Sensitivity — tests whether noisy context confuses the model. But when I lowered the retrieval threshold from default to 0.45 (more chunks = more noise), accuracy went from 50% to 80%. Noise wasn't my bottleneck, so this metric would just burn LLM calls.
I thought these 6 would carry me through. They didn't.
Stage 3: After actually running them — cut 6 down to 2
Most of these RAGAS metrics cost 1-5 LLM calls per item (Answer Similarity is the exception, at zero), and a full pass on 150 items chews through both time and budget. But the bigger problem: several metrics weren't trustworthy on my data. One by one:
Dropped Answer Similarity. FinanceBench truths are short — e.g. "$1577.00" — but my agent's answers are long markdown with calculation steps. Embedding similarity systematically underscores this "short truth, long answer" setup — the score mostly reflects length difference, not correctness. A misleading metric is worse than no metric.
Dropped Context Entity Recall. Two reasons: (1) NER does poorly on numbers and abbreviations in financial tables, so substring matching produces false negatives on format differences ("$1,577M" vs "1577 million"); (2) it overlaps heavily with Context Recall — dropping it loses little.
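To make the format-difference false negative concrete, here's a minimal sketch of what a matcher would need to do before comparing amounts. This is my own illustration, not anything RAGAS does: `parse_amount` and the unit table are hypothetical, and the parser only handles a single amount with an optional `$` and unit suffix.

```python
import re
from typing import Optional

# Hypothetical unit table; extend it for whatever your filings actually use.
_UNITS = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "mm": 1e6, "million": 1e6,
    "b": 1e9, "bn": 1e9, "billion": 1e9,
}

# One amount: optional "$", digits with optional commas/decimals, optional unit word.
_AMOUNT = re.compile(r"^\s*\$?\s*(-?\d[\d,]*(?:\.\d+)?)\s*([a-zA-Z]*)\s*$")


def parse_amount(text: str) -> Optional[float]:
    """Return the amount in base units, or None if the string doesn't parse."""
    match = _AMOUNT.match(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    if unit == "":
        return number
    if unit not in _UNITS:
        return None  # unknown suffix: refuse to guess
    return number * _UNITS[unit]


# Plain substring matching sees two different strings here.
print("$1,577M" == "1577 million")                              # False
# After normalization they are the same amount.
print(parse_amount("$1,577M") == parse_amount("1577 million"))  # True
```

Anything the parser can't read comes back as `None` rather than a guess, which is the behavior I'd want from a matcher that feeds a metric.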
Dropped Answer Correctness. RAGAS's default is 0.75 × factual decomposition + 0.25 × embedding similarity — the 25% embedding part has the same problem as Answer Similarity. I tried reweighting to [1.0, 0.0] and added a calibration prompt to stop it from penalizing verbose answers. After all that fiddling it still scored numerical answers chaotically ($302.6M vs $303M = 0.13% off but graded 0). Gave up on this line and wrote my own domain-specific judge (next stage).
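Here's a rough sketch of the weighted blend described above, just to show how the embedding term drags down a factually perfect answer. It is not RAGAS's own code: the function name is hypothetical, and the sub-scores (1.0 factual, 0.4 similarity) are made-up values for a long answer graded against a short truth.

```python
def blended_correctness(factual: float, similarity: float,
                        weights=(0.75, 0.25)) -> float:
    """Weighted average of a factual score and an embedding-similarity score."""
    w_factual, w_similarity = weights
    total = w_factual + w_similarity
    return (w_factual * factual + w_similarity * similarity) / total


# Illustrative inputs only: every claim is right, but the similarity is low
# because the answer is much longer than the truth.
default_score = blended_correctness(factual=1.0, similarity=0.4)
reweighted = blended_correctness(factual=1.0, similarity=0.4, weights=(1.0, 0.0))

print(round(default_score, 2))  # 0.85
print(round(reweighted, 2))     # 1.0
```

Reweighting to [1.0, 0.0] removes the length penalty, but it does nothing for the factual half of the score, which is where the numeric grading went wrong for me.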
Dropped Context Precision. Overlaps with Context Recall in multi-page-gold settings, and the retrieval layer already covers signal-to-noise via Page Hit@K. Computing it again here is redundant.
Only 2 survived:
- Faithfulness — catches hallucination. Nothing else covers this: it specifically watches for claims not in context. Especially important in finance — an agent inventing a number is worse than one admitting it doesn't know.
- Context Recall — catches retrieval gaps. Pairs with retrieval-layer Hit@K: Hit@K tells you "is the page in there"; Context Recall tells you "did the key info from that page actually make it into the chunk" (chunking can split it out).
Combined with the self-built LLM judge + numeric prefilter from the next stage, four things support the whole eval.
Don't run metrics for "completeness." Broken metrics will mislead you; overlapping metrics just cost more. Every metric should answer "what signal do I lose if I cut it?" If you can't answer, cut it.
Stage 4: Upgraded retrieval-layer evaluation
For the first three stages I was working on the RAGAS side. My retrieval-layer metric was a toy at first: just "does the answer string appear anywhere in the returned context." Crude substring match.
Upgraded to the IR standard trio:
- Page Hit@K — does top-K contain at least one correct page
- Page Recall@K — what fraction of relevant pages are retrieved
- MRR — mean reciprocal rank of the gold pages
Why all three? Each answers a different question. Hit@K → "did we find it"; Recall@K → "for multi-page gold (23% of my dataset), did we get all of them"; MRR → "once found, what rank."
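The three metrics are small enough to sketch in a few lines. This is a minimal version under my own assumptions: pages are identified by integer IDs, each query comes as a `(ranked_pages, gold_pages)` pair, and the reciprocal rank uses the first gold page found anywhere in the ranked list (no cutoff at K). The function names are illustrative.

```python
from typing import Dict, List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[int], gold: Set[int], k: int) -> float:
    """1.0 if any gold page appears in the top-k results, else 0.0."""
    return 1.0 if any(page in gold for page in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[int], gold: Set[int], k: int) -> float:
    """Fraction of gold pages that appear in the top-k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked: Sequence[int], gold: Set[int]) -> float:
    """1 / rank of the first gold page, or 0.0 if none was retrieved."""
    for rank, page in enumerate(ranked, start=1):
        if page in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[int], Set[int]]], k: int = 5) -> Dict[str, float]:
    """Average the three metrics over all queries."""
    n = len(runs)
    return {
        "hit@k": sum(hit_at_k(r, g, k) for r, g in runs) / n,
        "recall@k": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }


# Two toy queries: the first has two gold pages and only one is retrieved.
runs = [([12, 7, 40], {7, 41}), ([3, 5, 9], {3})]
print(evaluate(runs, k=2))  # {'hit@k': 1.0, 'recall@k': 0.75, 'mrr': 0.75}
```

The first toy query is the multi-page case: Hit@K is a perfect 1.0 even though half the gold pages are missing, and only Recall@K shows the gap.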
Concrete example. I iterated the retrieval stack in three rounds, adding one component each time:
| Stage | Hit@5 | MRR |
|---|---|---|
| Baseline (voyage embed + BM25 hybrid) | 84.4% | 0.685 |
| + company filter (detect company in query, add payload filter) | 97.8% | 0.767 |
| + voyage rerank-2 (cross-encoder second-pass) | 100% | 0.863 |
Looking only at MRR, you'd think the filter did nothing (0.685 → 0.767 looks like noise). Looking only at Hit@5, you'd think rerank did nothing (97.8% → 100% looks like noise). Together they show the full picture: filter solved "candidate pool polluted by other companies"; rerank solved "found it but ranked wrong" — two different problems, each surfaced by a different metric.
Final test-105 numbers: Hit@5 = 98.1%, MRR = 0.821. With those, I could confidently blame end-to-end failures on the generation side instead of revisiting retrieval.
Stage 5: Built my own end-to-end judge
After a few end-to-end rounds, RAGAS's Answer Correctness was clearly unreliable on financial numerical questions:
"$302.6M"vs truth"$303M"(0.13% off) → scored 0"Yes, 1.45x"vs truth"Yes, 1.5x"(right direction, 3.3% off) → unstable- Long markdown with calculation steps → occasionally penalized for verbosity
Root cause: a general-purpose judge applies one scoring rule to everything; but in finance, "direct extraction" vs. "derived calculation" need very different tolerances, and text vs. numeric questions need different rules.
So I wrote a single-call GPT-4.1 judge that first classifies the truth into 6 categories, then applies category-specific tolerance:
| Truth category | Tolerance |
|---|---|
| pure_number (direct extraction) | Only format variations (commas, $, unit conversion) — no rounding allowed |
| number_with_context (derived calc) | ±2% → 1.0; ±2-10% same direction → 0.5; > ±10% → 0.0 |
| pure_text / yesno_text | ≥80% coverage → 1.0; 50-80% → 0.5; <50% → 0.0 |
| yesno_* (any yes/no) | HARD RULE: wrong direction → 0.0 |
The key design is classify first, then apply category-specific tolerance — way more accurate than one global threshold. A domain-specific dataset deserves a domain-specific judge.
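For the number_with_context row, the tiers can be written down as a plain function. This is a sketch of the rule in the table, not the judge itself (the real thing is a prompt, not code). The function name is hypothetical, and `same_direction` is a flag I'm assuming the caller supplies, since the table doesn't spell out how direction is determined.

```python
def score_derived_number(predicted: float, truth: float,
                         same_direction: bool = True) -> float:
    """Tiered tolerance: within 2% -> 1.0, within 10% and same direction -> 0.5."""
    if truth == 0:
        # Relative error is undefined at zero; only an exact match passes.
        return 1.0 if predicted == 0 else 0.0
    rel_err = abs(predicted - truth) / abs(truth)
    if rel_err <= 0.02:
        return 1.0
    if rel_err <= 0.10 and same_direction:
        return 0.5
    return 0.0


# Numbers from the examples above, treated as derived calculations.
print(score_derived_number(302.6, 303.0))  # 1.0  (about 0.13% off)
print(score_derived_number(1.45, 1.5))     # 0.5  (about 3.3% off)
print(score_derived_number(250.0, 303.0))  # 0.0  (about 17% off)
```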
Final scores: dev 45 correctness = 0.922, test 105 = 0.871. Ship-able.
One bonus trick: numerical questions go through a deterministic prefilter first (±2% → 1.0 directly), only falling back to the LLM judge on a miss. Saves money and improves consistency — same question always gets the same score, no temperature drift between runs.
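A minimal sketch of what such a prefilter could look like, under my own simplifying assumptions: the truth contains exactly one number, units are ignored (so "$1.577 billion" against "1577" is a miss and goes to the judge), and any number in the answer within tolerance counts as a match. The names `extract_numbers` and `numeric_prefilter` are illustrative. Returning `None` means "fall back to the LLM judge."

```python
import re
from typing import List, Optional

# Digits with optional thousands separators and decimals, e.g. "1,577.00".
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> List[float]:
    """Pull every numeric literal out of a string, ignoring commas."""
    return [float(token.replace(",", "")) for token in _NUMBER.findall(text)]


def numeric_prefilter(answer: str, truth: str, tol: float = 0.02) -> Optional[float]:
    """Return 1.0 on a deterministic match, None to defer to the LLM judge."""
    truth_numbers = extract_numbers(truth)
    if len(truth_numbers) != 1:
        return None  # text truth or multi-number truth: not our job
    target = truth_numbers[0]
    for value in extract_numbers(answer):
        if target == 0:
            if value == 0:
                return 1.0
        elif abs(value - target) / abs(target) <= tol:
            return 1.0
    return None  # a miss is not a 0.0; the judge decides


print(numeric_prefilter("CapEx was $1,577 million in FY2018.", "$1577.00"))  # 1.0
print(numeric_prefilter("About $1,200 million.", "$1577.00"))               # None
```

Note that the filter only ever returns 1.0 or `None`. It never assigns a failing score on its own, so a formatting quirk it can't parse costs an LLM call rather than a wrong grade.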
Takeaways
End-to-end first. Nail "is the final answer correct" before optimizing component-level metrics. Otherwise you'll tune Faithfulness for days and end-to-end won't budge.
Decouple by layer. Different metrics for retrieval vs. generation. When something breaks, you immediately know which layer's at fault.
Few but sharp. RAGAS's 8 main metrics → 2 (Faithfulness + Context Recall). The cuts were either no-signal (Answer Relevancy, Noise Sensitivity), broken in a "short truth, long answer" setup (Answer Similarity, Answer Correctness), or redundant (Context Entity Recall, Context Precision).
Cost-aware. Daily iteration on free metrics (Hit@K / MRR), only burn LLM judges at gating points.
The biggest realization: metrics aren't about coverage, they're about giving you a decision at each stage. Late-stage metrics used early (LLM judge to pick an embedder) → slow and expensive. Early-stage metrics used late (Hit@K to grade an agent) → blind to the real problem.