Yuan's Blog

Financial RAG Agent Optimization: Methods, Cases, and Data

Built an agentic RAG system for financial Q&A on FinanceBench. Once the baseline hit 0.87, the interesting work began: pushing it higher.

Project Setup

Dataset: FinanceBench — 150 questions across 168 page-level 10-K / 10-Q / 8-K / earnings release PDFs. Question types range from fact lookup ("3M's FY2018 capex") to interpretation ("Is Pfizer spinning off Upjohn?").

Baseline pipeline:

  • Retrieval: voyage-finance-2 dense (1024d) + Qdrant BM25 sparse + RRF fusion + company-aware payload filter + voyage rerank-2
  • Agent: LangGraph multi-step, Claude sonnet-4-6
  • Eval: custom GPT-4.1 strict judge (0 / 0.5 / 1.0, category-calibrated for 6 FinanceBench truth types). Not RAGAS auto-correctness: RAGAS is too rigid on numeric precision (agent returns PDF-raw $302.6M vs truth $303M; RAGAS penalizes, but the agent is correct).

Results:

  • dev (45): 0.922 (41.5/45)
  • test (105): 0.871 (91.5/105)
  • 18 failures total (16 at 0.0, 2 at 0.5)

A single score hides the shape of failure. I classified each of the 18 failures and applied 4 fixes (P1 / P0 v3 / P3 / P4), recovering 10 and pushing test to ~0.919. The remaining 8 are unresolved, each with a documented reason.


18 Failures in 6 Categories

| Class | Meaning | Count | Representative |
|---|---|---|---|
| A Output discipline | Agent behavior: refusal / over-reasoning / missed qualifier / missed alt-definition | 11 | 00540 AES refuses to compute inventory turnover |
| B Judge over-strict | Agent gives PDF-raw value; truth is rounded; strict judge sees literal mismatch | 3 | 04171 MGM AP $302.6M vs truth $303M |
| C Dataset truth error | Truth itself is wrong | 1 | 02419 Pfizer: agent "No" is factually correct, truth wrong |
| D Question self-contradiction | Question text contradicts itself | 1 | 01328 PEP: text says "if not outlined state 0" but truth is $411M from notes |
| E Retrieval gap | Vocabulary gap / chunk pollution | 1 | 02119 JPM TBVPS |
| F Period / semantic | Calendar-vs-fiscal mismatch / "best by what" interpretation | 2 | 00460 BBY agent calendar 930→907 vs truth fiscal 982→969 |

Key insight: 11/18 (61%) are agent behavior, not retrieval failures. With retrieval at dev Hit@5 = 100% / test Hit@5 = 98%, the bottleneck is no longer recall — it's how the agent uses what it retrieved.
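
For readers unfamiliar with the metric, Hit@5 is usually read as the share of questions whose gold evidence appears anywhere in the top 5 retrieved results. A minimal sketch under that assumption; the function names and the (retrieved ids, gold ids) input shape are hypothetical, not the project's actual eval code:

```python
def hit_at_k(retrieved, gold, k=5):
    """Return 1.0 if any gold id appears in the top-k retrieved ids, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(gold) else 0.0


def mean_hit_at_k(runs, k=5):
    """Average Hit@k over a list of (retrieved_ids, gold_ids) pairs."""
    if not runs:
        return 0.0
    return sum(hit_at_k(r, g, k) for r, g in runs) / len(runs)
```

A score near 100% on this metric only says the right page was available to the agent; it says nothing about whether the agent used it, which is the gap the classification above exposes.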

This drove fix priority: cheap-but-impactful first (P1, judge-side) → agent behavior (P0 v3 + P4) → retrieval hardening (P3).


Fix 1: Suspect the Judge Before the Agent (3 questions recovered)

Trigger: 04171 MGM Accounts Payable

Question: FY2018 year-end accounts payable for MGM Resorts (USD millions).
Truth:    $303.00

Agent's answer:

The year-end FY2018 accounts payable... was $302,578 thousand, or approximately $302.6 million.

Strict judge:

Answer gives $302.6M, doesn't match truth $303.00M; strict extraction requires exact match → 0.0

The agent isn't wrong — $302.6M is the PDF-raw figure; $303M is the rounded truth. 0.13% gap.

Same pattern: 03473 KO ROA (0.01425 vs 0.01), 04980 PEP capex ($4.625B vs $4.60B).

Fix: deterministic ±2% prefilter before the LLM judge

No agent changes. Just a pre-step in bench/judge.py: for pure_number truths, run a two-stage deterministic check (relative tolerance + round-to-truth-precision); pass → 1.0 directly.

def _judge_numeric_prefilter(truth, answer, tol=0.02):
    """Two-stage check for bare-number truths:
      1. ±tol relative match (handles $302.6 ≈ $303)
      2. Round answer to truth's effective precision (handles 0.01425 → 0.01)
    Also tries ×1000 unit shifts (million/billion)."""
    truth_nums = _extract_numbers(truth)
    answer_nums = _extract_numbers(answer)
    decimals = _truth_effective_decimals(truth)
    for tv in truth_nums:
        for av in answer_nums:
            for factor in (1.0, 1000.0, 1.0/1000.0):
                scaled = av * factor
                # Skip the relative check when tv == 0 to avoid ZeroDivisionError
                if tv != 0 and abs(scaled / tv - 1.0) <= tol:
                    return 1.0, "match"
                if decimals >= 0 and round(scaled, decimals) == round(tv, decimals):
                    return 1.0, "round-match"
    return 0.0, "no match"

Two safety constraints:

  1. Only fires on bare-number truths — regex-gated; truths with yes/no, labels, or directional words fall through to the LLM judge
  2. ±2% calibrated on dev set — wider tolerance (e.g. ±10%) would let wrong-magnitude answers through

Results

| id | Truth | Agent | Pre | Post |
|---|---|---|---|---|
| 03473 KO ROA | 0.01 | 0.01425 | 0.0 | 1.0 |
| 04171 MGM AP | $303M | $302.6M | 0.0 | 1.0 |
| 04980 PEP capex | $4.60B | $4.625B | 0.0 | 1.0 |

Takeaway

3 questions, 0→1.0, in ~30 min of judge-side code. Zero agent rerun. Sanity-check the evaluator before tuning the agent — instinct says "fix the agent to output rounder numbers"; wasted effort, since the agent is already correct.


Fix 2: Three Agent Pathologies — Reflection + Anti-Refusal + Vocab Expansion (4 recovered + 2 partial)

Three patterns across the 11 Class A failures

  1. Missed question qualifier: question asks "in the future"; agent returns total
  2. Hedge-clause premature exit: question offers "if X not meaningful, explain why"; agent takes the exit without trying
  3. Missed alt-definition: truth uses "operating WC"; agent uses "total WC" — both valid GAAP, agent should dual-answer

One prompt rule can't catch all three. Added three structural constraints.

Change 1: Answer Quality Rules in orchestrator prompt

Added to get_orchestrator_prompt() in project/rag_agent/prompts.py:

Answer Quality Rules:
- Honor question qualifiers exactly. If the question targets a specific subset
  (e.g., "in the future", "remaining", "year-end", "average", "net of X"),
  return that exact quantity — NOT a related total or sibling figure.

- Perform arithmetic when needed. If retrieved data gives components but not
  the final number, DO the math with one short line of working. Example:
    Q asks "remaining" + data shows total=$700M, "90% incurred"
      → answer "remaining = $700M × (100% − 90%) = $70M"

- Attempt calculation before invoking hedge clauses. If a question offers
  an exit ("if X is not meaningful, explain why"), use it ONLY when underlying
  data is genuinely absent. If inputs are present, compute first; optionally
  add a caveat. NEVER refuse a calculation because the metric is unconventional
  for the industry — when the line items are visible in retrieved chunks.

Change 2: Standalone anti_refusal_check node

Prompt rules alone don't catch every refusal. Added a LangGraph node that activates when the draft answer matches refusal patterns ("cannot determine", "data is missing", "insufficient context"). The reviewer is evidence-grounded:

def get_anti_refusal_prompt():
    return """You are a strict evidence-grounded reviewer.

A retrieval agent produced a DRAFT answer that REFUSES the user's question.
Verify whether the refusal is correct.

CRITICAL RULES:
1. Use ONLY information explicitly stated in RETRIEVED CONTEXTS.
2. Do NOT speculate, infer, or fabricate numbers.
3. If contexts contain direct evidence — even partial:
   → Rewrite using that evidence. Cite the source chunk.
   → Use qualifiers ("approximately", "based on available figures").
   → DO NOT refuse; an evidence-backed approximate answer beats refusal.
4. If contexts do NOT contain direct evidence:
   → Confirm the refusal, output draft UNCHANGED.
   → Refusal IS correct when data is genuinely missing — do not invent.
"""

Key design: a prompt-constrained reviewer. It can only rewrite from retrieved contexts, not invent — avoiding the "fix refusal by hallucinating" failure mode.
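
The activation condition can be sketched as a plain routing function. The three phrases come from the description above; the regex form, the state key draft_answer, and the node name finalize are hypothetical.

```python
import re

# Refusal phrases quoted in the post; matching them by regex is an assumption.
REFUSAL_PATTERNS = re.compile(
    r"cannot determine|data is missing|insufficient context",
    re.IGNORECASE,
)


def looks_like_refusal(draft: str) -> bool:
    """True when the draft contains one of the known refusal phrases."""
    return bool(REFUSAL_PATTERNS.search(draft))


def route_after_draft(state: dict) -> str:
    """Pick the next node name: review refusals, pass everything else through."""
    if looks_like_refusal(state.get("draft_answer", "")):
        return "anti_refusal_check"
    return "finalize"
```

A function of this shape (state in, node name out) is what a LangGraph conditional edge expects, so the reviewer only costs an extra LLM call on drafts that actually refuse.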

Change 3: rewrite_query adds ambiguity_detected + retrieval_variants

Two new output fields on the query rewrite step:

  • ambiguity_detected + ambiguous_terms: flag terms like "working capital" (operating vs total), "EBITDA" (GAAP vs adjusted), "best by X" so the agent knows to dual-answer
  • retrieval_variants: translate colloquial query terms ("residual value to equity holders") into the line-item vocabulary filings actually use
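
One way to picture the extended rewrite output is as a small record. The three field names below follow the post; rewritten_query, the class name, and the all_queries helper are hypothetical additions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RewriteResult:
    """Hypothetical shape of the query-rewrite output."""
    rewritten_query: str
    ambiguity_detected: bool = False
    ambiguous_terms: list[str] = field(default_factory=list)
    retrieval_variants: list[str] = field(default_factory=list)

    def all_queries(self) -> list[str]:
        """Rewritten query first, then variants, de-duplicated in order."""
        seen, out = set(), []
        for q in [self.rewritten_query, *self.retrieval_variants]:
            if q not in seen:
                seen.add(q)
                out.append(q)
        return out
```

Keeping ambiguity and vocabulary expansion as separate fields matters downstream: the first changes how the agent answers, the second only changes what gets retrieved.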

Hero case: 01911 MGM Interest Coverage

Question: MGM's FY2022 interest coverage ratio using Adjusted EBIT / Interest Expense.
Truth:    Adjusted EBIT is negative; coverage ratio is zero.

Before:

The earnings release does not present a standalone Adjusted EBIT line item. The closest proxy is taking Operating Income and adding... [computes 2.4x]

Agent uses OpInc as a proxy → 2.4x. Truth is 0. Direction wrong → 0.0.

After P0 v3:

Adjusted EBIT, when properly reconciled from Adjusted EBITDAR through D&A and rent, is negative for FY2022. The coverage ratio is functionally zero or undefined.

Judge:

Correctly identifies Adjusted EBIT is negative, coverage ratio zero → 1.0

The agent learned to accept "negative" as a valid answer instead of forcing a cleaner-looking proxy.

Other P0 recoveries

| id | Pre | Post | How |
|---|---|---|---|
| 00222 AMD QR | 0.5 | 1.0 | Baseline gives standard QR 1.73; P0 triggers alt-definition → recomputes with related-party AR → 1.57 = truth |
| 00605 Ulta repurchase | 0.0 | 1.0 | Baseline computes 36.5% but refuses due to FY-naming doubt; P0 forces "compute first, caveat second" → 36.5% with FY caveat |
| 00606 Ulta wages | 0.0 | 0.0 → 1.0 | P0 alone insufficient; +P3 multi-query unlocks it (see Fix 3) |

Partial recoveries

00005 Corning WC (0.5 → 0.5): truth uses operating WC = $831M; agent uses total WC = $2,278M. I expected P0 reflection to trigger dual answer covering both — but agent output two numbers ($2,278M + $2,821M), both total-WC variants, missing the operating-WC formula entirely. Reflection prompt fired, but the agent didn't surface the right alt-definition. Prompt-based reflection is best-effort, not guaranteed.

00540 AES inventory turnover (0.0 → P0 = 0.0 → +P3 = 0.5): see Fix 3.

Takeaway

  • P0 v3 recovered 4 fully + 2 partial — not the 9-10 the analysis doc predicted. Prompt-based interventions are hard to predict.
  • A standalone review node beats prompt rules: the former is evidence-grounded second-pass; the latter relies on the LLM self-policing.
  • Next iteration: bake ambiguity_detected into a structured output schema so the agent is forced to fill dual-answer fields, not just asked to.

Fix 3: Translate Conversational Queries into Filing Vocabulary (a critical co-fix)

Trigger: cases where P0 alone can't save retrieval-poor queries

00606 (Ulta wages) still refuses after P0: "SG&A data is present, but specific wages breakdown not retrieved — cannot make a directional call."

The problem isn't agent judgment — retrieved contexts don't contain SG&A breakdown. A single query "Ulta FY2023 wages as % of net sales" returns SG&A totals, not store-payroll details.

Change: retrieval_variants in query rewrite

Retrieval vocabulary expansion (additional output):
- If the question uses everyday/conceptual vocabulary that SEC filings
  likely express with different terminology, output 1-2 "retrieval variants":
  semantically equivalent rewrites using financial-document vocabulary.
- Principle: filings use accounting line-item language, not business intuition.
  When the question uses an analytical concept ("residual value", "capital
  intensity"), translate to the line items that appear in income statement /
  balance sheet / cash flow / segment notes.
- Be conservative: when the question already uses standard line items
  ("net income", "operating cash flow"), no variants needed.

The original query and each variant run independent retrievals; results are RRF-merged.
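
A minimal sketch of that merge step, assuming each retrieval returns an ordered list of chunk ids. Reciprocal rank fusion scores a chunk by summing 1 / (k + rank) over every list it appears in; the constant k=60 and the top_n cutoff are conventional defaults chosen here, not values taken from the project.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60, top_n=10):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]
```

Because RRF uses ranks rather than raw scores, a chunk surfaced only by a variant query can still reach the merged top results without any score normalization across queries.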

Hero case: 00606 Ulta wages (P0 + P3 synergy)

Before (and after P0 alone): refusal — "comparison cannot be completed."

After P0 + P3: rewrite outputs variant "Ulta FY2023 SG&A store payroll deleverage components" → retrieves SG&A breakdown → agent answers "store payroll deleveraged in FY2023, meaning wages as % of net sales increased." → matches truth direction → 1.0.

Co-fix observation: 00540 AES

Same pattern. P0 alone = 0.0. With P3, rewrite splits the ratio into separate retrievals ("AES FY2022 cost of sales" + "AES FY2022 inventory") → income statement line "Total cost of sales: $10,069M" + two-year inventory → agent computes 9.5x = $10,069M / $1,055M.

Judge:

Answer states 12.1x (average inventory) and 9.5x (ending inventory); truth is 9.5x. Answer highlights 12.1x as primary, so one number correct but not as main assertion → 0.5

Partial credit. Agent treated ending inventory as secondary; truth used ending. Fixes have synergy — single-fix verify subsets don't reveal stacking effects.
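
The two bases the judge mentions differ only in the denominator. A small sketch of both, run on the ending-inventory figures quoted above; the prior-year inventory behind the 12.1x figure is not given in this post, so the average-basis path is left unexercised.

```python
def inventory_turnover(cost_of_sales, ending_inventory, prior_inventory=None):
    """Turnover on ending inventory, or on the two-year average if prior is given."""
    if prior_inventory is None:
        denominator = ending_inventory
    else:
        denominator = (ending_inventory + prior_inventory) / 2
    if denominator == 0:
        raise ValueError("inventory is zero; turnover is undefined")
    return cost_of_sales / denominator


# Figures from the AES case above (USD millions).
ending_basis = inventory_turnover(10_069, 1_055)
print(f"{ending_basis:.1f}x")  # 9.5x
```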

Takeaway

P3 design principle: filings use accounting language; users use intuition language — translate the gap. Stay conservative: when the question already uses line-item names, don't expand (introduces noise).


Fix 4: Qualifier Ambiguity — Saving One Question, Preventing a Failure Class

Trigger: qualifier ambiguity outside P0 v3 coverage

01902 Best Buy "best USA category by top line" exposed a root failure mode P0 v3 missed. P0 v3's ambiguity_detected covered terminology ambiguity (working capital / EBITDA / fiscal-vs-calendar FY) but not qualifier ambiguity: superlatives without a specified axis.

Question: Best Buy product category that performed best (by top line) in
          domestic USA market during Q2 FY2024.
Truth:    Entertainment +9% growth (gaming-driven).
Agent baseline: Computing & Mobile Phones (revenue absolute, ~$3.6B).
  • Interpretation A: "top line" = revenue absolute → Computing $14B
  • Interpretation B: "top line" = revenue growth → Entertainment +9%

The same root pattern affects any "best / largest / top / leading / primary / key / main" question without explicit axis — a structural fix, not a single-case patch.

Change: 5-layer fix (commit 12e51c9)

A single prompt rule isn't enough — even after detecting qualifier ambiguity, the agent will "silently collapse" to one axis at answer time. Needs end-to-end changes from query rewrite → orchestrator → fan-out → aggregator:

Layer (file): change

  • A (prompts.py: get_rewrite_query_prompt): Split ambiguity into (a) terminology vs (b) qualifier; (b) is non-conservative on superlatives; output format <term> — axes: A | B
  • B (nodes.py: orchestrator): Inject AMBIGUITY NOTE with 3 MUST clauses: cover each axis / enumerate per-axis values / never silently collapse
  • C (prompts.py: get_orchestrator_prompt): "Honor ambiguity notes" rule; survives context compression
  • D (prompts.py rule 4): Axis-split fan-out: one rewritten question per axis (max 3), each explicitly naming the axis
  • D' (nodes.py: aggregate_answers + get_aggregation_prompt): Pass ambiguousTerms to aggregator; rule 8 enforces By <axis A>: ...; By <axis B>: ... format

For 01902, fan-out splits into:

  • "Best Buy USA Q2 FY2024 best category by revenue absolute"
  • "Best Buy USA Q2 FY2024 best category by revenue growth"

LangGraph Send() routes each to an independent subgraph → independent retrieval → aggregator stitches into "By revenue: Computing & Mobile Phones $14B; By growth: Entertainment +9% gaming-driven."
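
The fan-out and aggregation steps can be sketched without LangGraph at all. The code below parses the ambiguity note format described for layer A, emits one question per axis, and stitches answers into the "By <axis>: ..." format; all function names are hypothetical, and the real pipeline dispatches the per-axis questions to subgraphs rather than returning a list.

```python
MAX_AXES = 3  # the post caps fan-out at three axes


def parse_ambiguity(note: str) -> tuple[str, list[str]]:
    """Parse a '<term> <dash> axes: A | B' note into (term, [axes])."""
    term, _, axes_part = note.partition("axes:")
    term = term.rstrip(" \u2014-").strip()  # drop the dash separator
    axes = [a.strip() for a in axes_part.split("|") if a.strip()]
    return term, axes[:MAX_AXES]


def fan_out(base_question: str, note: str) -> list[str]:
    """One rewritten question per axis, each naming its axis explicitly."""
    _, axes = parse_ambiguity(note)
    return [f"{base_question} by {axis}" for axis in axes]


def aggregate(per_axis_answers: dict[str, str]) -> str:
    """Stitch per-axis answers into 'By <axis A>: ...; By <axis B>: ...'."""
    return "; ".join(f"By {axis}: {ans}" for axis, ans in per_axis_answers.items())
```

Naming the axis inside each rewritten question is the point of the design: the sub-agent never sees the ambiguous superlative, so it has nothing to collapse.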

Verification (trace-level, 4 cases)

| id | Effect | Status |
|---|---|---|
| 01902 BBY best-category | dual answer (revenue + growth) | fully recovered |
| 00460 BBY store change | covers fiscal 982→969 + calendar 930→907 | ⚠️ trace improved; final judge credit depends on dual-hedge acceptance |
| 00005 Corning WC | dual (narrow + operating) | ⚠️ known boundary: agent operating = AR+Inv−AP = $2.8B, not truth's (CA−cash)−(CL−ShortTermDebt) = $831M; sub-formula enumeration out of scope |
| 00222 AMD QR | dual (1.57 + 0.92 cash) | ✅ already 1.0; commit notes this case triggered via B/C path (rewrite split on metric-vs-relevance, not axis); defense-in-depth in action |

⚠️ Trace-level verification (output-format check), not a full judge rerun. A future full rerun will confirm the lift.

Takeaway

01902 wasn't fixed by "adding a prompt rule"; it took changes across five layers, from rewrite → orchestrator → fan-out → aggregator. Three deeper insights:

  1. B/C is defense-in-depth, not redundancy: 00222's dual answer triggered via the B/C path (metric-vs-relevance), not axis-split. No single layer reliably catches everything.
  2. 00005 boundary shows prompt-only limits: dual answer covers "should we dual-answer?", but not "which sub-formula?". The latter needs a domain knowledge base (GAAP enumeration of WC formulas), not prompts.
  3. Structural fix > one-off patch: 01902's root mode is "superlative-without-axis." Any "best/largest/top" question benefits — the fix saves a class of failures, not just one question.

The 8 Unresolved: Solution in Mind, Not Worth Doing Now

8 questions remain (≤ 0.5). This section isn't "I don't know how to fix" — every one has a concrete solution path. The reason for not fixing each is documented below.

Category 1: Dataset-side issues (4 questions) — outside pipeline scope

| id | Issue |
|---|---|
| 02419 Pfizer spinoff | Truth wrong (Upjohn divested Nov 2020); agent "No" is factually correct |
| 01328 PEP restructuring | Question self-contradictory: "if not outlined, state 0" but truth $411M from notes |
| 04458 Netflix EBITDA margin | Definition split: full-D&A 56.8% / PP&E-only 5.4%; agent picks former |
| 00283 Pfizer Upjohn future | Mixed A+B: agent computes $70M = $700M × 10% (correct behavior), but truth $77.78M (~10% off), judged strictly |

Solution: P5 (fix dataset truth or question text). Not in pipeline scope. Production RAG ceiling is structurally bounded by dataset noise — typically 95-97%, not 100%.

Category 2: Fiscal calendar mismatch — needs a company metadata layer (1 question)

`00460` BBY stores:
   Truth: 982 → 969 (fiscal-year aligned)
   Agent: 930 → 907 (calendar-year aligned)

Best Buy's fiscal year ends in late January; "Q2 FY2024" = quarter ending July 2023 in their fiscal calendar. The agent likely interpreted the period as calendar Q2 (June 2022 / 2023).

Solution: Not solvable by prompt engineering — needs a company metadata layer:

company_metadata = {
    "BBY":  {"fiscal_year_end_month": 1, "ticker": "BBY", "industry": "Specialty Retail"},   # ...
    "AAPL": {"fiscal_year_end_month": 9, "ticker": "AAPL", "industry": "Tech Hardware"},     # ...
    # ...
}

Agent flow: query contains a period reference → look up company_metadata first to resolve fiscal-to-calendar mapping → then retrieve. This is the layer Bloomberg Terminal / Capital IQ / FactSet maintain internally.
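
The fiscal-to-calendar lookup itself is small once the metadata exists. A sketch under two loud assumptions: the fiscal year is labeled by the calendar year in which it ends, and month granularity is enough (52/53-week retail calendars are ignored). The function name is hypothetical; the two metadata entries reuse the values shown above.

```python
company_metadata = {
    "BBY": {"fiscal_year_end_month": 1},
    "AAPL": {"fiscal_year_end_month": 9},
}


def fiscal_quarter_end(fiscal_year: int, quarter: int, fye_month: int) -> tuple[int, int]:
    """Return (calendar_year, month) in which a fiscal quarter ends.

    Assumes the fiscal year is named after the calendar year it ends in,
    and resolves to the month only, not the exact closing date.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    months_back = 3 * (4 - quarter)
    # Count months from year 0 so the subtraction can cross a year boundary.
    index = fiscal_year * 12 + (fye_month - 1) - months_back
    return index // 12, index % 12 + 1


bby = company_metadata["BBY"]["fiscal_year_end_month"]
print(fiscal_quarter_end(2024, 2, bby))  # (2023, 7)
```

Resolving the period before retrieval lets the query carry an explicit calendar month, so the retriever is no longer guessing which "Q2" the question means.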

Why not now: real infrastructure work — data sourcing (SEC EDGAR? Compustat? manual?), schema, agent state integration, maintenance. Likely 1-2 weeks, outside this sprint. Mandatory for any production financial QA system.

Category 3: P0 reflection misfires (2 questions) — solution clear, ROI marginal

| id | Should | Actual |
|---|---|---|
| 00299 JPM lowest segment Q1 2021 | Dual: "Corporate −$473M / 4-reportable CB $2,393M" | Excludes Corporate ("not a typical reportable segment"), answers CB $2,393M. Wrong direction. |
| 00790 CVS capital-intensive yes/no | "Yes" + caveat (ROA 1.82%) | Refuses, claims insufficient data |

Solution:

  • 00299: move reflection from advisory to enforced structured output — add a Pydantic schema requiring interpretations: List[str] (≥ 2 entries for ambiguous cases). Agent can't silently skip — schema validation fails.
  • 00790: tighten anti-refusal rule for yes/no — if retrieved contexts have any indirect signal (ROA, ratio, trend), require a directional answer + caveat. Refusal only when no related figures exist.
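
The enforcement idea for 00299 can be illustrated with the standard library alone. The post proposes a Pydantic schema; the dataclass below is a stand-in that shows the same constraint, and every name other than interpretations is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FinalAnswer:
    """Hypothetical final-answer schema with an enforced dual-answer rule."""
    answer: str
    ambiguity_detected: bool = False
    interpretations: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Enforce rather than advise: ambiguous questions need >= 2 readings.
        if self.ambiguity_detected and len(self.interpretations) < 2:
            raise ValueError(
                "ambiguous question: provide at least 2 interpretations"
            )
```

The difference from a prompt rule is where the failure lands: a skipped interpretation becomes a validation error the graph can retry on, instead of a silently single-axis answer.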

Why not now:

  1. Invasive: enforced structured output touches graph state schema and the final-answer node; regression risk on currently-passing questions
  2. Side effects: tighter anti-refusal might turn legitimate "data truly missing" refusals into hallucinations
  3. Diminishing returns: more reflection rules might recover 1-2 questions while regressing 3-4; net unclear

Deferred until similar patterns appear at scale in production data.

Category 4: ROI-not-worth — P2 architectural change (1 question)

`02119` JPM hypothetical liquidation value per share:
   Truth: $66.56 (= TBVPS)
   Agent: "Cannot calculate; Q1 2021 balance sheet not retrieved"

Two stacked issues:

  1. Vocab gap: question says "bankrupt / liquidate / per shareholder"; truth chunk says "Tangible Book Value Per Share (TBVPS) = $66.56". P3 retrieval_variants didn't bridge this specific gap.
  2. Fragmented-table pollution: JPM 2021 Q1 10-Q segment table on p003 split into 9 chunks, dominating top-10 candidates. The complete p006 chunk containing TBVPS got squeezed out.

Solution: P2 — Table-aware chunking + Auto-promote

  • Table-aware chunking (ingest side): identify markdown table structure; split by logical row groups (not character count); duplicate the header in every chunk so each is self-readable.
  • Auto-promote (retrieval side): after rerank, if multiple top-K chunks share the same parent_id (≥ N times), auto-merge them into the complete parent chunk to avoid fragmented placeholders.
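
Both halves of P2 can be sketched in a few lines. This is a simplification, not the planned implementation: the chunker splits on a fixed row count rather than detecting logical row groups, and the chunk dict schema (id, parent_id, text), the parents lookup, and the min_siblings threshold are all assumptions.

```python
from collections import Counter


def chunk_table(md_table: str, rows_per_chunk: int = 20) -> list[str]:
    """Split a markdown table by row groups, repeating the header in each chunk."""
    lines = [ln for ln in md_table.strip().splitlines() if ln.strip()]
    header, rows = lines[:2], lines[2:]  # header row + |---| separator row
    if not rows:
        return ["\n".join(header)]
    return [
        "\n".join(header + rows[i:i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]


def auto_promote(chunks: list[dict], parents: dict[str, str], min_siblings: int = 3) -> list[dict]:
    """Replace sibling fragments with their full parent once enough of them appear.

    chunks: reranked hits in rank order, each {"id", "parent_id", "text"}.
    parents: parent_id -> full parent text.
    """
    counts = Counter(c["parent_id"] for c in chunks)
    out, promoted = [], set()
    for c in chunks:
        pid = c["parent_id"]
        if counts[pid] >= min_siblings and pid in parents:
            if pid not in promoted:  # keep the parent at its best-ranked slot
                promoted.add(pid)
                out.append({"id": pid, "parent_id": pid, "text": parents[pid]})
        else:
            out.append(c)
    return out
```

In the JPM case, collapsing the nine p003 fragments into one parent would free eight of the top-10 slots, which is the mechanism by which the p006 chunk could re-enter the candidate set.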

Engineering scope:

  • Rewrite project/document_chunker.py (table-aware logic)
  • Modify project/rag_agent/tools.py: _search_child_chunks (auto-promote logic)
  • Reprocess all 168 PDFs + rebuild Qdrant index (child chunk schema changes)
  • Large blast radius: chunk distribution shifts may regress currently-passing questions

Why not now: ~16h + reindex, recovers 1 question, with regression risk on currently-passing questions. Textbook ROI miss.

But to be clear: this isn't "the design is bad" — it's "current sample size doesn't justify it." If fragmented-table failures occur at scale in production (e.g., >5 questions), P2 immediately becomes worthwhile — it improves global retrieval quality (less chunk pollution, fuller context), not just this one question.


Closing: 4 Takeaways + the Logic of Not Fixing

4 takeaways

1. Suspect the evaluator before suspecting the model. P1 was 30 minutes of judge code; 3 questions 0→1.0; zero agent rerun. Instinct says "fix the agent to output rounder numbers" — wasted effort, the agent was already correct. Sanity-checking the judge matters more than reaching for a bigger model.

2. Prompt interventions are best-effort; structural fixes are reliable. P0 v3 was forecast to recover 9-10 questions; actual was 4 fully + 2 partial. Reflection prompts trigger inconsistently: the same case might go through the ambiguity branch on one run and collapse to a single answer on the next. For reliability, push ambiguity into a structured output schema that forces the agent to fill the field. P4's 5-layer fix extends this: query rewrite + orchestrator MUST clause + answer-quality rule + fan-out + aggregator format constraint, end-to-end defense-in-depth.

3. Fixes have synergy, so single-fix verification isn't enough. 00606 didn't recover with P0 alone; +P3 got it to 1.0. 00540 is similar. Single-fix verify subsets miss this stacking. Run verification on the stacked pipeline, not isolated unit tests; the latter lie to you in multi-fix systems.

4. Failure classification drives priority, so don't patch indiscriminately. "61% are agent behavior" steered engineering effort to P0 / P4, not retrieval (already at Hit@5 = 98-100%). Without classification, the instinct is to keep tuning the reranker, which could have wasted a week. Understanding the failure distribution before acting is the biggest return from the deep-dive.

The trade-off: 4 reasons not to fix

Engineering isn't "fix everything" — it's knowing when to stop. The 8 unfixed fall into 4 categories:

| Category | Count | Reason not to fix now |
|---|---|---|
| Dataset-side (truth wrong / question contradicts / definition split) | 4 | Outside pipeline scope; production RAG ceiling is ~95-97% |
| Missing infrastructure (company metadata layer) | 1 | 1-2 weeks of real infra work; the layer Bloomberg / Capital IQ maintain internally; outside sprint, but mandatory for production |
| Architectural fix (table-aware chunking + reindex) | 1 | 16h + reindex + regression risk; not worth it for one question at current sample size |
| Diminishing returns on agent prompts | 2 | Each new reflection rule may regress others; net unclear |

The judgment isn't "is this change good?" but "is it worth doing at this sample size and time budget?" Same P2 change: not worth it at an 18-question sample, essential at production scale with 5% similar failures. Making this judgment is harder than knowing how to fix.

Why this kind of writeup matters

"I pushed correctness from 0.871 to 0.919" is a number anyone can recite. Explaining each of the 18 failures, what was changed, and why 8 weren't fixed is ownership.

If the project boiled down to one thing, I'd point to P4. It didn't save one question — it addressed an entire class of failures ("superlative-without-axis") that production will encounter repeatedly. Saving one question, preventing a failure class — that's what makes the engineering worth it.