Financial RAG Agent Optimization: Methods, Cases, and Data
I built an agentic RAG system for financial Q&A on FinanceBench. Once the baseline hit 0.87, the interesting work began: pushing it higher.
Project Setup
Dataset: FinanceBench — 150 questions across 168 page-level 10-K / 10-Q / 8-K / earnings release PDFs. Question types range from fact lookup ("3M's FY2018 capex") to interpretation ("Is Pfizer spinning off Upjohn?").
Baseline pipeline:
- Retrieval: voyage-finance-2 dense (1024d) + Qdrant BM25 sparse + RRF fusion + company-aware payload filter + voyage rerank-2
- Agent: LangGraph multi-step, Claude sonnet-4-6
- Eval: custom GPT-4.1 strict judge (0 / 0.5 / 1.0, category-calibrated for 6 FinanceBench truth types), not RAGAS auto-correctness. RAGAS mishandles numeric precision: when the agent returns the PDF-raw $302.6M against a truth of $303M, RAGAS penalizes the answer even though the agent is correct.
Results:
- dev (45): 0.922 (41.5/45)
- test (105): 0.871 (91.5/105)
- 18 failures total (16 at 0.0, 2 at 0.5)
A single score hides the shape of failure. I classified each of the 18 failures and applied 4 independent fixes (P1 / P0 v3 / P3 / P4), recovering 10 and pushing test to ~0.919. The remaining 8 are unresolved, each with a documented reason.
18 Failures in 6 Categories
| Class | Meaning | Count | Representative |
|---|---|---|---|
| A Output discipline | Agent behavior: refusal / over-reasoning / missed qualifier / missed alt-definition | 11 | 00540 AES refuses to compute inventory turnover |
| B Judge over-strict | Agent gives PDF-raw value; truth is rounded; strict judge sees literal mismatch | 3 | 04171 MGM AP $302.6M vs truth $303M |
| C Dataset truth error | Truth itself is wrong | 1 | 02419 Pfizer: agent "No" is factually correct, truth wrong |
| D Question self-contradiction | Question text contradicts itself | 1 | 01328 PEP: text says "if not outlined state 0" but truth is $411M from notes |
| E Retrieval gap | Vocabulary gap / chunk pollution | 1 | 02119 JPM TBVPS |
| F Period / semantic | Calendar-vs-fiscal mismatch / "best by what" interpretation | 2 | 00460 BBY agent calendar 930→907 vs truth fiscal 982→969 |
Key insight: 11/18 (61%) are agent behavior, not retrieval failures. With retrieval at dev Hit@5 = 100% / test Hit@5 = 98%, the bottleneck is no longer recall — it's how the agent uses what it retrieved.
This drove fix priority: cheap-but-impactful first (P1, judge-side) → agent behavior (P0 v3 + P4) → retrieval hardening (P3).
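For reference, the Hit@5 figures quoted above can be computed along the lines of the sketch below. The function names and the idea of identifying evidence by string IDs are assumptions for illustration; the project's actual eval harness is not shown in this writeup.

```python
from typing import Iterable, List, Sequence, Tuple


def hit_at_k(retrieved: Sequence[str], gold: Iterable[str], k: int = 5) -> bool:
    """True if any gold evidence ID appears among the top-k retrieved IDs."""
    gold_set = set(gold)
    return any(doc_id in gold_set for doc_id in retrieved[:k])


def hit_rate(runs: List[Tuple[Sequence[str], Iterable[str]]], k: int = 5) -> float:
    """Fraction of questions with at least one gold ID in the top k."""
    if not runs:
        return 0.0
    return sum(hit_at_k(retrieved, gold, k) for retrieved, gold in runs) / len(runs)
```

A Hit@5 near 100% only says the evidence was retrieved; it says nothing about whether the agent used it correctly, which is the gap the classification above exposes.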
Fix 1: Suspect the Judge Before the Agent (3 questions recovered)
Trigger: 04171 MGM Accounts Payable
Question: FY2018 year-end accounts payable for MGM Resorts (USD millions).
Truth: $303.00
Agent's answer:
The year-end FY2018 accounts payable... was $302,578 thousand, or approximately $302.6 million.
Strict judge:
Answer gives $302.6M, doesn't match truth $303.00M; strict extraction requires exact match → 0.0
The agent isn't wrong — $302.6M is the PDF-raw figure; $303M is the rounded truth. 0.13% gap.
Same pattern: 03473 KO ROA (0.01425 vs 0.01), 04980 PEP capex ($4.625B vs $4.60B).
Fix: deterministic ±2% prefilter before the LLM judge
No agent changes. Just a pre-step in bench/judge.py: for pure_number truths, run a two-stage deterministic check (relative tolerance + round-to-truth-precision); pass → 1.0 directly.

    def _judge_numeric_prefilter(truth, answer, tol=0.02):
        """Two-stage check for bare-number truths:
        1. ±tol relative match (handles $302.6 ≈ $303)
        2. Round answer to truth's effective precision (handles 0.01425 → 0.01)
        Also tries ×1000 unit shifts (million/billion)."""
        # _extract_numbers and _truth_effective_decimals are helpers not shown here.
        truth_nums = _extract_numbers(truth)
        answer_nums = _extract_numbers(answer)
        decimals = _truth_effective_decimals(truth)
        for tv in truth_nums:
            for av in answer_nums:
                # Try the answer as-is, then shifted one unit step each way.
                for factor in (1.0, 1000.0, 1.0 / 1000.0):
                    scaled = av * factor
                    # Stage 1: relative tolerance (skipped when truth is 0
                    # to avoid dividing by zero).
                    if tv != 0 and abs(scaled / tv - 1.0) <= tol:
                        return 1.0, "match"
                    # Stage 2: equal after rounding to the truth's precision.
                    if decimals >= 0 and round(scaled, decimals) == round(tv, decimals):
                        return 1.0, "round-match"
        return 0.0, "no match"

Two safety constraints:
- Only fires on bare-number truths — regex-gated; truths with yes/no, labels, or directional words fall through to the LLM judge
- ±2% calibrated on dev set — wider tolerance (e.g. ±10%) would let wrong-magnitude answers through
Results
| id | Truth | Agent | Pre | Post |
|---|---|---|---|---|
| 03473 KO ROA | 0.01 | 0.01425 | 0.0 | 1.0 |
| 04171 MGM AP | $303M | $302.6M | 0.0 | 1.0 |
| 04980 PEP capex | $4.60B | $4.625B | 0.0 | 1.0 |
Takeaway
3 questions, 0→1.0, in ~30 min of judge-side code. Zero agent rerun. Sanity-check the evaluator before tuning the agent — instinct says "fix the agent to output rounder numbers"; wasted effort, since the agent is already correct.
Fix 2: Three Agent Pathologies — Reflection + Anti-Refusal + Vocab Expansion (4 recovered + 2 partial)
Three patterns across the 11 Class A failures
- Missed question qualifier: question asks "in the future"; agent returns total
- Hedge-clause premature exit: question offers "if X not meaningful, explain why"; agent takes the exit without trying
- Missed alt-definition: truth uses "operating WC"; agent uses "total WC" — both valid GAAP, agent should dual-answer
One prompt rule can't catch all three. Added three structural constraints.
Change 1: Answer Quality Rules in orchestrator prompt
Added to get_orchestrator_prompt() in project/rag_agent/prompts.py:
Answer Quality Rules:
- Honor question qualifiers exactly. If the question targets a specific subset
(e.g., "in the future", "remaining", "year-end", "average", "net of X"),
return that exact quantity — NOT a related total or sibling figure.
- Perform arithmetic when needed. If retrieved data gives components but not
the final number, DO the math with one short line of working. Example:
Q asks "remaining" + data shows total=$700M, "90% incurred"
→ answer "remaining = $700M × (100% − 90%) = $70M"
- Attempt calculation before invoking hedge clauses. If a question offers
an exit ("if X is not meaningful, explain why"), use it ONLY when underlying
data is genuinely absent. If inputs are present, compute first; optionally
add a caveat. NEVER refuse a calculation because the metric is unconventional
for the industry — when the line items are visible in retrieved chunks.
Change 2: Standalone anti_refusal_check node
Prompt rules alone don't catch every refusal. Added a LangGraph node that activates when the draft answer matches refusal patterns ("cannot determine", "data is missing", "insufficient context"). The reviewer is evidence-grounded:

    def get_anti_refusal_prompt():
        return """You are a strict evidence-grounded reviewer.
    A retrieval agent produced a DRAFT answer that REFUSES the user's question.
    Verify whether the refusal is correct.
    CRITICAL RULES:
    1. Use ONLY information explicitly stated in RETRIEVED CONTEXTS.
    2. Do NOT speculate, infer, or fabricate numbers.
    3. If contexts contain direct evidence — even partial:
       → Rewrite using that evidence. Cite the source chunk.
       → Use qualifiers ("approximately", "based on available figures").
       → DO NOT refuse; an evidence-backed approximate answer beats refusal.
    4. If contexts do NOT contain direct evidence:
       → Confirm the refusal, output draft UNCHANGED.
       → Refusal IS correct when data is genuinely missing — do not invent.
    """

Key design: a prompt-constrained reviewer. It can only rewrite from retrieved contexts, not invent — avoiding the "fix refusal by hallucinating" failure mode.
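The activation condition for the node can be sketched as a plain routing function. The three refusal phrases come from the description above; the state key `draft_answer` and the node name `finalize` are hypothetical. In LangGraph, a function of this shape could be passed to `StateGraph.add_conditional_edges`, but the wiring itself is omitted here.

```python
import re

# The three phrases named in the text; a real list would likely be longer.
REFUSAL_PATTERNS = [
    r"cannot determine",
    r"data is missing",
    r"insufficient context",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(draft: str) -> bool:
    """True if the draft answer contains any known refusal phrase."""
    return _REFUSAL_RE.search(draft) is not None


def route_after_draft(state: dict) -> str:
    """Name of the next node: review refusals, pass everything else through."""
    draft = state.get("draft_answer", "")  # hypothetical state key
    return "anti_refusal_check" if looks_like_refusal(draft) else "finalize"
```

Keeping the trigger deterministic means the reviewer only spends an extra LLM call on drafts that actually refuse.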
Change 3: rewrite_query adds ambiguity_detected + retrieval_variants
Two new output fields on the query rewrite step:
- ambiguity_detected + ambiguous_terms: flag terms like "working capital" (operating vs total), "EBITDA" (GAAP vs adjusted), "best by X" so the agent knows to dual-answer
- retrieval_variants: translate colloquial query terms ("residual value to equity holders") into the line-item vocabulary filings actually use
Hero case: 01911 MGM Interest Coverage
Question: MGM's FY2022 interest coverage ratio using Adjusted EBIT / Interest Expense.
Truth: Adjusted EBIT is negative; coverage ratio is zero.
Before:
The earnings release does not present a standalone Adjusted EBIT line item. The closest proxy is taking Operating Income and adding... [computes 2.4x]
Agent uses OpInc as a proxy → 2.4x. Truth is 0. Direction wrong → 0.0.
After P0 v3:
Adjusted EBIT, when properly reconciled from Adjusted EBITDAR through D&A and rent, is negative for FY2022. The coverage ratio is functionally zero or undefined.
Judge:
Correctly identifies Adjusted EBIT is negative, coverage ratio zero → 1.0
The agent learned to accept "negative" as a valid answer instead of forcing a cleaner-looking proxy.
Other P0 recoveries
| id | Pre | Post | How |
|---|---|---|---|
| 00222 AMD QR | 0.5 | 1.0 | Baseline gives standard QR 1.73; P0 triggers alt-definition → recomputes with related-party AR → 1.57 = truth |
| 00605 Ulta repurchase | 0.0 | 1.0 | Baseline computes 36.5% but refuses due to FY-naming doubt; P0 forces "compute first, caveat second" → 36.5% with FY caveat |
| 00606 Ulta wages | 0.0 | 0.0 (P0 alone) → 1.0 (+P3) | P0 alone insufficient; +P3 multi-query unlocks it (see Fix 3) |
Partial recoveries
00005 Corning WC (0.5 → 0.5): truth uses operating WC = $831M; agent uses total WC = $2,278M. I expected P0 reflection to trigger a dual answer covering both, but the agent output two numbers ($2,278M and $2,821M), both total-WC variants, and missed the operating-WC formula entirely. The reflection prompt fired, but the agent didn't surface the right alt-definition. Prompt-based reflection is best-effort, not guaranteed.
00540 AES inventory turnover (0.0 → P0 = 0.0 → +P3 = 0.5): see Fix 3.
Takeaway
- P0 v3 recovered 4 fully + 2 partial — not the 9-10 the analysis doc predicted. Prompt-based interventions are hard to predict.
- A standalone review node beats prompt rules: the former is evidence-grounded second-pass; the latter relies on the LLM self-policing.
- Next iteration: bake `ambiguity_detected` into a structured output schema so the agent is forced to fill dual-answer fields, not just asked to.
Fix 3: Translate Conversational Queries into Filing Vocabulary (a critical co-fix)
Trigger: cases where P0 alone can't save retrieval-poor queries
00606 (Ulta wages) still refuses after P0: "SG&A data is present, but specific wages breakdown not retrieved — cannot make a directional call."
The problem isn't agent judgment — retrieved contexts don't contain SG&A breakdown. A single query "Ulta FY2023 wages as % of net sales" returns SG&A totals, not store-payroll details.
Change: retrieval_variants in query rewrite
Retrieval vocabulary expansion (additional output):
- If the question uses everyday/conceptual vocabulary that SEC filings
likely express with different terminology, output 1-2 "retrieval variants":
semantically equivalent rewrites using financial-document vocabulary.
- Principle: filings use accounting line-item language, not business intuition.
When the question uses an analytical concept ("residual value", "capital
intensity"), translate to the line items that appear in income statement /
balance sheet / cash flow / segment notes.
- Be conservative: when the question already uses standard line items
("net income", "operating cash flow"), no variants needed.
The original query and each variant run independent retrievals; results are RRF-merged.
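A minimal sketch of that fan-out-and-merge step follows. The `search` callable stands in for the real retriever, and `k=60` is the commonly used RRF constant, not a value confirmed for this project.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve_with_variants(
    query: str,
    variants: Sequence[str],
    search: Callable[[str], List[str]],
    top_n: int = 10,
) -> List[str]:
    """Run the original query and each variant separately, then fuse."""
    rankings = [search(q) for q in [query, *variants]]
    return rrf_merge(rankings)[:top_n]
```

A chunk that appears in several variant rankings accumulates score from each, so evidence reachable only through filing vocabulary can still surface without displacing strong hits from the original query.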
Hero case: 00606 Ulta wages (P0 + P3 synergy)
Before (and after P0 alone): refusal — "comparison cannot be completed."
After P0 + P3: rewrite outputs variant "Ulta FY2023 SG&A store payroll deleverage components" → retrieves SG&A breakdown → agent answers "store payroll deleveraged in FY2023, meaning wages as % of net sales increased." → matches truth direction → 1.0.
Co-fix observation: 00540 AES
Same pattern. P0 alone = 0.0. With P3, rewrite splits the ratio into separate retrievals ("AES FY2022 cost of sales" + "AES FY2022 inventory") → income statement line "Total cost of sales: $10,069M" + two-year inventory → agent computes 9.5x = $10,069M / $1,055M.
Judge:
Answer states 12.1x (average inventory) and 9.5x (ending inventory); truth is 9.5x. Answer highlights 12.1x as primary, so one number correct but not as main assertion → 0.5
Partial credit. Agent treated ending inventory as secondary; truth used ending. Fixes have synergy — single-fix verify subsets don't reveal stacking effects.
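The gap between the two figures is purely definitional. The sketch below makes both denominators explicit; the prior-year inventory balance is not stated in this writeup, so it is left as a parameter rather than filled in.

```python
from typing import Optional


def inventory_turnover(
    cost_of_sales: float,
    ending_inventory: float,
    beginning_inventory: Optional[float] = None,
) -> float:
    """Ending-inventory turnover by default; average-inventory turnover
    when a beginning balance is supplied."""
    if beginning_inventory is None:
        denominator = ending_inventory
    else:
        denominator = (beginning_inventory + ending_inventory) / 2.0
    if denominator == 0:
        raise ValueError("inventory denominator is zero")
    return cost_of_sales / denominator


# AES FY2022, ending-inventory definition (figures in $M from the text above).
print(round(inventory_turnover(10_069, 1_055), 1))  # 9.5
```

When the question does not say which denominator it wants, presenting the ending-inventory figure alongside the average one, rather than demoting it, would have earned full credit here.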
Takeaway
P3 design principle: filings use accounting language; users use intuition language — translate the gap. Stay conservative: when the question already uses line-item names, don't expand (introduces noise).
Fix 4: Qualifier Ambiguity — Saving One Question, Preventing a Failure Class
Trigger: qualifier ambiguity outside P0 v3 coverage
01902 Best Buy "best USA category by top line" exposed a root failure mode P0 v3 missed. P0 v3's ambiguity_detected covered terminology ambiguity (working capital / EBITDA / FY fiscal-vs-calendar) but not qualifier ambiguity: superlatives without a specified axis.
Question: Best Buy product category that performed best (by top line) in
domestic USA market during Q2 FY2024.
Truth: Entertainment +9% growth (gaming-driven).
Agent baseline: Computing & Mobile Phones (revenue absolute, ~$3.6B).
- Interpretation A: "top line" = revenue absolute → Computing $14B
- Interpretation B: "top line" = revenue growth → Entertainment +9%
The same root pattern affects any "best / largest / top / leading / primary / key / main" question without explicit axis — a structural fix, not a single-case patch.
Change: 5-layer fix (commit 12e51c9)
A single prompt rule isn't enough: even after detecting qualifier ambiguity, the agent will "silently collapse" to one axis at answer time. The fix needs end-to-end changes across query rewrite → orchestrator → fan-out → aggregator:
| Layer | File | Change |
|---|---|---|
| A | prompts.py: get_rewrite_query_prompt | Split ambiguity into (a) terminology vs (b) qualifier; (b) is non-conservative on superlatives; output format `<term> — axes: A \| B` |
| B | nodes.py: orchestrator | Inject AMBIGUITY NOTE with 3 MUST clauses: cover each axis / enumerate per-axis values / never silently collapse |
| C | prompts.py: get_orchestrator_prompt | "Honor ambiguity notes" rule — survives context compression |
| D | prompts.py rule 4 | Axis-split fan-out: one rewritten question per axis (max 3), each explicitly naming the axis |
| D' | nodes.py: aggregate_answers + get_aggregation_prompt | Pass ambiguousTerms to aggregator; rule 8 enforces By <axis A>: ...; By <axis B>: ... format |
For 01902, fan-out splits into:
- "Best Buy USA Q2 FY2024 best category by revenue absolute"
- "Best Buy USA Q2 FY2024 best category by revenue growth"
LangGraph Send() routes each to an independent subgraph → independent retrieval → aggregator stitches into "By revenue: Computing & Mobile Phones $14B; By growth: Entertainment +9% gaming-driven."
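Stripped of the graph machinery, the axis split and the aggregator format reduce to two small functions. This is a sketch with hypothetical names; the real implementation wraps each rewritten question in a LangGraph `Send`, which is omitted here to keep the example dependency-free.

```python
from typing import Dict, List

MAX_AXES = 3  # the fan-out rule caps rewritten questions at three


def split_by_axis(base_question: str, axes: List[str]) -> List[str]:
    """One rewritten question per axis, each naming its axis explicitly."""
    return [f"{base_question} by {axis}" for axis in axes[:MAX_AXES]]


def format_dual_answer(per_axis: Dict[str, str]) -> str:
    """Aggregator format: 'By <axis A>: ...; By <axis B>: ...'."""
    return "; ".join(f"By {axis}: {answer}" for axis, answer in per_axis.items())


questions = split_by_axis(
    "Best Buy USA Q2 FY2024 best category",
    ["revenue absolute", "revenue growth"],
)
```

Because each axis gets its own question and its own slot in the output, the aggregator has no single-answer path to collapse into.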
Verification (trace-level, 4 cases)
| id | Effect | Status |
|---|---|---|
| 01902 BBY best-category | dual answer (revenue + growth) | ✅ fully recovered |
| 00460 BBY store change | covers fiscal 982→969 + calendar 930→907 | ⚠️ trace improved; final judge credit depends on dual-hedge acceptance |
| 00005 Corning WC | dual (narrow + operating) | ⚠️ known boundary: agent operating = AR+Inv−AP = $2.8B, not truth's (CA−cash)−(CL−ShortTermDebt) = $831M; sub-formula enumeration out of scope |
| 00222 AMD QR | dual (1.57 + 0.92 cash) | ✅ already 1.0; commit notes this case triggered via B/C path (rewrite split on metric-vs-relevance, not axis), an instance of defense-in-depth |
⚠️ Trace-level verification (output-format check), not a full judge rerun. A future full rerun will confirm the lift.
Takeaway
01902 wasn't fixed by "adding a prompt rule"; it took changes at five layers across rewrite → orchestrator → fan-out → aggregator. Three deeper insights:
- B/C is defense-in-depth, not redundancy: 00222's dual answer triggered via the B/C path (metric-vs-relevance), not axis-split. No single layer reliably catches everything.
- 00005 boundary shows prompt-only limits: dual answer covers "should we dual-answer?", but not "which sub-formula?". The latter needs a domain knowledge base (GAAP enumeration of WC formulas), not prompts.
- Structural fix > one-off patch: 01902's root mode is "superlative-without-axis." Any "best/largest/top" question benefits — the fix saves a class of failures, not just one question.
The 8 Unresolved: Solution in Mind, Not Worth Doing Now
8 questions remain (≤ 0.5). This section isn't "I don't know how to fix" — every one has a concrete solution path. The reason for not fixing each is documented below.
Category 1: Dataset-side issues (4 questions) — outside pipeline scope
| id | Issue |
|---|---|
| 02419 Pfizer spinoff | Truth wrong (Upjohn divested Nov 2020); agent "No" is factually correct |
| 01328 PEP restructuring | Question self-contradictory: "if not outlined, state 0" but truth $411M from notes |
| 04458 Netflix EBITDA margin | Definition split: full-D&A 56.8% / PP&E-only 5.4%; agent picks former |
| 00283 Pfizer Upjohn future | Mixed A+B: agent computes $70M = $700M × 10% (correct behavior), but truth $77.78M (~10% off), judged strictly |
Solution: P5 (fix dataset truth or question text). Not in pipeline scope. Production RAG ceiling is structurally bounded by dataset noise — typically 95-97%, not 100%.
Category 2: Fiscal calendar mismatch — needs a company metadata layer (1 question)
`00460` BBY stores:
Truth: 982 → 969 (fiscal-year aligned)
Agent: 930 → 907 (calendar-year aligned)
Best Buy's fiscal year ends in late January; "Q2 FY2024" = quarter ending July 2023 in their fiscal calendar. The agent likely interpreted it as calendar Q2 (June 2022 / 2023).
Solution: Not solvable by prompt engineering — needs a company metadata layer:

    company_metadata = {
        "BBY": {"fiscal_year_end_month": 1, "ticker": "BBY", "industry": "Specialty Retail"},  # ... more fields
        "AAPL": {"fiscal_year_end_month": 9, "ticker": "AAPL", "industry": "Tech Hardware"},  # ... more fields
        # ... more companies
    }

Agent flow: query contains a period reference → look up company_metadata first to resolve fiscal-to-calendar mapping → then retrieve. This is the layer Bloomberg Terminal / Capital IQ / FactSet maintain internally.
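The fiscal-to-calendar resolution itself is simple arithmetic once the metadata exists. The sketch below assumes a fiscal year is labeled by the calendar year in which it ends, which matches the Best Buy example above; the lookup table and function name are hypothetical.

```python
from typing import Dict, Tuple

# Only the fiscal_year_end_month values come from the text above.
FISCAL_YEAR_END_MONTH: Dict[str, int] = {"BBY": 1, "AAPL": 9}


def fiscal_quarter_end(ticker: str, fiscal_year: int, quarter: int) -> Tuple[int, int]:
    """Return (calendar_year, month) in which the given fiscal quarter ends.

    Assumes the fiscal year is named after the calendar year it ends in.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    fye_month = FISCAL_YEAR_END_MONTH[ticker]
    months_before_year_end = (4 - quarter) * 3
    # Count months on a single axis so year rollover is handled by division.
    month_index = fiscal_year * 12 + (fye_month - 1) - months_before_year_end
    return month_index // 12, month_index % 12 + 1


print(fiscal_quarter_end("BBY", 2024, 2))  # (2023, 7)
```

Resolving the period before retrieval lets the query carry "quarter ending July 2023" instead of the ambiguous "Q2 FY2024".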
Why not now: real infrastructure work — data sourcing (SEC EDGAR? Compustat? manual?), schema, agent state integration, maintenance. Likely 1-2 weeks, outside this sprint. Mandatory for any production financial QA system.
Category 3: P0 reflection misfires (2 questions) — solution clear, ROI marginal
| id | Should | Actual |
|---|---|---|
| 00299 JPM lowest segment Q1 2021 | Dual: "Corporate −$473M / 4-reportable CB $2,393M" | Excludes Corporate ("not a typical reportable segment"), answers CB $2,393M. Wrong direction. |
| 00790 CVS capital-intensive yes/no | "Yes" + caveat (ROA 1.82%) | Refuses, claims insufficient data |
Solution:
- 00299: move reflection from advisory to enforced structured output. Add a Pydantic schema requiring `interpretations: List[str]` (≥ 2 entries for ambiguous cases). The agent can't silently skip it because schema validation fails.
- 00790: tighten the anti-refusal rule for yes/no questions. If retrieved contexts have any indirect signal (ROA, ratio, trend), require a directional answer + caveat. Refusal only when no related figures exist.
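The enforcement idea for 00299 can be illustrated with a stdlib dataclass standing in for the Pydantic schema. The class and field names other than `interpretations` and `ambiguity_detected` are hypothetical, and this is a sketch of the validation rule, not of how it would plug into the graph state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FinalAnswer:
    answer: str
    ambiguity_detected: bool = False
    interpretations: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforced rather than advisory: an ambiguous question must carry
        # at least two readings, or construction fails.
        if self.ambiguity_detected and len(self.interpretations) < 2:
            raise ValueError("ambiguous question needs at least 2 interpretations")
```

A failed validation gives the graph a concrete signal to retry the answer step, whereas a prompt rule can be ignored without leaving any trace.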
Why not now:
- Invasive: enforced structured output touches graph state schema and the final-answer node; regression risk on currently-passing questions
- Side effects: tighter anti-refusal might turn legitimate "data truly missing" refusals into hallucinations
- Diminishing returns: more reflection rules might recover 1-2 questions while regressing 3-4; net unclear
Deferred until similar patterns appear at scale in production data.
Category 4: ROI-not-worth — P2 architectural change (1 question)
`02119` JPM hypothetical liquidation value per share:
Truth: $66.56 (= TBVPS)
Agent: "Cannot calculate; Q1 2021 balance sheet not retrieved"
Two stacked issues:
- Vocab gap: question says "bankrupt / liquidate / per shareholder"; truth chunk says "Tangible Book Value Per Share (TBVPS) = $66.56". P3 retrieval_variants didn't bridge this specific gap.
- Fragmented-table pollution: JPM 2021 Q1 10-Q segment table on p003 split into 9 chunks, dominating top-10 candidates. The complete p006 chunk containing TBVPS got squeezed out.
Solution: P2 — Table-aware chunking + Auto-promote
- Table-aware chunking (ingest side): identify markdown table structure; split by logical row groups (not character count); duplicate the header in every chunk so each is self-readable.
- Auto-promote (retrieval side): after rerank, if multiple top-K chunks share the same `parent_id` (≥ N times), auto-merge them into the complete parent chunk to avoid fragmented placeholders.
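Both halves of P2 can be sketched in a few lines. This is a simplified illustration under stated assumptions: the chunker splits by a fixed row count rather than by logical row groups, chunks are plain dicts with `parent_id` and `text` keys, and the threshold default of 3 is arbitrary.

```python
from collections import Counter
from typing import Dict, List


def chunk_markdown_table(table: str, rows_per_chunk: int = 20) -> List[str]:
    """Split a markdown table by rows, repeating the header in every chunk."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header, body = lines[:2], lines[2:]  # header row + |---| separator row
    return [
        "\n".join(header + body[i:i + rows_per_chunk])
        for i in range(0, len(body), rows_per_chunk)
    ]


def auto_promote(
    ranked: List[Dict[str, str]],
    parents: Dict[str, str],
    min_siblings: int = 3,
) -> List[str]:
    """Replace groups of >= min_siblings sibling chunks with the parent text,
    keeping the position of the first sibling in the ranking."""
    counts = Counter(chunk["parent_id"] for chunk in ranked)
    promoted = set()
    out: List[str] = []
    for chunk in ranked:
        pid = chunk["parent_id"]
        if counts[pid] >= min_siblings and pid in parents:
            if pid not in promoted:
                out.append(parents[pid])
                promoted.add(pid)
        else:
            out.append(chunk["text"])
    return out
```

Collapsing the fragments of one table into a single parent frees slots in the top-K list, which is what would let a squeezed-out chunk such as the one holding TBVPS back in.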
Engineering scope:
- Rewrite `project/document_chunker.py` (table-aware logic)
- Modify `project/rag_agent/tools.py: _search_child_chunks` (auto-promote logic)
- Reprocess all 168 PDFs + rebuild Qdrant index (child chunk schema changes)
- Large blast radius: chunk distribution shifts may regress currently-passing questions
Why not now: ~16h + reindex, recovers 1 question, with regression risk on the 17 passing ones. Textbook ROI miss.
But to be clear: this isn't "the design is bad" — it's "current sample size doesn't justify it." If fragmented-table failures occur at scale in production (e.g., >5 questions), P2 immediately becomes worthwhile — it improves global retrieval quality (less chunk pollution, fuller context), not just this one question.
Closing: 4 Takeaways + the Logic of Not Fixing
4 takeaways
1. Suspect the evaluator before suspecting the model. P1 was 30 minutes of judge code; 3 questions 0→1.0; zero agent rerun. Instinct says "fix the agent to output rounder numbers" — wasted effort, the agent was already correct. Sanity-checking the judge matters more than reaching for a bigger model.
2. Prompt interventions are best-effort; structural fixes are reliable. P0 v3 was forecast to recover 9-10 questions; actual was 4 fully + 2 partial. Reflection prompts trigger inconsistently: the same case might go through the ambiguity branch on one run and collapse to a single answer on the next. For reliability, push ambiguity into a structured output schema that forces the agent to fill the field. P4's 5-layer fix extends this: query rewrite + orchestrator MUST clause + answer-quality rule + fan-out + aggregator format constraint, end-to-end defense-in-depth.
3. Fixes have synergy, so single-fix verification isn't enough. 00606 didn't recover with P0 alone; +P3 got it to 1.0. 00540 followed the same pattern (0.0 → 0.5). Single-fix verify subsets miss this stacking. Run verification on the stacked pipeline, not isolated unit tests; the latter mislead in multi-fix systems.
4. Failure classification drives priority; don't patch indiscriminately. "61% are agent behavior" pointed engineering effort directly at P0 / P4, not retrieval (already at Hit@5 = 100% on dev and 98% on test). Without classification, the instinct is to keep tuning the reranker, which could have wasted a week. Understanding the failure distribution before acting is the biggest return from the deep-dive.
The trade-off: 4 reasons not to fix
Engineering isn't "fix everything" — it's knowing when to stop. The 8 unfixed fall into 4 categories:
| Category | Count | Reason not to fix now |
|---|---|---|
| Dataset-side (truth wrong / question contradicts / definition split) | 4 | Outside pipeline scope; production RAG ceiling is ~95-97% |
| Missing infrastructure (company metadata layer) | 1 | 1-2 weeks of real infra work; the layer Bloomberg / Capital IQ maintain internally; outside sprint, but mandatory for production |
| Architectural fix (table-aware chunking + reindex) | 1 | 16h + reindex + regression risk; not worth it for one question at current sample size |
| Diminishing returns on agent prompts | 2 | Each new reflection rule may regress others; net unclear |
The judgment isn't "is this change good?" but "is it worth doing at this sample size and time budget?" The same P2 change is not worth it at an 18-question sample and essential at production scale with 5% similar failures. Making this judgment is harder than knowing how to fix.
Why this kind of writeup matters
"I pushed correctness from 0.871 to 0.919" is a number anyone can recite. Explaining each of the 18 failures, what was changed, and why 8 weren't fixed is ownership.
If the project boiled down to one thing, I'd point to P4. It didn't save one question — it addressed an entire class of failures ("superlative-without-axis") that production will encounter repeatedly. Saving one question, preventing a failure class — that's what makes the engineering worth it.