Reducing False Positives in RAG with Semantic Caching: A Production Case Study

Author: Elakkiya Daivam

This article examines how to reduce false positives in Retrieval-Augmented Generation (RAG) systems that use semantic caching, based on a production-level evaluation of seven bi-encoder models against 1,000 query variations.


---

The Rise of Natural Language Interfaces

As natural language becomes the default medium for interacting with software—whether through intelligent search, chatbots, analytics assistants, or enterprise knowledge explorers—systems must process large volumes of queries that differ in wording but share the same intent.

Key requirements:

  • Fast, accurate retrieval
  • Reduced redundant LLM calls
  • Consistency in responses
  • Cost efficiency

---

Understanding Semantic Caching

Semantic caching stores queries and answers as vector embeddings, enabling reuse when new queries have similar meanings.

Compared to string-based caching, it operates on meaning and intent, bridging cases where phrasing changes but the user's need stays the same.
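As a minimal sketch of that difference, assuming the sentence-transformers library (all-MiniLM-L6-v2 is one of the bi-encoders evaluated below; the example queries are illustrative):

```python
# A string-keyed cache misses paraphrases; a semantic cache compares meaning.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

cached = "How do I cancel a Zelle payment?"
incoming = "Can I stop a Zelle transfer I just sent?"

# String-based caching: different keys, guaranteed miss.
print(cached == incoming)  # False

# Semantic caching: high cosine similarity signals a candidate cache hit.
emb = model.encode([cached, incoming], normalize_embeddings=True)
print(f"cosine similarity: {util.cos_sim(emb[0], emb[1]).item():.2f}")
```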

Benefits in production:

  • Faster responses
  • Stable output quality
  • Lower compute costs

Risk: Poorly designed caching can produce false positives—semantically close but incorrect matches—undermining trust.

---

Initial Deployment Challenges

In a financial services FAQ deployment, we encountered:

  • High-confidence wrong matches (e.g., account closure queries rerouted to payment cancellation).
  • Extreme mismatch rates: Some models with a 0.7 similarity threshold showed up to 99% false positives.

Lesson: A validated embedding model and threshold alone are not enough—cache design must be robust.

---

Experimental Methodology

System Architecture

  • Query-to-query semantic cache using FAISS.
  • Workflow (see the sketch after Figure 1):
    1. Convert incoming queries to embeddings.
    2. Compare against cached embeddings.
    3. Return the gold-standard answer if similarity ≥ threshold; otherwise, query the LLM and store the result.
Figure 1: High-Level Semantic Cache Architecture
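
A minimal sketch of this workflow, assuming FAISS and sentence-transformers; the class name and the call_llm hook are illustrative, not the production implementation:

```python
import faiss
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2", threshold=0.7):
        self.model = SentenceTransformer(model_name)
        dim = self.model.get_sentence_embedding_dimension()
        # Inner product over L2-normalized vectors equals cosine similarity.
        self.index = faiss.IndexFlatIP(dim)
        self.answers = []
        self.threshold = threshold

    def _embed(self, text):
        return self.model.encode([text], normalize_embeddings=True).astype("float32")

    def lookup(self, query, call_llm):
        vec = self._embed(query)
        if self.index.ntotal > 0:
            scores, ids = self.index.search(vec, 1)
            if scores[0][0] >= self.threshold:
                return self.answers[ids[0][0]]  # cache hit
        answer = call_llm(query)                # cache miss: ask the LLM
        self.index.add(vec)                     # store for future reuse
        self.answers.append(answer)
        return answer
```

Started with an empty index, this behaves as the Incremental Mode described next; Pre-Cache Mode seeds the index before serving.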

Cache Modes

  • Incremental Mode: Empty cache at start; grow on misses.
  • Pre-Cache Mode: Start preloaded with 100 gold answers + 300 crafted distractors.

Infrastructure

  • AWS g4dn.xlarge (NVIDIA T4 GPU, 16GB memory)
  • 1,000 query variations tested
  • Models: all-MiniLM-L6-v2, e5-large-v2, mxbai-embed-large-v1, bge-m3, Qwen3-Embedding-0.6B, jina-embeddings-v2-base-en, instructor-large
  • Dataset: 100 banking FAQs
  • Validation accuracy: 99.7%
  • Metrics: Cache hit %, LLM hit %, FP %, Recall@1, Recall@3, latency
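
A hedged sketch of how these metrics might be computed from an evaluation run; the record fields (gold_id, returned_id, top_ids, latency_ms) are illustrative names, not the study's actual harness:

```python
def summarize(records):
    n = len(records)
    hits = [r for r in records if r["returned_id"] is not None]
    false_pos = [r for r in hits if r["returned_id"] != r["gold_id"]]
    return {
        "cache_hit_%": 100 * len(hits) / n,
        "llm_hit_%": 100 * (n - len(hits)) / n,  # misses routed to the LLM
        "fp_%": 100 * len(false_pos) / max(len(hits), 1),
        "recall@1": sum(r["gold_id"] in r["top_ids"][:1] for r in records) / n,
        "recall@3": sum(r["gold_id"] in r["top_ids"][:3] for r in records) / n,
        "p50_latency_ms": sorted(r["latency_ms"] for r in records)[n // 2],
    }
```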

---

Dataset Design

Real-world banking queries from 10 domains (payments, loans, disputes, accounts, investments, ATMs, etc.).

Each FAQ → 10 variations + 3 distractor types:

  • topical_neighbor: 0.8–0.9 similarity
  • semantic_near_miss: 0.85–0.95 similarity
  • cross_domain: 0.6–0.8 similarity

Purpose: Test semantic precision under realistic conditions.
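
One way to enforce these bands when crafting distractors is to check each candidate's similarity against its source FAQ; a sketch assuming sentence-transformers, with the band table mirroring the ranges above:

```python
from sentence_transformers import SentenceTransformer, util

BANDS = {
    "topical_neighbor": (0.80, 0.90),
    "semantic_near_miss": (0.85, 0.95),
    "cross_domain": (0.60, 0.80),
}
model = SentenceTransformer("all-MiniLM-L6-v2")

def in_band(gold_query, distractor, kind):
    emb = model.encode([gold_query, distractor], normalize_embeddings=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    lo, hi = BANDS[kind]
    return lo <= score <= hi, score  # reject distractors outside the band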

---

Model Selection Considerations

Example FAQ:

"How do I cancel a Zelle payment?"

Core Intent

  • Cancel if recipient not enrolled
  • Instant transactions are irreversible
  • Unenrolled transactions cancelable via app or support

Query Variation Coverage

  • Formal, casual, typos, slang, urgency, vague phrasing

Distractor Handling

  • Same domain, different actions (view history)
  • Similar mechanism, different service (wire transfer reversal)
  • Different payment system entirely (credit card dispute)

Mitigation Techniques

  • Hybrid matching: dense embeddings → top candidates → LLM re-ranking
  • Preprocessing: spell correction, slang normalization
  • Action/domain filters to ensure intent match (sketched below)
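
A hedged sketch of the preprocessing and intent-filter ideas; the slang table and rule-based tagger are toy stand-ins (a production system would use richer lexicons or a trained classifier):

```python
SLANG = {"acct": "account", "pymt": "payment", "xfer": "transfer"}

def normalize(query):
    # Slang/typo normalization before embedding or matching.
    return " ".join(SLANG.get(tok, tok) for tok in query.lower().split())

def tag_intent(query):
    # Toy action/domain tagger; illustrates the filter, not the real one.
    action = "cancel" if any(w in query for w in ("cancel", "stop")) else "other"
    domain = "zelle" if "zelle" in query else "other"
    return action, domain

def accept_match(query, cached_query, similarity, threshold=0.7):
    # Similarity alone is not enough: action and domain must also agree.
    q, c = normalize(query), normalize(cached_query)
    return similarity >= threshold and tag_intent(q) == tag_intent(c)
```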

---

Experiment Results

Experiment 1: Zero-Shot Baseline


Findings:

  • Default threshold (0.7) → extremely high FP rates (up to 99%).
  • Absence of domain context produced inaccurate matches.

---

Experiment 2: Similarity Threshold Optimization


Outcome:

  • Threshold tuning helped but did not eliminate high FP rates.
  • Lower FPs came at the expense of higher cache misses → more costly LLM calls.
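
The trade-off is easy to see in a threshold sweep; a sketch assuming each record holds the top candidate's similarity score and whether it was the correct answer (illustrative fields):

```python
def sweep(records, thresholds=(0.70, 0.75, 0.80, 0.85, 0.90)):
    for t in thresholds:
        hits = [r for r in records if r["best_score"] >= t]
        fps = sum(not r["correct"] for r in hits)
        fp_rate = 100 * fps / max(len(hits), 1)
        miss_rate = 100 * (len(records) - len(hits)) / len(records)
        # Raising t lowers FPs but routes more queries to the costly LLM.
        print(f"t={t:.2f}  FP={fp_rate:5.1f}%  LLM-routed={miss_rate:5.1f}%")
```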

---

Experiment 3: Optimal Candidate Principle

Key Insight:

Strong candidate availability matters more than search optimization.

Approach:

  • Preload cache with 100 gold answers + 300 distractors
  • Simulate real-world similarity boundaries
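
A minimal sketch of the preloading step, building on the SemanticCache sketch above; treating a distractor hit as a miss is an assumption about how the boundary queries are used:

```python
def preload(cache, gold_faqs, distractors):
    # 100 gold Q->A pairs: retrievable answers.
    for question, answer in gold_faqs:
        cache.index.add(cache._embed(question))
        cache.answers.append(answer)
    # 300 crafted distractors shape the decision boundary: a query that
    # lands nearest a distractor returns None and is routed to the LLM
    # instead of being force-matched to the wrong gold answer.
    for question in distractors:
        cache.index.add(cache._embed(question))
        cache.answers.append(None)
```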

Improvement:

  • FP rate ↓ 59%
  • Cache hit rate ↑ to 68.4%–84.9%

---

Experiment 4: Cache Quality Control


Strategy:

  • Filter ambiguous, typo-heavy, or vague queries before caching
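
A hedged sketch of such an admission gate; the heuristics and thresholds are illustrative, and the optional spellchecker is assumed to expose a pyspellchecker-style known() method:

```python
VAGUE = {"it", "this", "that", "thing", "stuff"}

def cacheable(query, spellchecker=None):
    tokens = query.lower().split()
    if len(tokens) < 3:
        return False                      # too short to carry clear intent
    if sum(t in VAGUE for t in tokens) / len(tokens) > 0.3:
        return False                      # mostly vague referents
    if spellchecker is not None:
        typos = sum(1 for t in tokens if not spellchecker.known([t]))
        if typos / len(tokens) > 0.2:
            return False                  # typo-heavy; answer but don't cache
    return True
```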

Result:

  • FP rates < 6% for all but one model
  • Best performer: Large teacher model → 3.8% FP (96.2% reduction)

---

Conclusion

From 99% to 3.8% false positives — the transformation required:

  • Architectural redesign (Optimal Candidate Principle)
  • Cache quality safeguards
  • Domain-aware preloading

Preferred models:

  • Best accuracy: Large teacher model
  • Balanced: bge-m3
  • Low latency: all-MiniLM-L6-v2
  • Avoid: e5-large-v2 (persistently high FP rates)

---

Roadmap for <2% False Positives

Multi-Layer Improvements

  • Advanced preprocessing (typo/slang normalization)
  • Domain-specific fine-tuning
  • Multi-vector representation (content, intent, context)
  • Cross-encoder re-ranking (see the sketch below)
  • Rule-based domain validation layer
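
For the cross-encoder re-ranking layer, a sketch using sentence-transformers; ms-marco-MiniLM-L-6-v2 is a common public re-ranker chosen here for illustration, and accept_score is a tunable cutoff on its raw output:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, accept_score=0.0):
    # The bi-encoder proposes candidates; the cross-encoder scores each
    # (query, candidate) pair jointly for finer intent discrimination.
    scores = reranker.predict([(query, c) for c in candidates])
    best, best_score = max(zip(candidates, scores), key=lambda x: x[1])
    return best if best_score >= accept_score else None  # None -> fall back to LLM
```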

---

Lessons for Any RAG System

  • Cache design > threshold tuning
  • Preprocessing is mandatory to avoid polluting the cache
  • Pure similarity-based matching struggles with fine-grained intent differences

---

Final Thought

Fix the cache first. Then tune models. Architectural principles—not bigger embeddings—turn prototypes into production systems.

Original article:

https://www.infoq.com/articles/reducing-false-positives-retrieval-augmented-generation/
