## Key Takeaways
- **Semantic caching** — a **Retrieval-Augmented Generation (RAG)** technique — stores queries and responses as **vector embeddings** for reuse.
- Improves **efficiency** by avoiding repeated large language model (LLM) calls.
- Case study: failure ➡ production success through **7 bi-encoder models**, **4 experiments**, and **1,000 banking queries**.
- **Three model types** tested: compact, large-scale, domain-specialized.
- Achieving **<5% false positives** requires a **layered architecture**: query pre-processing ➡ fine-tuned embeddings ➡ multi-vector design ➡ cross-encoder re-ranking ➡ rule-based validation.
---
## Why Semantic Caching Matters
As natural language becomes a dominant interface for **search**, **chatbots**, **analytics assistants**, and **enterprise knowledge explorers**, systems must handle **huge query volumes** with phrasing variations but identical intents.
Delivering accurate results **without re-running LLMs** improves:
- **Speed**
- **Consistency**
- **Cost efficiency**
Semantic caching addresses this by storing **queries + answers** as vector embeddings. This enables **intent-based reuse**, unlike traditional string-based caches.
---
## Production Context
In sensitive domains (e.g. **banking**), robust semantic caching demands **model evaluation** + **architecture layering**:
1. **Query Preprocessing**
2. Domain-specialized embeddings
3. Multi-vector retrieval
4. Cross-encoder re-ranking
5. Rule-based validation
---
For content creators, similar **semantic reuse** principles apply when scaling AI-generated content. Tools like [AiToEarn](https://aitoearn.ai/) offer:
- AI generation
- Cross-platform publishing
- Analytics
- Model ranking
Publish simultaneously to **Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X (Twitter)**.
---
## Failure Case Study: When Semantic Caching Goes Wrong
### The Plan
- Pick a proven **bi‑encoder** model
- Set reasonable **similarity thresholds**
- Let the system self-optimize via user interaction
### The Reality
The system returned **confident but wrong answers**:
- *Close account* ➡ **Automatic payment cancellation** (84.9% confidence)
- *Cancel card* ➡ **Investment account closure** (88.7%)
- *Find ATM* ➡ **Loan balance check** (80.9%)
Despite state-of-the-art models and a similarity threshold of `0.7`, false positives reached **99%** in the worst case.
---
## Recovery Path: Failure ➡ Production Success
We tested:
- **7 bi-encoder models**
- **4 experiments**
- **1,000 real banking queries**
**Core Insight**:
> Cache **design quality** impacts false positives more than **model tuning**.
---
## Experimental Methodology
### System Architecture
- **Query-to-query** semantic cache via [FAISS](https://faiss.ai/)
- Hit ➡ retrieve gold-standard answer
- Miss ➡ generate via LLM + store

*Figure 1: High-Level Semantic Caching Architecture*
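To make the flow concrete, here is a minimal sketch of such a query-to-query cache built on FAISS and sentence-transformers. The model choice, the `answer_with_llm` helper, and the `0.7` cutoff are illustrative assumptions, not the exact production configuration.

```python
# Minimal query-to-query semantic cache (sketch; parameters are illustrative).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # one of the compact bi-encoders tested below
dim = model.get_sentence_embedding_dimension()
index = faiss.IndexFlatIP(dim)                    # inner product == cosine on normalized vectors
cached_answers: list[str] = []                    # answer i corresponds to FAISS row i
THRESHOLD = 0.7                                   # baseline similarity cutoff from Experiment 1

def embed(text: str) -> np.ndarray:
    return model.encode([text], normalize_embeddings=True).astype("float32")

def answer_with_llm(query: str) -> str:
    """Hypothetical stand-in for the real LLM call."""
    return f"LLM answer for: {query}"

def lookup(query: str) -> str:
    q = embed(query)
    if index.ntotal > 0:
        scores, ids = index.search(q, 1)          # top-1 nearest cached query
        if scores[0][0] >= THRESHOLD:             # cache hit: reuse the stored answer
            return cached_answers[ids[0][0]]
    answer = answer_with_llm(query)               # cache miss: generate, then store
    index.add(q)
    cached_answers.append(answer)
    return answer

lookup("How do I close my checking account?")     # miss -> LLM + store
lookup("how can i shut down my checking account") # likely a cache hit
```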
### Test Modes
- **Incremental Mode** — starts empty and is populated on cache misses
- **Pre-cached Mode** — preloaded with 100 gold answers + 300 distractors
**Infrastructure**
- AWS g4dn.xlarge (NVIDIA T4 GPU, 16GB)
- Dataset: 100 FAQs across 10 banking domains × 10 phrasing variations = 1,000 queries
- Metrics: Cache Hit%, LLM Hit%, FP%, Recall@1, Recall@3, Latency
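For reference, a sketch of how these metrics can be computed from per-query logs. The record fields are hypothetical; the FP and Recall definitions below are inferred from the result tables (FP% counts confident-but-wrong cache hits over all queries, so Cache Hit% ≈ FP% + R@1%).

```python
# Each record is assumed to look like:
# {"gold_id": 17, "cache_hit": True, "returned_ids": [17, 4, 9], "latency_ms": 7.2}
def summarize(records):
    n = len(records)
    hits = [r for r in records if r["cache_hit"]]
    correct1 = [r for r in hits if r["returned_ids"][:1] == [r["gold_id"]]]
    correct3 = [r for r in hits if r["gold_id"] in r["returned_ids"][:3]]
    return {
        "Cache Hit%": 100 * len(hits) / n,
        "LLM Hit%":   100 * (n - len(hits)) / n,              # misses fall through to the LLM
        "FP%":        100 * (len(hits) - len(correct1)) / n,  # served a cached answer for the wrong FAQ
        "Recall@1":   100 * len(correct1) / n,
        "Recall@3":   100 * len(correct3) / n,
        "Latency (ms)": sum(r["latency_ms"] for r in records) / n,
    }
```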
---
## Dataset Engineering
Each FAQ ➡ 10 variations (formal, casual, slang, typo, etc.)
Plus **distractors**, binned by target similarity to the gold query:
- **Topical neighbors** (similarity 0.8–0.9)
- **Semantic near misses** (similarity 0.85–0.95)
- **Cross-domain** (similarity 0.6–0.8)

This ensured realistic user-query diversity.
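A sketch of how each FAQ entry and its distractor bands could be represented and sanity-checked. The example queries and band names are illustrative; the similarity ranges are the ones listed above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any of the tested bi-encoders works here

gold = "How do I close my savings account?"
variations = [
    "how can i shut my savings account",                       # casual
    "Pls close my savins acct",                                # slang + typo
    "I would like to terminate my savings account, please.",   # formal
]
distractors = {                                    # band -> (example query, target similarity range)
    "topical_neighbor":   ("How do I close my checking account?", (0.80, 0.90)),
    "semantic_near_miss": ("How do I freeze my savings account?", (0.85, 0.95)),
    "cross_domain":       ("How do I check my loan balance?",     (0.60, 0.80)),
}

gold_vec = model.encode(gold, normalize_embeddings=True)
for v in variations:
    sim = float(util.cos_sim(gold_vec, model.encode(v, normalize_embeddings=True)))
    print(f"variation: sim={sim:.2f}")
for band, (text, (lo, hi)) in distractors.items():
    sim = float(util.cos_sim(gold_vec, model.encode(text, normalize_embeddings=True)))
    status = "in band" if lo <= sim <= hi else "out of band -> regenerate"
    print(f"{band:>18}: sim={sim:.2f} ({status})")
```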
---
## Model Groups Tested
- **Compact**: all-MiniLM-L6-v2, jina-embeddings-v2-base-en
- **Large**: e5-large-v2, mxbai-embed-large-v1, bge-m3
- **Specialized**: Qwen3-Embedding-0.6B, instructor-large
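All seven models can be benchmarked behind the same embedding interface. A sketch, assuming the usual Hugging Face repo IDs and noting model-specific quirks as comments:

```python
from sentence_transformers import SentenceTransformer

# Assumed Hugging Face repo IDs for the seven bi-encoders under test.
MODEL_GROUPS = {
    "compact": [
        "sentence-transformers/all-MiniLM-L6-v2",
        "jinaai/jina-embeddings-v2-base-en",      # needs trust_remote_code=True
    ],
    "large": [
        "intfloat/e5-large-v2",                   # expects "query: " / "passage: " prefixes
        "mixedbread-ai/mxbai-embed-large-v1",
        "BAAI/bge-m3",
    ],
    "specialized": [
        "Qwen/Qwen3-Embedding-0.6B",
        "hkunlp/instructor-large",                # designed for instruction-prefixed embeddings
    ],
}

def load(repo_id: str) -> SentenceTransformer:
    return SentenceTransformer(repo_id, trust_remote_code=True)
```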
---
## Experiment 1 — Zero-Shot Baseline: "False Positive Crisis"

*Figure 2: Baseline Flow*
**Results:**
| Model | Threshold | Cache Hit% | LLM Hit% | FP% | R@1% | R@3% | Latency (ms) |
|--|--|--|--|--|--|--|--|
| all-MiniLM-L6-v2 | 0.7 | 60.80 | 39.20 | 19.30 | 41.50 | 45.70 | 7.19 |
| e5-large-v2 | 0.7 | 99.90 | 0.10 | 99.00 | 0.90 | 0.90 | 18.31 |
| ... | ... | ... | ... | ... | ... | ... | ... |
**Takeaway**: Zero-shot embeddings without domain adaptation ➡ high FP rates (up to 99%).
---
## Experiment 2 — Threshold Optimization

*Figure 3: Threshold Optimization*
| Model | Threshold | Cache Hit% | LLM Hit% | FP% | R@1% | R@3% | Latency (ms) |
|--|--|--|--|--|--|--|--|
| bge-m3 | 0.8 | 64.90 | 35.10 | 19.00 | 45.90 | 51.60 | 19.79 |
| instructor-large | 0.93 | 63.70 | 36.30 | 14.10 | 49.60 | 53.40 | 22.63 |
**Takeaway**: Raising the similarity thresholds reduced false positives, but not far enough — and the trade-off was more LLM calls.
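Mechanically, threshold optimization is a sweep over the similarity cutoff per model, reading FP% against Cache Hit% off the curve. A minimal sketch, assuming per-query top-1 similarities and correctness labels have already been logged:

```python
import numpy as np

def sweep_thresholds(top1_sims, top1_correct, thresholds=np.arange(0.60, 0.96, 0.01)):
    """top1_sims[i]: best similarity for query i; top1_correct[i]: best candidate matches the gold FAQ."""
    sims = np.asarray(top1_sims)
    ok = np.asarray(top1_correct, dtype=bool)
    for t in thresholds:
        hit = sims >= t                            # queries answered from the cache at this cutoff
        cache_hit = 100 * hit.mean()
        fp = 100 * (hit & ~ok).mean()              # confident-but-wrong cache answers
        r1 = 100 * (hit & ok).mean()
        print(f"t={t:.2f}  cache_hit={cache_hit:5.1f}%  FP={fp:5.1f}%  R@1={r1:5.1f}%")
```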
---
## Experiment 3 — Best Candidate Principle
**Idea**:
> *Optimal candidates trump optimized selection algorithms.*
### Cache Design:
- 100 gold FAQs (cover all domains)
- 300 distractors (ratio 3:1)

*Figure 4: Principle Illustration*
False positives dropped by up to **59%**, and cache hit rates rose to 68–85%.
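One way to read this design (a sketch, not the exact implementation): seed the index with curated gold entries and with the distractor queries themselves, so near-miss traffic matches a "do not serve" distractor instead of silently colliding with a gold answer. It reuses the hypothetical `embed`, `index`, and `cached_answers` objects from the Figure 1 sketch.

```python
# Pre-cache curated candidates: gold FAQ queries with answers, distractors with no answer.
gold_entries = [            # 100 in the real setup; two shown for illustration
    ("How do I close my savings account?", "To close a savings account, visit a branch or ..."),
    ("Where is the nearest ATM?",          "Use the mobile app's ATM locator or ..."),
]
distractor_entries = [      # 300 in the real setup (3:1 ratio)
    ("How do I freeze my savings account?", None),   # None => route to the LLM, never serve a gold answer
]

for query, answer in gold_entries + distractor_entries:
    index.add(embed(query))
    cached_answers.append(answer)

# In lookup(), a matched entry whose answer is None is treated as a miss and sent to the LLM.
```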
---
## Experiment 4 — Cache Quality Control
A quality gate prevents polluting the cache with:
- Tiny queries
- Typos
- Vague wording

*Figure 5: QC Mechanism*
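A minimal sketch of the kind of admission filter this implies; the specific rules and cutoffs are illustrative assumptions, not the production values.

```python
import re

MIN_WORDS = 3                                              # illustrative cutoff
VAGUE_ONLY = {"help", "info", "question", "issue", "problem", "please"}

def cacheable(query: str) -> bool:
    """Return True only if a query is clean enough to become a cache key."""
    words = re.findall(r"[a-z']+", query.lower())
    if len(words) < MIN_WORDS:                             # tiny queries carry too little intent
        return False
    if set(words) <= VAGUE_ONLY:                           # purely vague wording
        return False
    alpha_ratio = sum(c.isalpha() for c in query) / max(len(query), 1)
    if alpha_ratio < 0.6:                                  # mostly digits/symbols: likely noise
        return False
    return True                                            # a spell-check pass could catch typos here

assert not cacheable("help please")
assert cacheable("How do I close my savings account?")
```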
### Results:
FP stayed mostly below 6%; instructor-large performed best at **3.8% FP** (a 96.2% reduction from the 99% worst-case baseline).
---
## Conclusion
**From 99% FP ➡ 3.8% FP** via:
1. **Best Candidate Principle**
2. Threshold tuning
3. Quality control

*Figure 6: Post-optimization Performance*
---
## Model Recommendations
- **Highest Accuracy**: `instructor-large`
- **Balanced**: `bge-m3`
- **Fastest**: `all-MiniLM-L6-v2`
- **Avoid**: `e5-large-v2` (high FP)
---
## Remaining Challenges
Patterns still causing FP:
- **Semantic granularity** (credit vs debit)
- **Intent misclassification**
- **Context loss** (ignoring qualifiers)
---
## Roadmap for <2% FP
- Advanced query pre-processing
- Fine-tuned domain models
- Multi-vector approach
- Cross-encoder re-ranking
- Rule-based domain validation
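Of these, cross-encoder re-ranking is the easiest to prototype: the bi-encoder proposes the top-k cached candidates, and a cross-encoder then scores each (query, cached query) pair jointly before anything is served. A sketch, assuming the general-purpose `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint and an illustrative cutoff; a domain fine-tuned re-ranker would replace both in production.

```python
from typing import Optional
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
RERANK_CUTOFF = 0.5                    # illustrative; tune on labeled banking pairs

def rerank(query: str, candidate_queries: list[str]) -> Optional[int]:
    """Return the index of the best cached candidate, or None to treat the lookup as a miss."""
    if not candidate_queries:
        return None
    scores = reranker.predict([(query, c) for c in candidate_queries])
    best = int(max(range(len(candidate_queries)), key=lambda i: scores[i]))
    return best if scores[best] >= RERANK_CUTOFF else None
```

Only candidates that clear both the bi-encoder threshold and the cross-encoder cutoff would be served from the cache; everything else falls back to the LLM.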
---
## Key Lessons for Any RAG System
- **Cache design > model tuning**
- Preprocessing critical: garbage in = garbage out
- Threshold tuning has limits
---
## Final Advice
Fix the **cache architecture** before tuning the model.
Strong caches ensure predictable, scalable, monetizable AI systems.
For integrated generation + publishing + analytics, consider [AiToEarn](https://aitoearn.ai/) — enabling creators to maintain quality and **publish simultaneously** across all major platforms, leveraging lessons from production-grade semantic caching.