
# Transcript: Multimodal RAG Systems, Vector Search, and AI Publishing

## Introduction

**Stephen Batifol**:  
Today we’re going to talk about **multimodal RAG systems** using **vLLM** and **Pixtral** from Mistral.  
We’ll also cover:

- **Vector search & vector databases**
- **Index types and trade-offs**
- **Embedding models**
- A live **demo**

**About me:**  
Former Developer Advocate at **Milvus** (open-source vector database),  
now at **Black Forest Labs**, an image generation company.

---

## What Is Vector Search?

### Definition & Importance
Vector search has been heavily promoted recently, but many still don’t fully understand it.  
The key idea: **vectors unlock your unstructured data**.

Examples of unstructured data:
- Images
- Audio
- Video
- Documents

These are processed through **deep learning models** to create **vector embeddings**,  
stored in a vector database for:

- Search
- RAG (Retrieval-Augmented Generation)
- Other advanced tasks (e.g., **drug discovery**, recommendations)

---

### How It Works
1. **Transform unstructured data** into high-dimensional embeddings (commonly hundreds to a few thousand dimensions).
2. **Store the embeddings** in a vector database.
3. **Run nearest-neighbor search** to retrieve semantically similar items (sketched in code below).

Example:
- "Banana" text and image are close in embedding space.
- Cats and dogs near each other but far from fruit.
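
A minimal sketch of the three steps, assuming the `sentence-transformers` library and the open `all-MiniLM-L6-v2` model (384 dims); the brute-force cosine search here stands in for what a vector database does at scale:

```python
# Minimal sketch: embed text, then run a brute-force nearest-neighbor search.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

docs = ["a ripe banana", "a sleeping cat", "a playful dog"]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # steps 1 + 2

query_vec = model.encode(["yellow fruit"], normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals cosine similarity (step 3).
scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(docs[best], float(scores[best]))  # expect "a ripe banana" to rank first
```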

---

## Index Types & Trade-offs

### FLAT
- Brute-force search: compare query to all vectors.
- Accurate but slow at large scale.

### IVF (Inverted File Index)
- Clusters vectors around centroids.
- Search within closest clusters only.
- Faster, tunable, efficient for large datasets.

### HNSW (Hierarchical Navigable Small World Graphs)
- Graph-based, fast querying.
- Slower to build/rebuild.
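
These trade-offs are easy to feel in code. A sketch using FAISS as a stand-in for the index types a vector database exposes (dataset and parameter values are illustrative):

```python
# Sketch: FLAT vs. IVF vs. HNSW on random vectors (FAISS as a stand-in).
import faiss
import numpy as np

d = 128
xb = np.random.rand(100_000, d).astype("float32")  # database vectors
xq = np.random.rand(10, d).astype("float32")       # query vectors

# FLAT: exact brute-force search; accurate, but O(n) work per query.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# IVF: cluster into 256 centroids, then probe only the 8 nearest clusters.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8  # more probes = better recall, slower queries

# HNSW: graph with 32 links per node; fast queries, slower to build.
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

for name, index in [("FLAT", flat), ("IVF", ivf), ("HNSW", hnsw)]:
    distances, ids = index.search(xq, 5)  # top-5 neighbors per query
    print(name, ids[0])
```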

---

### Additional Index Types
- **GPU Index**: high throughput and low latency, but costly.
- **DiskANN**: Disk-based, cost-efficient, higher latency.

**Trade-offs:**
- **Complexity vs. speed vs. accuracy**
- **Index build time** is critical at scale.

---

## Embedding Models

### Why They Matter
Good embeddings = good RAG performance.  
Even with the best LLM and DB, low-quality embeddings cripple retrieval.

---

### Choosing the Right Model
1. **Dimension compatibility** with your database (see the sketch after this list):
   - Example: `pgvector` can't index embeddings with more than 2,000 dimensions.
2. **Task-specific performance**: classification, clustering, retrieval.
3. **Language/domain suitability**.
4. **Modality**: text, image, multimodal.
5. **Open-source vs API-based**.
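
A small sketch of check 1 above, assuming `sentence-transformers`; the 2,000-dim limit mirrors the `pgvector` example:

```python
# Sketch: verify the embedding dimension fits the database's index limit.
from sentence_transformers import SentenceTransformer

DB_MAX_DIMS = 2000  # e.g., pgvector's index limit

model = SentenceTransformer("all-MiniLM-L6-v2")
dims = model.get_sentence_embedding_dimension()

if dims > DB_MAX_DIMS:
    raise ValueError(f"{dims}-dim embeddings exceed the {DB_MAX_DIMS}-dim limit")
print(f"OK: {dims}-dim embeddings fit the index")
```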

---

### Leaderboards
- Don’t pick purely by leaderboard rank.
- Example: **Gemini** embeddings (3,072 dims) may exceed your database's dimension limit where a smaller model fits.

---

## Evaluation: Beyond "Vibe Checks"

### Why Evaluate?
- Measure performance on:
  - **Embedding quality**
  - **Recall**
  - **Latency**
- Test *modules individually*:
  - Retrieval
  - Embedding
  - Generation
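
As one example of module-level testing, a pure-Python sketch of recall@k for the retrieval step (the IDs are hypothetical; ground truth would come from your evaluation set):

```python
# Sketch: recall@k, measured on the retrieval module in isolation.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of ground-truth relevant docs that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

retrieved = ["d7", "d2", "d9", "d4", "d1"]  # ranked output of the retriever
relevant = {"d2", "d4", "d8"}               # labeled ground truth
print(recall_at_k(retrieved, relevant, k=5))  # 2 of 3 found -> ~0.67
```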

---

### Common Pitfalls
- Frequent LLM swaps without testing compatibility.
- Ignoring dataset flaws.

---

## RAG (Retrieval-Augmented Generation)

### Long Context Benchmark Issues
- Models with huge advertised context windows still show accuracy drops well before the limit.
- Example: Llama 4 advertises a 10M-token context, yet scores only ~15–28% accuracy at 120K tokens on long-context benchmarks.

---

### Why RAG Matters
- Efficiently pulls **relevant context**.
- Reduces token usage.
- Improves speed & accuracy compared to pure long-context.

---

## Retrieval Strategies

### Hybrid Search
Combining:
- **Semantic search** (vector-based)
- **BM25 keyword search**

Process:
1. Store documents in vector DB & keyword index.
2. Retrieve **top K** semantic + **top K** BM25 matches.
3. Fuse results via a **re-ranker** (a lighter rank-fusion alternative is sketched below).
4. Send unified results to LLM.
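
The talk fuses results with a re-ranker; a lighter-weight alternative is reciprocal rank fusion (RRF), sketched below in pure Python (document IDs are hypothetical):

```python
# Sketch: reciprocal rank fusion (RRF) of semantic and BM25 result lists.
# A learned re-ranker (as in the talk) is heavier but often more accurate.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["d3", "d1", "d7"]  # top-K from vector search
bm25_hits = ["d1", "d9", "d3"]      # top-K from keyword search
print(rrf([semantic_hits, bm25_hits]))  # d1 and d3 rise to the top
```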

---

### Metadata Filtering
- Store & use source, author, date, region, domain.
- Better precision, less noise.
- Example: Refund policy filtered by region.
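
A sketch of the refund-policy example against Milvus, assuming a `support_docs` collection with a `region` scalar field already exists (pymilvus `MilvusClient` API; collection name, field names, and URI are placeholders):

```python
# Sketch: vector search with a metadata filter in Milvus.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

client = MilvusClient(uri="http://localhost:19530")
model = SentenceTransformer("all-MiniLM-L6-v2")

query_vec = model.encode("What is the refund policy?").tolist()

results = client.search(
    collection_name="support_docs",   # hypothetical collection
    data=[query_vec],
    filter='region == "EU"',          # only consider EU documents
    limit=3,
    output_fields=["text", "region"],
)
for hit in results[0]:
    print(hit["distance"], hit["entity"]["text"])
```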

---

## Advanced RAG: Agentic & GraphRAG

- **GraphRAG**: RAG + knowledge graph for relational queries.
- Trade-offs: latency and data transformation overhead.

---

## Building a Self-Hosted Multimodal RAG System

### Stack:
- **Milvus**: Vector DB
- **vLLM**: Model serving
- **Koyeb**: Infrastructure
- **Pixtral**: Multimodal LLM

---

### Why Multimodal?
- Capture visuals/audio lost in text-only RAG.
- Enables image queries, illustrative context.
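
A sketch of a single multimodal query against this stack: vLLM serves an OpenAI-compatible endpoint, so a retrieved image can be passed to Pixtral alongside the question (endpoint and image URLs are placeholders; assumes `vllm serve mistralai/Pixtral-12B-2409` is running):

```python
# Sketch: ask Pixtral (served by vLLM's OpenAI-compatible API) about a
# retrieved image.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            # URL of an image returned by the retrieval step (placeholder).
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```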

---

## Infrastructure Notes

### Koyeb:
- Autoscaling
- Scale-to-zero
- Globally distributed

### Milvus:
- Scales to 100B vectors
- Filtering, bulk import, disk/GPU indexing

### Pixtral:
- Native multimodal
- Custom vision encoder
- Flexible aspect ratios

---

## vLLM Advantages
- High performance
- End-to-end optimization
- Multiple hardware compatibility
- Open source

---

## Deployment Challenges

### Latency & Throughput
- Balance **response time** vs. **batch capacity**.
- Use **dynamic batching** to improve scale.

---

### GPU Memory Planning

Weight memory ≈ model size (B params) × bytes per param

Example (FP16 = 2 bytes per param):
- Llama 3 8B → ~16 GB of GPU memory
- Pixtral 12B → ~24 GB of GPU memory
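
The same arithmetic as a tiny helper; it counts weights only, so KV cache and activations come on top and the result is a lower bound:

```python
# Sketch: lower-bound GPU memory needed for model weights.
def weight_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    """params × bytes/param; excludes KV cache and activation memory."""
    return params_billions * bytes_per_param

print(weight_memory_gb(8, 2))   # Llama 3 8B in FP16  -> 16.0 GB
print(weight_memory_gb(12, 2))  # Pixtral 12B in FP16 -> 24.0 GB
print(weight_memory_gb(8, 1))   # same model in 8-bit ->  8.0 GB
```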

---

### Parallelism Strategies
1. **Replication**: run a full copy of the model on each GPU; simple, and works when the model fits on a single GPU.
2. **Sharding (tensor parallelism)**: split layers/weight matrices across GPUs for better utilization of large models (see the vLLM sketch below).
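
In vLLM, sharding is one engine argument. A sketch assuming two GPUs and the Pixtral checkpoint from this talk's stack:

```python
# Sketch: tensor parallelism in vLLM -- shard one model across 2 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",   # Pixtral ships a Mistral-format tokenizer
    tensor_parallel_size=2,     # split weight matrices across 2 GPUs
)

outputs = llm.generate(["Summarize vector search in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```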

---

## Inference Optimization

1. **Quantization** to reduce memory footprint:
   - Weight-only (e.g., AWQ, GPTQ)
   - Weight + activation
2. **KV Cache Optimizations**:
   - Avoid wasted memory via **paged attention** (see the vLLM sketch below).
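
Both levers are plain vLLM engine arguments; a sketch with illustrative values (the AWQ checkpoint name is hypothetical, and paged attention is vLLM's default KV-cache layout):

```python
# Sketch: quantization and KV-cache settings as vLLM engine arguments.
from vllm import LLM

llm = LLM(
    model="some-org/llama-3-8b-awq",  # hypothetical AWQ-quantized checkpoint
    quantization="awq",               # weight-only 4-bit quantization
    gpu_memory_utilization=0.90,      # fraction of VRAM vLLM may claim;
                                      # most of it goes to the paged KV cache
    enable_prefix_caching=True,       # reuse KV cache across shared prefixes
)
```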

---

## Demo Recap
- Streamlit UI
- Indexed metadata
- Queries tested
- Retrieval failures → illustrate importance of correct indexing/embedding

---

## Q&A Highlights
- CAG (Cache-Augmented Generation)
- Milvus interface: *Attu*
- Chunking strategies: depends on doc type & content
- PDFs with images → multimodal embedding approaches (e.g., ColPali)

---

## Key Takeaways
- **Right index** choice per use case.
- **Embedding selection** matters as much as LLM choice.
- **Hybrid retrieval** improves precision.
- **Evaluate all components**, avoid "vibe checks".
- **Optimize inference** for latency, throughput, and GPU usage.
- Integrate **multimodal** capabilities where context demands it.

---

> **For creators and developers**, platforms like the [AiToEarn official site](https://aitoearn.ai/) provide an **open-source global AI content monetization ecosystem**:
> - Generate AI-powered content
> - Publish simultaneously across major platforms:
>   Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter
> - Analytics & [AI model rankings](https://rank.aitoearn.ai)
> - Connect multimodal RAG outputs to **audience monetization**

Explore:
- [AiToEarn blog](https://blog.aitoearn.ai)
- [AiToEarn open-source repository](https://github.com/yikart/AiToEarn)
- [AI model rankings](https://rank.aitoearn.ai)
