# Transcript: Deploying a Multimodal RAG System with vLLM
## Introduction
**Stephen Batifol**:
Today we’re going to talk about **multimodal RAG systems** using **vLLM** and **Pixtral** from Mistral.
We’ll also cover:
- **Vector search & vector databases**
- **Index types and trade-offs**
- **Embedding models**
- A live **demo**
**About me:**
Former Developer Advocate at **Milvus** (open-source vector database),
now at **Black Forest Labs**, an image generation company.
---
## What Is Vector Search?
### Definition & Importance
Vector search has been heavily promoted recently, but many still don’t fully understand it.
The key idea: **vectors unlock your unstructured data**.
Examples of unstructured data:
- Images
- Audio
- Video
- Documents
These are processed through **deep learning models** to create **vector embeddings**,
stored in a vector database for:
- Search
- RAG (Retrieval-Augmented Generation)
- Other advanced tasks (e.g., **drug discovery**, recommendations)
---
### How It Works
1. **Transform unstructured data** into high-dimensional embeddings (often 2K–3K dimensions).
2. **Store embeddings** in a vector database.
3. **Nearest neighbor search** to retrieve semantically similar items.
Example:
- The text "banana" and an image of a banana land close together in embedding space.
- Cat and dog embeddings sit near each other, but far from fruit.
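To make steps 1–3 concrete, here is a minimal sketch of the embed-store-search loop with an open-source text embedding model (the model name and toy corpus are illustrative, not from the talk):

```python
# Minimal embed -> store -> nearest-neighbor loop (in-memory, no DB yet).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim text embeddings

docs = ["a ripe banana", "a photo of a cat", "a playful dog"]
doc_vecs = model.encode(docs, normalize_embeddings=True)        # shape (3, 384)
query_vec = model.encode(["yellow fruit"], normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vecs @ query_vec.T
print(docs[int(np.argmax(scores))])  # -> "a ripe banana"
```

In production the document vectors live in a vector database rather than in memory, which is where the index types below come in.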
---
## Index Types & Trade-offs
### FLAT
- Brute-force search: compare query to all vectors.
- Accurate but slow at large scale.
### IVF (Inverted File Index)
- Clusters vectors around centroids.
- Search within closest clusters only.
- Faster, tunable, efficient for large datasets.
### HNSW (Hierarchical Navigable Small World Graphs)
- Graph-based, fast querying.
- Slower to build/rebuild.
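In Milvus, these trade-offs surface directly as index parameters. A sketch, assuming a running local Milvus instance and an existing collection named `documents` with a vector field `embedding` (names are illustrative):

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # local Milvus
collection = Collection("documents")                 # illustrative name

# IVF_FLAT: cluster vectors into nlist buckets; queries probe only the
# closest buckets, trading a little recall for a lot of speed.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 1024},
    },
)

# HNSW alternative (faster queries, slower/costlier to build):
#   {"index_type": "HNSW", "metric_type": "L2",
#    "params": {"M": 16, "efConstruction": 200}}
# FLAT alternative (exact brute force, no tuning knobs):
#   {"index_type": "FLAT", "metric_type": "L2"}
```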
---
### Additional Index Types
- **GPU Index**: high throughput and low query latency, but costly.
- **DiskANN**: Disk-based, cost-efficient, higher latency.
**Trade-offs:**
- **Complexity vs. speed vs. accuracy**
- **Index build time** is critical at scale.
---
## Embedding Models
### Why They Matter
Good embeddings = good RAG performance.
Even with the best LLM and DB, low-quality embeddings cripple retrieval.
---
### Choosing the Right Model
1. **Dimension compatibility** with your database:
- Example: `pgvector` can't index embeddings larger than 2,000 dimensions (a quick check is sketched after this list).
2. **Task-specific performance**: classification, clustering, retrieval.
3. **Language/domain suitability**.
4. **Modality**: text, image, multimodal.
5. **Open-source vs API-based**.
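The dimension check in point 1 is easy to automate before committing to a model; a minimal sketch (the model name is an illustrative stand-in for your candidate):

```python
from sentence_transformers import SentenceTransformer

DB_MAX_DIMS = 2000  # e.g., pgvector's index limit

model = SentenceTransformer("all-MiniLM-L6-v2")  # candidate model
dims = model.get_sentence_embedding_dimension()

if dims > DB_MAX_DIMS:
    raise ValueError(f"{dims}-dim embeddings exceed the {DB_MAX_DIMS}-dim limit")
print(f"OK: {dims} dimensions")
```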
---
### Leaderboards
- Don’t pick purely by rank.
- Weigh a high-dimensional model like **Gemini** (3,072 dims) against smaller models for database compatibility.
---
## Evaluation: Beyond "Vibe Checks"
### Why Evaluate?
- Measure performance on:
- **Embedding quality**
- **Recall**
- **Latency**
- Test *modules individually*:
- Retrieval
- Embedding
- Generation
---
### Common Pitfalls
- Frequent LLM swaps without testing compatibility.
- Ignoring dataset flaws.
---
## RAG (Retrieval-Augmented Generation)
### Long Context Benchmark Issues
- Models with high advertised token capacity still show accuracy drops well before those limits.
- Example: Llama 4 promises 10M tokens but only ~15–28% accuracy at 120K tokens.
---
### Why RAG Matters
- Efficiently pulls **relevant context**.
- Reduces token usage.
- Improves speed & accuracy compared to pure long-context.
---
## Retrieval Strategies
### Hybrid Search
Combining:
- **Semantic search** (vector-based)
- **BM25 keyword search**
Process:
1. Store documents in vector DB & keyword index.
2. Retrieve **top K** semantic + **top K** BM25 matches.
3. Fuse via **re-ranker**.
4. Send unified results to LLM.
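The talk fuses the two result lists with a re-ranker; a lighter-weight alternative is reciprocal rank fusion (RRF), sketched here self-contained with toy document IDs:

```python
# Fuse ranked lists of doc IDs with reciprocal rank fusion (RRF).
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc3", "doc1", "doc7"]  # top-K from the vector index
bm25_hits = ["doc1", "doc9", "doc3"]      # top-K from the keyword index

print(rrf_fuse([semantic_hits, bm25_hits]))  # docs in both lists rise to the top
```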
---
### Metadata Filtering
- Store & use source, author, date, region, domain.
- Better precision, less noise.
- Example: Refund policy filtered by region.
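In Milvus, the filter is a boolean expression attached to the vector search itself. A sketch, reusing the illustrative `documents` collection and embedding model from the earlier sketches:

```python
from pymilvus import Collection
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = Collection("documents")  # assumes scalar fields stored with vectors

results = collection.search(
    data=[model.encode("What is the refund policy?").tolist()],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=5,
    expr='region == "EU"',                       # metadata filter
    output_fields=["source", "author", "date"],  # return metadata with each hit
)
```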
---
## Advanced RAG: Agentic & GraphRAG
- **GraphRAG**: RAG + knowledge graph for relational queries.
- Trade-offs: latency and data transformation overhead.
---
## Building a Self-Hosted Multimodal RAG System
### Stack:
- **Milvus**: Vector DB
- **vLLM**: Model serving
- **Koyeb**: Infrastructure
- **Pixtral**: Multimodal LLM
---
### Why Multimodal?
- Capture visuals/audio lost in text-only RAG.
- Enables image queries, illustrative context.
---
## Infrastructure Notes
### Koyeb:
- Autoscaling
- Scale-to-zero
- Globally distributed
### Milvus:
- Scales to 100B vectors
- Filtering, bulk import, disk/GPU indexing
### Pixtral:
- Native multimodal
- Custom vision encoder
- Flexible aspect ratios
---
## vLLM Advantages
- High performance
- End-to-end optimization
- Broad hardware compatibility
- Open source
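A minimal sketch of loading Pixtral through vLLM's offline Python API (the image URL is a placeholder; Mistral models are loaded with `tokenizer_mode="mistral"`):

```python
from vllm import LLM

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/chart.png"}},  # placeholder
    ],
}]
outputs = llm.chat(messages)  # OpenAI-style chat interface
print(outputs[0].outputs[0].text)
```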
---
## Deployment Challenges
### Latency & Throughput
- Balance **response time** vs. **batch capacity**.
- Use **dynamic batching** to improve scale.
---
### GPU Memory Planning
Rule of thumb: **weight memory ≈ model size (B params) × bytes per param**.
Example (FP16, 2 bytes per parameter):
- Llama 3 8B → ~16 GB of GPU memory
- Pixtral 12B → ~24 GB of GPU memory
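The same arithmetic as code, with the caveat that weights are only part of the budget:

```python
# Back-of-the-envelope weight memory: params (billions) x bytes per param.
# 1B params at 1 byte each is ~1 GB, so FP16 (2 bytes) doubles that.
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    return params_billion * bytes_per_param

print(weight_memory_gb(8))   # Llama 3 8B, FP16 -> ~16 GB
print(weight_memory_gb(12))  # Pixtral 12B, FP16 -> ~24 GB
# Note: the KV cache and activations need additional headroom on top of this.
```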
---
### Parallelism Strategies
1. **Replication**: run a full copy of the model on each GPU; simple, suits small batch sizes when the model fits on one GPU.
2. **Sharding (tensor parallelism)**: split layers/matrices across GPUs (better utilization).
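In vLLM, sharding is a single constructor argument; a sketch assuming two visible GPUs:

```python
from vllm import LLM

# Tensor parallelism: split each layer's weight matrices across 2 GPUs.
llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    tensor_parallel_size=2,
)
# Replication, by contrast, would run one independent instance per GPU
# behind a load balancer, with no cross-GPU communication.
```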
---
## Inference Optimization
1. **Quantization** to reduce RAM:
- Weight-only
- Weight+activation
2. **KV Cache Optimizations**:
- Avoid waste via **paged attention**.
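Both optimizations map to vLLM constructor knobs; a sketch with an illustrative pre-quantized checkpoint:

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",            # weight-only quantization
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may claim
)
# Paged attention is vLLM's default KV-cache manager: the cache is allocated
# in fixed-size blocks on demand instead of reserving max-length buffers.
```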
---
## Demo Recap
- Streamlit UI
- Indexed metadata
- Queries tested
- Retrieval failures → illustrate importance of correct indexing/embedding
---
## Q&A Highlights
- CAG (Cache-Augmented Generation)
- Milvus interface: *Attu*
- Chunking strategies: depends on doc type & content
- PDFs with images → multimodal embedding approaches (e.g., ColPali)
---
## Key Takeaways
- **Right index** choice per use case.
- **Embedding selection** matters as much as LLM choice.
- **Hybrid retrieval** improves precision.
- **Evaluate all components**, avoid "vibe checks".
- **Optimize inference** for latency, throughput, and GPU usage.
- Integrate **multimodal** capabilities where context demands it.
---