🔍 RAG System Building

A complete guide to building a production-grade RAG system from scratch
Document parsing · Vector search · Reranking · LLM integration · Frontend UI

🤔 Why This Repo?

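Say you want to build an "upload a PDF → ask questions about it" system. After searching around, you find:

Tutorials are too shallow: they stop at langchain.load_qa_chain(), which does not hold up in production
Solutions are too heavy: they start with Milvus + Kubernetes, which is more than a personal project can carry
Comparisons are missing: vector search vs keyword search vs hybrid search, which one should you pick?
Pitfalls are everywhere: embedding dimension mismatches, Chinese text split mid-word, reranker timeouts, and so on

This repo lays out the full path from zero to production, with code, comparisons, and pitfall notes at every step.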

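📖 Table of Contents

Document	Contents
Architecture	RAG system architecture design (data flow, module breakdown, technology choices)
Document Parsing	Document parsing approaches (PDF/DOCX/TXT, table extraction, OCR)
Vector Search	Vector search approaches (FAISS/ChromaDB/Milvus, embedding model selection)
Reranking	Reranking strategy comparison (TF-IDF vs CrossEncoder vs LLM, with performance benchmarks)
LLM Integration	LLM integration options (Tongyi Qianwen (Qwen)/DeepSeek/OpenAI/Ollama)
Performance	Performance optimization guide (latency analysis, bottleneck identification, optimization checklist)
Common Pitfalls	Pitfall log (30+ real problems and their solutions)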

🚀 Quick Start

Minimal runnable version (5 minutes)

git clone https://github.com/Vincent-crypto-coder/rag-system-building.git
cd rag-system-building
pip install -r requirements.txt

# Set the API key
export DASHSCOPE_API_KEY=sk-your-key

# Start the app
python main.py

Tech Stack

Component	Recommended	Alternatives
Web backend	FastAPI	Flask
Frontend UI	Streamlit	Gradio
Vector database	FAISS (lightweight)	ChromaDB, Milvus
PDF parsing	PyMuPDF + pdfplumber	PyPDF2
Embedding	BAAI/bge-small-zh (local)	text-embedding-3-small
LLM	Tongyi Qianwen qwen-turbo	DeepSeek, Ollama
Reranking	TF-IDF (fast)	CrossEncoder (accurate)
Chat history	SQLite	Redis

🏗️ Architecture

┌─────────────────────────────────────────────────────┐
│                  Ingestion Pipeline                   │
│  PDF → Parser → Splitter → Embedding → VectorStore  │
└─────────────────────────────────────────────────────┘
                           │
                     User Query
                           ▼
┌─────────────────────────────────────────────────────┐
│                    Query Pipeline                     │
│  Query → Embedding → VectorSearch → Reranker         │
│       → Context + History → LLM → Response           │
└─────────────────────────────────────────────────────┘
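The two pipelines in the diagram can be sketched in a few lines of dependency-free Python. This is only an illustration of the data flow, not the code in this repo: `embed` is a toy character-bigram counter standing in for a real embedding model, `VectorStore` is a hypothetical in-memory list rather than FAISS, and the reranker and LLM call are left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character-bigram counts."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split(document: str, chunk_size: int = 40) -> list[str]:
    """Fixed-size splitter; a real splitter would respect sentence boundaries."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]


class VectorStore:
    """Hypothetical in-memory store holding (chunk, vector) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        self.items.extend((c, embed(c)) for c in chunks)

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        q = embed(query)
        scored = [(cosine(q, v), c) for c, v in self.items]
        return sorted(scored, reverse=True)[:top_k]


def build_prompt(query: str, hits: list[tuple[float, str]], history: list[str]) -> str:
    """Context + History step: assemble the text that would be sent to the LLM."""
    context = "\n".join(chunk for _, chunk in hits)
    past = "\n".join(history)
    return f"Context:\n{context}\n\nHistory:\n{past}\n\nQuestion: {query}"


# Ingestion pipeline: text -> splitter -> embedding -> vector store
store = VectorStore()
store.add(split("FAISS is a lightweight vector index. Streamlit renders the chat UI."))

# Query pipeline: query -> embedding -> vector search -> prompt for the LLM
question = "Which vector index is lightweight?"
hits = store.search(question, top_k=1)
prompt = build_prompt(question, hits, history=[])
```

Swapping `embed` for a real model and `VectorStore` for FAISS or ChromaDB keeps the same shape: the ingestion side writes vectors once, and the query side reads them on every request.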

⚡ Reranking Benchmark

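We compared the measured performance of three reranking approaches against a no-reranking baseline (10 document chunks, qwen-turbo):

Approach	Latency	Accuracy (vs baseline)	Best for
No reranking	0 ms	baseline	Latency-sensitive use
TF-IDF	3 ms	+12%	Recommended default
CrossEncoder	450 ms	+18%	Accuracy first
LLM Reranker	5.5 s	+22%	Ample budget

Conclusion: TF-IDF strikes the best balance between speed and accuracy. CrossEncoder suits scenarios with high accuracy requirements. LLM Reranker gives the least gain for its cost.

See benchmarks/reranking-comparison.md for details.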
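To make the TF-IDF row concrete, here is a minimal sketch of a TF-IDF reranker using only the standard library. It is not the implementation that was benchmarked; the function names `tokenize` and `tfidf_rerank`, the smoothed IDF formula, and the per-character handling of Chinese text are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercased word tokens for Latin text; single characters for CJK text.
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fa5]", text.lower())


def tfidf_rerank(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Reorder retrieved chunks by TF-IDF cosine similarity to the query."""
    docs = [Counter(tokenize(c)) for c in chunks]
    n = len(docs)
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in d)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}

    def vec(counts: Counter) -> dict[str, float]:
        # Terms unseen in every chunk are dropped; they cannot match anything.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(Counter(tokenize(query)))
    scored = [(cosine(q, vec(d)), c) for d, c in zip(docs, chunks)]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]


chunks = [
    "FAISS builds an in-memory vector index.",
    "Streamlit renders the chat interface.",
    "The reranker reorders retrieved chunks by relevance.",
]
ranked = tfidf_rerank("how does the reranker order chunks", chunks, top_k=2)
```

Because the work is a handful of dictionary operations per chunk, this kind of reranker adds very little latency, which is consistent with the ordering in the table above. The trade-off is that it only rewards exact term overlap, so paraphrases score zero.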

🔧 Templates

Template	Purpose
`templates/rag-pipeline.py`	Complete RAG pipeline (parsing → retrieval → generation)
`templates/streamlit-ui.py`	Streamlit frontend (upload + chat + history)
`templates/config.py`	Configuration management template

📦 Real-World Example

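examples/pdf-qa-system/ is a complete implementation of a PDF question-answering system:

Supports multiple formats (PDF/TXT/DOCX)
Hybrid retrieval (vector search with keyword fallback)
Switchable LLMs (Tongyi Qianwen/DeepSeek/Ollama)
Streamlit UI (upload, chat, history management)
RAGAS evaluation framework integration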

🧠 Advanced Topics

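LightRAG (knowledge-graph RAG)

When you need reasoning over entity relationships, use LightRAG to build a knowledge graph:

from lightrag import LightRAG, QueryParam  # QueryParam is required by rag.query() below
# async_llm and async_embed are async callables you supply for your LLM and embedding model
rag = LightRAG(working_dir="./data", llm_model_func=async_llm, embedding_func=async_embed)
rag.insert(document_text)
results = rag.query("xxx", param=QueryParam(mode="mix"))  # local + global

See docs/architecture.md for details.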

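Keyword fallback retrieval

When the vector search score is too low (< 0.5), keyword matching is triggered automatically:

# Pitfall in Chinese keyword extraction: do not capture the whole sentence
# ❌ re.findall(r'[\u4e00-\u9fa5]{2,}', question)  → the whole sentence becomes one keyword
# ✅ re.findall(r'[\u4e00-\u9fa5]{2,4}', question)  → short terms of 2-4 characters

See docs/vector-search.md for details.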
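The fallback rule and the regex pitfall can be tried together in a short sketch. Everything except the two regex patterns and the 0.5 cutoff is hypothetical: the helper names `extract_keywords`, `keyword_search`, and `retrieve` are invented here, and the code assumes vector scores are similarities where higher means more relevant.

```python
import re

SCORE_THRESHOLD = 0.5  # cutoff quoted above; assumes higher score = more similar


def extract_keywords(question: str) -> list[str]:
    # Greedy, non-overlapping runs of 2-4 Chinese characters.
    return re.findall(r"[\u4e00-\u9fa5]{2,4}", question)


def keyword_search(question: str, chunks: list[str]) -> list[str]:
    """Rank chunks by how many extracted keywords they contain; drop non-matches."""
    keywords = extract_keywords(question)
    scored = [(sum(k in c for k in keywords), c) for c in chunks]
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    return [c for hits, c in ranked if hits > 0]


def retrieve(question: str, vector_hits: list[tuple[float, str]], chunks: list[str]) -> list[str]:
    """vector_hits is a list of (score, chunk) pairs sorted best first."""
    if vector_hits and vector_hits[0][0] >= SCORE_THRESHOLD:
        return [c for _, c in vector_hits]
    return keyword_search(question, chunks)  # fallback path


question = "如何配置向量数据库"
chunks = ["向量数据库使用 FAISS 存储。", "前端使用 Streamlit。"]

whole_sentence = re.findall(r"[\u4e00-\u9fa5]{2,}", question)  # ['如何配置向量数据库']
short_terms = extract_keywords(question)                       # ['如何配置', '向量数据']

low_score_result = retrieve(question, [(0.31, chunks[1])], chunks)   # falls back to keywords
high_score_result = retrieve(question, [(0.82, chunks[0])], chunks)  # keeps vector hits
```

Note that the 2-4 character window cuts at fixed positions rather than at word boundaries (the trailing 库 is dropped in the example), so it is a cheap heuristic rather than real word segmentation. A segmenter such as jieba would likely give cleaner keywords at the cost of an extra dependency.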

🤝 Contributing

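PRs are welcome! Especially:

Comparison tests for new embedding models
Performance benchmarks across vector databases
Support for more languages (Japanese, Korean, etc.)
Production deployment setups (Docker, K8s)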

⭐ Star History

If this repo helped you, please give it a Star ⭐

📄 License

MIT

Made with 🔍 (obsessive RAG debugging) by Vincent
