The Evolution of RAG:
From State to Memory
Dify × Milvus Joint Tech Talk
Speaker: Zheng Li · Head of Open Source Ecosystem @ Dify
Unstructured Data Meetup · 30 minutes
The real value of RAG: give the LLM memory
- LLMs excel at compute; their weights hold static knowledge from training time.
- RAG mounts an external, dynamic memory onto the model.
- Industry view: RAG is a spectrum.
- Key question: how do we build, govern, and use that memory?
RAG spectrum (agenda)
- Naive RAG: simple "state" retrieval
- Advanced RAG: systematically upgrading the "state" quality
- Agentic RAG: turning memory into part of an agent
- Knowledge Pipeline: the production line for high-quality memory
Phase 1 · Naive RAG (simple state retrieval)
- Flow: Query → Embedding → vector search (Milvus) → chunks → LLM
- Pain: semantic breaks, noisy hits, Lost in the Middle
- Conclusion: usable but not delightful → static, low-quality "state"
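The flow above can be sketched end to end in a few lines. Everything here is a stand-in: the toy hash embedder replaces a real embedding model, the in-memory list replaces a Milvus collection, and the returned prompt is what would be sent to the LLM.

```python
import math

def _bucket(token: str, dim: int) -> int:
    # Deterministic toy hash; a stand-in for a real embedding model.
    return sum(ord(c) for c in token) % dim

def embed(text: str, dim: int = 32) -> list[float]:
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[_bucket(token.strip(".,?!"), dim)] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# In-memory chunk store standing in for a Milvus collection.
chunks = [
    "Milvus stores and indexes dense vectors at scale.",
    "Dify orchestrates LLM applications visually.",
    "RAG mounts external memory onto a language model.",
]
index = [(embed(c), c) for c in chunks]

def naive_rag(query: str, k: int = 2) -> str:
    # Query -> embedding -> vector search -> chunks -> prompt for the LLM.
    q = embed(query)
    hits = sorted(index, key=lambda iv: cosine(q, iv[0]), reverse=True)[:k]
    context = "\n".join(c for _, c in hits)
    return f"Answer using only this context:\n{context}\n\nQ: {query}"
```

Even this toy version shows the pain points: the top-k hits are whatever is closest in vector space, noise included, with no re-ranking or context hygiene.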
Phase 2 · Advanced RAG (systematic quality lift)
Three hard rules
- Hybrid recall: vectors + keyword/regex + metadata filters; candidates 100–300
- Re-rank before assembly: Cross-Encoder or LLM re-rank → top 20–40
- Respect context rot: structured, tight context beats stuffing the window
Context assembly: instruction-first, dedupe/merge, diversify sources, strict token budget
Industry tip: first-stage hybrid recall with 200–300 candidates is fine; always re-rank before assembling context.
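The three rules can be sketched with toy scorers. The dense, lexical, and re-rank functions below are stubs (assumptions, not real models), but the shape is the point: wide hybrid recall, then re-rank to a small top-k, then assembly under a hard budget with dedupe.

```python
# Toy corpus with metadata; a stand-in for a Milvus collection.
docs = [
    {"id": i, "text": t, "lang": "en"}
    for i, t in enumerate([
        "Hybrid recall combines dense vectors with keyword filters.",
        "Re-rank candidates with a cross-encoder before assembly.",
        "Context rot: long, noisy prompts degrade answer quality.",
        "Unrelated release notes for version 2.4.1.",
    ])
]

def dense_score(query: str, text: str) -> float:   # stub for vector similarity
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) ** 0.5 * len(t) ** 0.5)

def lexical_score(query: str, text: str) -> int:   # stub for BM25 / keywords
    return sum(text.lower().count(w) for w in query.lower().split())

def rerank_score(query: str, text: str) -> float:  # stub for a Cross-Encoder
    return dense_score(query, text) + 0.1 * lexical_score(query, text)

def hybrid_rag(query, filters=None, recall_n=3, top_k=2, budget_chars=200):
    # 0) metadata filters narrow the pool before any scoring
    pool = [d for d in docs
            if all(d.get(k) == v for k, v in (filters or {}).items())]
    # 1) wide first-stage hybrid recall (dense + lexical)
    pool.sort(key=lambda d: dense_score(query, d["text"])
              + lexical_score(query, d["text"]), reverse=True)
    candidates = pool[:recall_n]
    # 2) always re-rank before assembly
    candidates.sort(key=lambda d: rerank_score(query, d["text"]), reverse=True)
    # 3) assembly: dedupe by id, enforce a hard budget
    context, seen = [], set()
    for d in candidates[:top_k]:
        if d["id"] in seen or sum(len(c) for c in context) + len(d["text"]) > budget_chars:
            continue
        seen.add(d["id"])
        context.append(d["text"])
    return context
```

In production, `recall_n` would be the 100–300 candidate pool and `top_k` the 20–40 survivors; the structure stays the same.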
Dify practice
- Parent-child retrieval: hit child chunks, return parent blocks to balance precision and context
- Reranking: Milvus fast recall → re-rank → feed the LLM
- Trend: LLM-as-reranker is rising; as cost and latency drop, expect more brute-force-style information cleanup.
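Parent-child retrieval can be sketched as a two-level index. The chunking and scoring below are toy stand-ins, but the mechanic is the one described above: match on small child chunks for precision, return the enclosing parent block for context, and dedupe parents.

```python
# Parent blocks (e.g. full sections) and their small child chunks.
parents = {
    "p1": "Section 1. Milvus setup: install, create a collection, build an index.",
    "p2": "Section 2. Query flow: embed the question, search top-k, assemble context.",
}
children = [  # (child_text, parent_id); a stand-in for an indexed child-chunk table
    ("install milvus", "p1"),
    ("create a collection and index", "p1"),
    ("embed the question", "p2"),
    ("search top-k and assemble context", "p2"),
]

def score(query: str, text: str) -> int:  # stub for vector similarity
    return len(set(query.lower().split()) & set(text.lower().split()))

def parent_child_retrieve(query: str, k: int = 2) -> list[str]:
    # Hit precise child chunks...
    hits = sorted(children, key=lambda c: score(query, c[0]), reverse=True)[:k]
    # ...but return the (deduped) parent blocks for full context.
    seen, blocks = set(), []
    for _, pid in hits:
        if pid not in seen:
            seen.add(pid)
            blocks.append(parents[pid])
    return blocks
```

Two child hits in the same section collapse to one parent block, which is exactly the precision/context trade the slide describes.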
Don’t ship RAG; ship retrieval
- Problem: calling it "RAG" hides the key design trade-offs.
- Primitives: dense, lexical/regex, filters, re-rank, assembly, eval loop.
- Move: win the first phase with hybrid recall (200–300 candidates is okay).
- Discipline: always re-rank before context assembly; respect context rot.
Thoughts · The future and trade-offs of re-ranking
- Trend: LLM-as-reranker will become mainstream; specialized rerankers may fade.
- Reality: firing 300 parallel LLM re-ranks still hurts tail latency today.
- Strategy: mix models short-term (Cross-Encoder + LLM); mid-term rely on caching/sharding to tame the tail.
- Future: cheaper/faster LLMs make brute-force info cleanup viable.
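One mitigation named above, caching re-rank calls, is cheap to add. Here the LLM judge is a stub (an assumption), with `functools.lru_cache` standing in for a shared score cache so a hot query never pays the tail twice.

```python
import functools

CALLS = {"llm": 0}  # counts simulated LLM invocations

@functools.lru_cache(maxsize=10_000)
def llm_rerank_score(query: str, passage: str) -> float:
    """Stub for an expensive LLM re-rank call; cached so repeated
    (query, passage) pairs never hit the model twice."""
    CALLS["llm"] += 1
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    return sorted(passages,
                  key=lambda p: llm_rerank_score(query, p),
                  reverse=True)[:top_k]
```

Running the same query twice costs one LLM call per passage, not two; in a real deployment the cache would be keyed on content hashes and shared across replicas.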
Phase 3 · Agentic RAG (state → memory)
- RAG moves from a passive pipeline to an active tool the agent calls
- Query rewriting: clarify the ask before searching
- Multi-step / looped retrieval: decide next action from intermediate results
- Dify: turn RAG into tools inside agent orchestration: plan → retrieve → reflect → iterate
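The plan → retrieve → reflect → iterate loop can be written as a minimal sketch. `rewrite_query`, `retrieve`, and `good_enough` are stubs (assumptions) for the LLM query rewriter, the hybrid retriever, and the reflection step respectively.

```python
KNOWLEDGE = {  # stand-in for a Milvus-backed knowledge base
    "milvus index types": "Milvus supports HNSW, IVF, and DiskANN indexes.",
    "dify agent tools": "Dify exposes retrieval as a tool agents can call.",
}

def rewrite_query(question: str, attempt: int) -> str:
    # Stub for an LLM query rewriter: clarify the ask before searching.
    rewrites = ["milvus index types", "dify agent tools"]
    return rewrites[attempt % len(rewrites)]

def retrieve(query: str) -> str:
    # Stub for hybrid recall + re-rank.
    return KNOWLEDGE.get(query, "")

def good_enough(question: str, evidence: str) -> bool:
    # Stub for LLM self-reflection on intermediate results.
    return any(w in evidence.lower() for w in question.lower().split())

def agentic_rag(question: str, max_steps: int = 3) -> str:
    evidence = ""
    for step in range(max_steps):
        q = rewrite_query(question, step)    # plan: clarify the ask
        evidence = retrieve(q)               # act: retrieve
        if good_enough(question, evidence):  # reflect on the result
            return evidence                  # stop, or loop with a new query
    return evidence
```

The difference from the earlier phases is the loop: retrieval is no longer a fixed pipeline stage but an action the agent can repeat with a rewritten query.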
Foundation · Knowledge Pipeline (memory production line)
[Ingest]
- Parse + chunk (domain-aware: headings, code blocks, tables)
- Enrich: headings, anchors, symbols, metadata
- Optional: block summaries (natural-language glosses for code/API blocks)
- Embed: dense vectors + optional sparse signals
- Write to Milvus (text, vectors, metadata)
[Query]
- First-stage hybrid recall: vectors + lexical/regex + metadata filters
- Candidate pool: about 100–300 → re-rank to top 20–40
- Context assembly: instruction-first, dedupe/merge, diversify, hard token cap
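The two halves above can be sketched as one pipeline. The heading-aware chunker, token-set "embedder", and in-memory store are stand-ins for the real parser, embedding model, and Milvus, but the shape matches: ingest parses, chunks, enriches, and loads; query filters by metadata, then does first-stage recall.

```python
import re

STORE = []  # stand-in for a Milvus collection: (vector, text, metadata)

def embed(text: str) -> frozenset:
    # Toy "vector": a token set (stand-in for a dense embedding).
    return frozenset(re.findall(r"\w+", text.lower()))

def ingest(doc: str, source: str) -> None:
    # Parse + chunk on markdown-style headings (domain-aware chunking stub).
    for block in re.split(r"\n(?=#)", doc):
        heading = block.splitlines()[0].lstrip("# ").strip()
        STORE.append((embed(block), block.strip(),
                      {"source": source, "heading": heading}))  # enrich + load

def query(q, source=None, top_k=2):
    qv = embed(q)
    # Metadata filter, then first-stage recall; re-rank would follow here.
    pool = [r for r in STORE if source is None or r[2]["source"] == source]
    pool.sort(key=lambda r: len(qv & r[0]), reverse=True)
    return [r[1] for r in pool[:top_k]]
```

Because ingest writes structured records (text, vector, metadata), the same store serves filtered recall for any downstream app.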
Law: garbage in, garbage out
Outer loop: evaluation and operational feedback
- Cache + cost guardrails
- Small gold set → plug into CI + dashboards
- Error analysis: re-chunk / tweak filters / tune re-rank prompts
- Memory compaction: summarize interaction traces into retrievable facts
Tip: spend an evening (pizza night) to build a tiny gold set and wire it into CI and dashboards.
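A pizza-night gold set really can be this small. The retriever below is a stub, and recall@k over a handful of labeled queries is used as the metric (a simple, common choice here, not the only option); the final assertion is the line you would gate CI on.

```python
# Tiny gold set: (query, id of the chunk that should be retrieved).
GOLD = [
    ("how do I create an index", "c1"),
    ("what is context rot", "c2"),
]
CHUNKS = {
    "c1": "Create an index on the vector field before searching.",
    "c2": "Context rot: long noisy prompts degrade answers.",
    "c3": "Unrelated changelog entry.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    # Stub for the real retrieval stack under test.
    def score(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(CHUNKS, key=lambda cid: score(CHUNKS[cid]), reverse=True)
    return ranked[:k]

def recall_at_k(gold, k: int = 2) -> float:
    hits = sum(1 for q, cid in gold if cid in retrieve(q, k))
    return hits / len(gold)

# In CI: fail the build if retrieval quality regresses.
assert recall_at_k(GOLD) >= 0.9
```

Error analysis then follows the slide: a failing query tells you whether to re-chunk, tweak filters, or tune the re-rank prompt.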
Process once, reuse everywhere
- Decouple: knowledge processing ↔ app development
- Reuse: one Milvus knowledge base serves multiple Dify apps
- Quality: govern memory in one place, raise the ceiling for all downstream apps
Dify × Milvus: division of labor
Milvus = memory foundation
- Store/index/recall vectors and metadata efficiently
- Stable, reliable, scalable
Dify = memory + app platform
- Knowledge pipeline: build/manage/optimize memory (write to Milvus)
- Application engine: orchestrate and use memory (Advanced/Agentic RAG)
Dify platform capabilities (one-stop)
- Prompt engineering and evaluation
- Knowledge pipeline: parent-child docs, hybrid recall, re-rank
- Agent orchestration: tool-augmented, visual workflows
- Full lifecycle ops: logs, labeling, analytics
Knowledge Pipeline core capabilities
Enterprise connectors
- Local files: 30+ formats (PDF, Word, Excel, etc.)
- Cloud storage: Google Drive, S3, Azure Blob, etc.
- Online docs: Notion, Confluence, SharePoint
- Web crawling: Firecrawl, Jina, Bright Data
Visual debugging & orchestration
- Canvas orchestration: connect sources → process documents
- Live debugging: step testing, inspect intermediate variables
- Standardized pipelines: publish into managed flows
Prebuilt templates & flows
- General document: cost-effective indexing for bulk corpora
- Long document: parent-child chunking to keep precision + global context
- Table extraction: build structured QA pairs
- Complex PDF parsing: targeted chart/figure extraction
- Multimodal enrichment: LLM-generated chart descriptions for better recall
Pipeline core steps: Extract → Transform → Load
Enterprise value
Lower the barrier
- Business teams can participate directly
- Visual debugging to spot issues fast
- Engineers focus on core product work
Boost efficiency
- Templatized flows are reusable
- Swap components flexibly
- Stable architecture cuts maintenance cost
Vision: make enterprise unstructured data processing simple, reliable, and efficient
Summary & actions
- RAG is evolving from static "state" to dynamic "memory."
- The ceiling of memory is set by the knowledge pipeline and outer-loop evaluation.
- Dify × Milvus provide an end-to-end path to build, store, and use memory.
Thank you
Contact: banana@dify.ai · GitHub: dify