The Evolution of RAG:
From State to Memory
Dify × Milvus Joint Tech Talk
Speaker: Zheng Li · Head of Open Source Ecosystem @ Dify
Unstructured Data Meetup · 30 minutes
The real value of RAG: give the LLM memory
- LLMs excel at compute; their weights hold static knowledge from training time.
- RAG mounts an external, dynamic memory onto the model.
- Industry view: RAG is a spectrum.
- Key question: how do we build, govern, and use that memory?
RAG spectrum (agenda)
- Naive RAG: simple "state" retrieval
- Advanced RAG: systematically upgrading the "state" quality
- Agentic RAG: turning memory into part of an agent
- Knowledge Pipeline: the production line for high-quality memory
Phase 1 · Naive RAG (simple state retrieval)
- Flow: Query → Embedding → vector search (Milvus) → chunks → LLM
- Pain: semantic breaks, noisy hits, Lost in the Middle
- Conclusion: usable but not delightful → static, low-quality "state"
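The flow above can be sketched end to end in a few lines. Everything here is a stand-in: the toy hash embedder replaces a real embedding model, the in-memory list replaces a Milvus collection, and the returned prompt is what would be sent to the LLM.

```python
import math

def _bucket(token: str, dim: int) -> int:
    # Deterministic toy hash; a stand-in for a real embedding model.
    return sum(ord(c) for c in token) % dim

def embed(text: str, dim: int = 32) -> list[float]:
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[_bucket(token.strip(".,?!"), dim)] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# In-memory chunk store standing in for a Milvus collection.
chunks = [
    "Milvus stores and indexes dense vectors at scale.",
    "Dify orchestrates LLM applications visually.",
    "RAG mounts external memory onto a language model.",
]
index = [(embed(c), c) for c in chunks]

def naive_rag(query: str, k: int = 2) -> str:
    # Query -> embedding -> vector search -> chunks -> prompt for the LLM.
    q = embed(query)
    hits = sorted(index, key=lambda iv: cosine(q, iv[0]), reverse=True)[:k]
    context = "\n".join(c for _, c in hits)
    return f"Answer using only this context:\n{context}\n\nQ: {query}"
```

Even this toy version shows the pain points: the top-k hits are whatever is closest in vector space, noise included, with no re-ranking or context hygiene.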
Phase 2 · Advanced RAG (systematic quality lift)
Three hard rules
- Hybrid recall: vectors + keyword/regex + metadata filters; candidates 100–300
- Re-rank before assembly: Cross-Encoder or LLM re-rank → top 20–40
- Respect context rot: structured, tight context beats stuffing the window
Context assembly: instruction-first, dedupe/merge, diversify sources, strict token budget
Industry tip: first-stage hybrid recall with 200–300 candidates is fine; always re-rank before assembling context.
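The three rules can be sketched with toy scorers. The dense, lexical, and re-rank functions below are stubs (assumptions, not real models), but the shape is the point: wide hybrid recall, then re-rank to a small top-k, then assembly under a hard budget with dedupe.

```python
# Toy corpus with metadata; a stand-in for a Milvus collection.
docs = [
    {"id": i, "text": t, "lang": "en"}
    for i, t in enumerate([
        "Hybrid recall combines dense vectors with keyword filters.",
        "Re-rank candidates with a cross-encoder before assembly.",
        "Context rot: long, noisy prompts degrade answer quality.",
        "Unrelated release notes for version 2.4.1.",
    ])
]

def dense_score(query: str, text: str) -> float:   # stub for vector similarity
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) ** 0.5 * len(t) ** 0.5)

def lexical_score(query: str, text: str) -> int:   # stub for BM25 / keywords
    return sum(text.lower().count(w) for w in query.lower().split())

def rerank_score(query: str, text: str) -> float:  # stub for a Cross-Encoder
    return dense_score(query, text) + 0.1 * lexical_score(query, text)

def hybrid_rag(query, filters=None, recall_n=3, top_k=2, budget_chars=200):
    # 0) metadata filters narrow the pool before any scoring
    pool = [d for d in docs
            if all(d.get(k) == v for k, v in (filters or {}).items())]
    # 1) wide first-stage hybrid recall (dense + lexical)
    pool.sort(key=lambda d: dense_score(query, d["text"])
              + lexical_score(query, d["text"]), reverse=True)
    candidates = pool[:recall_n]
    # 2) always re-rank before assembly
    candidates.sort(key=lambda d: rerank_score(query, d["text"]), reverse=True)
    # 3) assembly: dedupe by id, enforce a hard budget
    context, seen = [], set()
    for d in candidates[:top_k]:
        if d["id"] in seen or sum(len(c) for c in context) + len(d["text"]) > budget_chars:
            continue
        seen.add(d["id"])
        context.append(d["text"])
    return context
```

In production, `recall_n` would be the 100–300 candidate pool and `top_k` the 20–40 survivors; the structure stays the same.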
Dify practice
- Parent-child retrieval: hit child chunks, return parent blocks to balance precision and context
- Reranking: Milvus fast recall → re-rank → feed the LLM
- Trend: LLM-as-reranker is rising; as cost and latency drop, expect more brute-force-style information cleanup.
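Parent-child retrieval can be sketched as a two-level index. The chunking and scoring below are toy stand-ins, but the mechanic is the one described above: match on small child chunks for precision, return the enclosing parent block for context, and dedupe parents.

```python
# Parent blocks (e.g. full sections) and their small child chunks.
parents = {
    "p1": "Section 1. Milvus setup: install, create a collection, build an index.",
    "p2": "Section 2. Query flow: embed the question, search top-k, assemble context.",
}
children = [  # (child_text, parent_id); a stand-in for an indexed child-chunk table
    ("install milvus", "p1"),
    ("create a collection and index", "p1"),
    ("embed the question", "p2"),
    ("search top-k and assemble context", "p2"),
]

def score(query: str, text: str) -> int:  # stub for vector similarity
    return len(set(query.lower().split()) & set(text.lower().split()))

def parent_child_retrieve(query: str, k: int = 2) -> list[str]:
    # Hit precise child chunks...
    hits = sorted(children, key=lambda c: score(query, c[0]), reverse=True)[:k]
    # ...but return the (deduped) parent blocks for full context.
    seen, blocks = set(), []
    for _, pid in hits:
        if pid not in seen:
            seen.add(pid)
            blocks.append(parents[pid])
    return blocks
```

Two child hits in the same section collapse to one parent block, which is exactly the precision/context trade the slide describes.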
Don’t ship RAG; ship retrieval
- Problem: calling it "RAG" hides the key design trade-offs.
- Primitives: dense, lexical/regex, filters, re-rank, assembly, eval loop.
- Move: win the first phase with hybrid recall (200–300 candidates is okay).
- Discipline: always re-rank before context assembly; respect context rot.
Thoughts · The future and trade-offs of re-ranking
- Trend: LLM-as-reranker will become mainstream; specialized rerankers may fade.
- Reality: firing 300 parallel LLM re-ranks still hurts tail latency today.
- Strategy: mix models short-term (Cross-Encoder + LLM); mid-term rely on caching/sharding to tame the tail.
- Future: cheaper/faster LLMs make brute-force info cleanup viable.
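One mitigation named above, caching re-rank calls, is cheap to add. Here the LLM judge is a stub (an assumption), with `functools.lru_cache` standing in for a shared score cache so a hot query never pays the tail twice.

```python
import functools

CALLS = {"llm": 0}  # counts simulated LLM invocations

@functools.lru_cache(maxsize=10_000)
def llm_rerank_score(query: str, passage: str) -> float:
    """Stub for an expensive LLM re-rank call; cached so repeated
    (query, passage) pairs never hit the model twice."""
    CALLS["llm"] += 1
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    return sorted(passages,
                  key=lambda p: llm_rerank_score(query, p),
                  reverse=True)[:top_k]
```

Running the same query twice costs one LLM call per passage, not two; in a real deployment the cache would be keyed on content hashes and shared across replicas.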
Phase 3 · Agentic RAG (state → memory)
- RAG moves from a passive pipeline to an active tool the agent calls
- Query rewriting: clarify the ask before searching
- Multi-step / looped retrieval: decide next action from intermediate results
- Dify: turn RAG into tools inside agent orchestration: plan → retrieve → reflect → iterate
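The plan → retrieve → reflect → iterate loop can be written as a minimal sketch. `rewrite_query`, `retrieve`, and `good_enough` are stubs (assumptions) for the LLM query rewriter, the hybrid retriever, and the reflection step respectively.

```python
KNOWLEDGE = {  # stand-in for a Milvus-backed knowledge base
    "milvus index types": "Milvus supports HNSW, IVF, and DiskANN indexes.",
    "dify agent tools": "Dify exposes retrieval as a tool agents can call.",
}

def rewrite_query(question: str, attempt: int) -> str:
    # Stub for an LLM query rewriter: clarify the ask before searching.
    rewrites = ["milvus index types", "dify agent tools"]
    return rewrites[attempt % len(rewrites)]

def retrieve(query: str) -> str:
    # Stub for hybrid recall + re-rank.
    return KNOWLEDGE.get(query, "")

def good_enough(question: str, evidence: str) -> bool:
    # Stub for LLM self-reflection on intermediate results.
    return any(w in evidence.lower() for w in question.lower().split())

def agentic_rag(question: str, max_steps: int = 3) -> str:
    evidence = ""
    for step in range(max_steps):
        q = rewrite_query(question, step)    # plan: clarify the ask
        evidence = retrieve(q)               # act: retrieve
        if good_enough(question, evidence):  # reflect on the result
            return evidence                  # stop, or loop with a new query
    return evidence
```

The difference from the earlier phases is the loop: retrieval is no longer a fixed pipeline stage but an action the agent can repeat with a rewritten query.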
Foundation · Knowledge Pipeline (memory production line)
[Ingest]
- Parse + chunk (domain-aware: headings, code blocks, tables)
- Enrich: headings, anchors, symbols, metadata
- Optional: block summaries (natural-language glosses for code/API blocks)
- Embed: dense vectors + optional sparse signals
- Write to Milvus (text, vectors, metadata)
[Query]
- First-stage hybrid recall: vectors + lexical/regex + metadata filters
- Candidate pool: about 100–300 → re-rank to top 20–40
- Context assembly: instruction-first, dedupe/merge, diversify, hard token cap
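The two halves above can be sketched as one pipeline. The heading-aware chunker, token-set "embedder", and in-memory store are stand-ins for the real parser, embedding model, and Milvus, but the shape matches: ingest parses, chunks, enriches, and loads; query filters by metadata, then does first-stage recall.

```python
import re

STORE = []  # stand-in for a Milvus collection: (vector, text, metadata)

def embed(text: str) -> frozenset:
    # Toy "vector": a token set (stand-in for a dense embedding).
    return frozenset(re.findall(r"\w+", text.lower()))

def ingest(doc: str, source: str) -> None:
    # Parse + chunk on markdown-style headings (domain-aware chunking stub).
    for block in re.split(r"\n(?=#)", doc):
        heading = block.splitlines()[0].lstrip("# ").strip()
        STORE.append((embed(block), block.strip(),
                      {"source": source, "heading": heading}))  # enrich + load

def query(q, source=None, top_k=2):
    qv = embed(q)
    # Metadata filter, then first-stage recall; re-rank would follow here.
    pool = [r for r in STORE if source is None or r[2]["source"] == source]
    pool.sort(key=lambda r: len(qv & r[0]), reverse=True)
    return [r[1] for r in pool[:top_k]]
```

Because ingest writes structured records (text, vector, metadata), the same store serves filtered recall for any downstream app.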
Law: garbage in, garbage out
Outer loop: evaluation and operational feedback
- Cache + cost guardrails
- Small gold set → plug into CI + dashboards
- Error analysis: re-chunk / tweak filters / tune re-rank prompts
- Memory compaction: summarize interaction traces into retrievable facts
Tip: spend an evening (pizza night) to build a tiny gold set and wire it into CI and dashboards.
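A pizza-night gold set really can be this small. The retriever below is a stub, and recall@k over a handful of labeled queries is used as the metric (a simple, common choice here, not the only option); the final assertion is the line you would gate CI on.

```python
# Tiny gold set: (query, id of the chunk that should be retrieved).
GOLD = [
    ("how do I create an index", "c1"),
    ("what is context rot", "c2"),
]
CHUNKS = {
    "c1": "Create an index on the vector field before searching.",
    "c2": "Context rot: long noisy prompts degrade answers.",
    "c3": "Unrelated changelog entry.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    # Stub for the real retrieval stack under test.
    def score(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(CHUNKS, key=lambda cid: score(CHUNKS[cid]), reverse=True)
    return ranked[:k]

def recall_at_k(gold, k: int = 2) -> float:
    hits = sum(1 for q, cid in gold if cid in retrieve(q, k))
    return hits / len(gold)

# In CI: fail the build if retrieval quality regresses.
assert recall_at_k(GOLD) >= 0.9
```

Error analysis then follows the slide: a failing query tells you whether to re-chunk, tweak filters, or tune the re-rank prompt.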
Process once, reuse everywhere
- Decouple: knowledge processing ↔ app development
- Reuse: one Milvus knowledge base serves multiple Dify apps
- Quality: govern memory in one place, raise the ceiling for all downstream apps
Dify × Milvus: division of labor
Milvus = memory foundation
- Store/index/recall vectors and metadata efficiently
- Stable, reliable, scalable
Dify = memory + app platform
- Knowledge pipeline: build/manage/optimize memory (write to Milvus)
- Application engine: orchestrate and use memory (Advanced/Agentic RAG)
Dify platform capabilities (one-stop)
- Prompt engineering and evaluation
- Knowledge pipeline: parent-child docs, hybrid recall, re-rank
- Agent orchestration: tool-augmented, visual workflows
- Full lifecycle ops: logs, labeling, analytics
Knowledge Pipeline core capabilities
Enterprise connectors
- Local files: 30+ formats (PDF, Word, Excel, etc.)
- Cloud storage: Google Drive, S3, Azure Blob, etc.
- Online docs: Notion, Confluence, SharePoint
- Web crawling: Firecrawl, Jina, Bright Data
Visual debugging & orchestration
- Canvas orchestration: connect sources → process documents
- Live debugging: step testing, inspect intermediate variables
- Standardized pipelines: publish into managed flows
Prebuilt templates & flows
- General document: cost-effective indexing for bulk corpora
- Long document: parent-child chunking to keep precision + global context
- Table extraction: build structured QA pairs
- Complex PDF parsing: targeted chart/figure extraction
- Multimodal enrichment: LLM-generated chart descriptions for better recall
Pipeline core steps: Extract → Transform → Load
Enterprise value
Lower the barrier
- Business teams can participate directly
- Visual debugging to spot issues fast
- Engineers focus on core product work
Boost efficiency
- Templatized flows are reusable
- Swap components flexibly
- Stable architecture cuts maintenance cost
Vision: make enterprise unstructured data processing simple, reliable, and efficient
Summary & actions
- RAG is evolving from static "state" to dynamic "memory."
- The ceiling of memory is set by the knowledge pipeline and outer-loop evaluation.
- Dify × Milvus provide an end-to-end path to build, store, and use memory.
Thank you
Contact: banana@dify.ai · GitHub: dify