From Autonomous AI
to Production-Grade
Agent Systems

Human checkpoints, SOP-managed context, and sandboxed execution for production-grade agent systems.

crazywoola (Banana)

Developer Relations @ Dify

banana@dify.ai

Agents Can Act — But Production Is Not Ready

Model capability has crossed the "usable" threshold, but teams pushing agents into production keep hitting the same three walls.

Hallucinations Reach Users

Without human checkpoints, AI-generated errors can reach end users directly. One incident destroys trust.

Compliance Cannot Close

Finance, healthcare, and government need approval records and traceable audit trails. Pure automation fails the audit.

Fragile Toolchains

Prompts grow endlessly, tool lists keep expanding, handoffs rely on hidden state — maintenance cost far exceeds expectations.

Session Agenda

The Evolution

From prompt pipelines to orchestrated agents — why the shift matters

3 min

Human-in-the-Loop (HITL)

Pause / resume / approve mid-execution for compliance-safe deployments

8 min

Agent × Skills

Current failure modes, Skills, SOP context, and explicit deliverables

12 min

Sandboxed Runtime & Collaboration

POSIX-style runtime, Command node, safe execution, and team collaboration

10 min

Q&A

Open Discussion

Live demo · your production use-cases · roadmap preview

5 min

Total ~38 min

Part 1

The Evolution of AI Systems

Moving beyond one-shot prompts into fully orchestrated, controllable agent architectures.

Three Generations of LLM Applications

Each generation unlocks new value — and new complexity.

Gen 1

Prompt → Response

Single-turn completions. No memory, no tools, no state.

ChatGPT wrappers · one-shot summarizers

Gen 2

Pipeline Orchestration

Chained nodes with data transformation, RAG, and conditional branching.

LangChain · Dify Workflow

Gen 3 · Now

Agent Systems

Pause on human input. Reuse skills and SOPs. Pass explicit deliverables forward. Execute safely inside a sandbox.

HITL · Skills · Sandbox

Three Architectural Advances

The features that define Dify's production-grade agent direction.

Human-in-the-Loop Nodes

Dedicated execution nodes that pause a workflow and wait for human approval, override, or rejection before continuing. Enables auditable, compliance-friendly AI.

Agent × Skills

A thinner agent layer built around reusable Skills, SOP-managed context, and explicit deliverables instead of giant prompt blobs.

Sandboxed Runtime & Collaboration

A POSIX-style runtime with isolated execution, command-based workflows, and collaborative authoring around shared operational knowledge.

Part 2

Human-in-the-Loop (HITL)

Put human judgment inside the workflow graph, not beside it.

Why Teams Need HITL & Where It Fits

Oversight should not be a patch — it should be a native gate placed exactly where the workflow needs it.

Shifting Objectives

Work changes mid-run. A pause point keeps workflows from becoming rigid.

Trust Gap

High-stakes teams need visible checkpoints before AI can act for the business.

Integration Complexity

External approval queues and webhooks turn oversight into extra engineering, not native capability.

Before External Actions

Pause before a workflow sends an email, publishes content, submits a ticket, or contacts a customer.

When Confidence Drops

Use HITL on anomalies and edge cases instead of reviewing every run.

When Context Is Missing

Ask for one missing field, then continue automatically with the updated value.

When Policy Requires Sign-Off

Finance, compliance, and customer-facing flows often need a visible approval checkpoint.
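The four placements above can be read as a small gating rule. The following is only a sketch under stated assumptions: `StepContext`, `hitl_reasons`, and the 0.8 confidence threshold are hypothetical names and values chosen for illustration, not part of Dify's node configuration.

```python
from dataclasses import dataclass, field


@dataclass
class StepContext:
    """Hypothetical summary of what a workflow step is about to do."""
    external_action: bool = False    # e.g. send email, publish, contact a customer
    confidence: float = 1.0          # assumed score in [0, 1]
    missing_fields: list = field(default_factory=list)
    requires_sign_off: bool = False  # policy demands a visible approval


def hitl_reasons(step: StepContext, min_confidence: float = 0.8) -> list:
    """Return every reason to pause; an empty list means run unattended."""
    reasons = []
    if step.external_action:
        reasons.append("before external action")
    if step.confidence < min_confidence:
        reasons.append("confidence dropped")
    if step.missing_fields:
        reasons.append("context missing: " + ", ".join(step.missing_fields))
    if step.requires_sign_off:
        reasons.append("policy requires sign-off")
    return reasons


print(hitl_reasons(StepContext()))
print(hitl_reasons(StepContext(external_action=True, confidence=0.55)))
```

Returning the reasons rather than a bare boolean keeps the pause explainable to the reviewer, and an empty list means the run continues without review.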

→ AUDIENCE CHECK-IN: "Before I explain how it works — raise your hand if you've ever had to build a custom approval integration for an AI workflow. Webhooks, Slack bots, manual email chains..." (pause) "Those are all symptoms of the same problem: human oversight was designed as an afterthought." Walk the three problems: "Shifting objectives — priorities change mid-run, no pause point means the workflow becomes rigid. Trust gap — high-stakes teams need visible checkpoints. Integration complexity — external approval queues turn oversight into extra engineering." Then walk the four placements: "HITL is not everywhere — it belongs at these four types of points: before external actions; when confidence drops on edge cases; when context is missing — ask for one field and continue; when policy requires sign-off — finance and compliance often have non-negotiable requirements." Read the banner: "Fewer, better-placed nodes usually beat reviewing every step. If reviewers still need another system, the node design is incomplete." (~2 min)

How a Human Input Node Works

Execution pauses, a human sees the right context, and the workflow resumes on one of three simple outcomes.

Workflow Running

→

HITL Gate

→

Paused · Notification Sent

↓

Human Reviews Context

↓

Approve

↓ Continue

Edit & Approve

↓ Modified Values

Reject

↓ Alt Branch

Delivery

Generate a review page and route it to the right person.

Variables

Insert editable fields and return new values safely.

Actions

Buttons, branches, and timeout rules to ensure resumption.
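As a rough illustration of the pause and resume flow, the sketch below maps each reviewer outcome to a branch and a set of values. The `Decision` enum, the `Resume` record, and the branch names `continue`, `alternative`, and `fallback` are assumptions made for this example; the real node exposes its own configuration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    EDIT_AND_APPROVE = "edit_and_approve"
    REJECT = "reject"
    TIMEOUT = "timeout"  # no reviewer answered within the configured window


@dataclass
class Resume:
    branch: str   # which downstream branch the workflow takes
    values: dict  # variables handed to the next node


def resume_after_review(values: dict, decision: Decision,
                        edits: Optional[dict] = None) -> Resume:
    """Map a reviewer outcome to the branch and values used on resumption."""
    if decision is Decision.APPROVE:
        return Resume("continue", dict(values))
    if decision is Decision.EDIT_AND_APPROVE:
        # Edited fields override the originals and flow downstream.
        return Resume("continue", {**values, **(edits or {})})
    if decision is Decision.REJECT:
        return Resume("alternative", dict(values))
    # A timeout still resumes, so the run never hangs forever.
    return Resume("fallback", dict(values))


print(resume_after_review({"amount": 100}, Decision.EDIT_AND_APPROVE, {"amount": 120}))
```

Every outcome, including a timeout, resolves to some branch, which is the property the Actions card describes as ensuring resumption.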

Walk through the flow top to bottom: "The workflow is running. It hits the HITL gate and pauses. A notification goes to the right person — with exactly the context they need to make a decision, no other system needed." Point to each outcome: "Approve — continue as-is. Edit and approve — modify a value, then continue; the modified value flows into downstream steps. Reject — route to an alternative branch, ensuring the workflow always has an exit." Point to the three cards below: "These map to node setup: Delivery configures how the reviewer is notified and where they review. Variables defines what editable fields they see. Actions sets which buttons trigger which branch." Key point to land: "The review page IS the job surface. If the reviewer needs to open another system to finish the task, the node design is incomplete." (~2 min)

Liang · Investment Services

HITL adds expert judgment exactly where automated output becomes client-facing.

Scaling Problem

40 min manual work per client

100+ clients to serve

Report generation was automated, but compliance still needed a final look before financial updates reached clients.

HITL Placement

after synthesis · on anomalies · before send

Reviewers saw exactly what clients would receive, edited if needed, and approved with one click. By June, all 100 clients received consistent reports.
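To make the scale concrete, here is the arithmetic behind the two numbers on this slide, taking 100 clients as a lower bound for "100+". The variable names are illustrative only.

```python
MINUTES_PER_CLIENT = 40  # from the slide
CLIENTS = 100            # lower bound; the slide says 100+

total_minutes = MINUTES_PER_CLIENT * CLIENTS
total_hours = total_minutes / 60
print(f"{total_minutes} minutes = {total_hours:.1f} hours of manual work")
```

That works out to roughly 67 hours for one round of reports, which is why review had to move to a few well-placed checkpoints.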

"Here's the first HITL pattern: compliance approval. Liang runs an investment services team — personalized financial updates to over 100 clients, every single day." Build the problem: "At 40 minutes of manual work per client, that's 67 hours daily. They automated report generation — but compliance still needed a human eye before anything financial reached a client." The HITL placement: "Three nodes: after synthesis — to catch hallucinated figures before formatting; on anomalies — to flag only reports that deviate from the prior week's baseline; before send — one final approval click." The result: "By June, all 100 clients received consistent daily reports. Reviewers weren't doing the work anymore — they were applying judgment exactly where it matters most." Read the quote as if Liang said it directly to you. (~2 min)

Min · Global Support Team

HITL is not only for approval. It is also a clean way to request missing context.

Support Challenge

Employees moved across separate HR, finance, and IT portals. Requests often arrived without the details needed to route them correctly.

single entry point · query classification · knowledge routing

Where HITL Helped

When Jason from R&D asked about reimbursement, the workflow noticed missing location data, requested it through Human Input, then returned the right Shanghai-office policy.
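The context-collection pattern can be sketched as a step that either answers or returns a request for exactly one missing field. Everything here is hypothetical: the `POLICIES` table, the `answer_reimbursement` function, and the status strings are stand-ins for whatever the real workflow uses.

```python
# Hypothetical policy lookup keyed by office.
POLICIES = {"Shanghai": "Shanghai-office reimbursement policy"}


def answer_reimbursement(request: dict) -> dict:
    """Answer if the office is known; otherwise ask for that one field."""
    office = request.get("office")
    if office is None:
        # Pause point: a Human Input step would collect this value.
        return {"status": "needs_input", "field": "office",
                "ask": "Which office are you based in?"}
    policy = POLICIES.get(office)
    if policy is None:
        return {"status": "unknown_office", "office": office}
    return {"status": "answered", "policy": policy}


request = {"user": "Jason", "question": "What is the reimbursement policy?"}
first = answer_reimbursement(request)                       # workflow pauses
second = answer_reimbursement({**request, "office": "Shanghai"})  # workflow resumes
print(first["ask"])
print(second["policy"])
```

The second call reuses the original request with one added field, which mirrors the idea of continuing automatically with the updated value.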

Transition: "That was compliance approval — HITL as a gate. This case is different — HITL as context collection. That's the second pattern." Open: "Min's team consolidated HR, finance, and IT portals into a single support entry point for a global company. Employees used to navigate three separate systems. Now they go to one place." The Jason scenario: "Jason from R&D asks: 'What's the reimbursement policy?' Simple question — but the answer depends on which office he's in. The workflow doesn't have that information. The HITL node pauses and asks Jason: 'Which office are you based in?' He says Shanghai. The workflow resumes and returns the correct Shanghai-office policy." Key reframe: "That's not approval — that's collaborative intelligence. The AI does the heavy lifting; the human fills in the one gap the AI couldn't infer." Read the quote out loud. (~2 min)

Part 3

Agent × Skills

The agent becomes smaller: choose the right SOP, call the right skill, and leave usable artifacts behind.

From Monolithic Prompt to Thin Orchestrator

Four failure modes in today's agent workflows — and the better operating model.

Single-run flow · Tool noise · Fragile files · Long debug loops

Before — Prompt Does Everything

Inside the prompt

tool routing · file handling · retry logic · output formatting

What breaks

duplicated logic · hard to test · tool bloat · hidden state

After — Agent Orchestrates

Agent owns

goal · choose SOP · call skills · pick deliverables

Workflow gets

text · files · fields · memory snapshot

"Before we talk about Skills and SOPs, let's be honest about where current agent workflows break down — and what the better operating model looks like." Walk the four failure chips quickly: "Single-run flow — multi-turn behavior and multi-agent handoffs are still awkward. Tool noise — the more tools you add, the more fragile routing becomes. Fragile files — files move through hidden IDs nobody can debug. Long debug loops — strategy, routing, file flow, and memory are split across four layers." Pivot to the comparison: "Left side: today's pattern — the prompt handles tool routing, file handling, retry logic, and output formatting. It works until it doesn't. Then you have no idea which part broke. Right side: the better model — the agent owns the goal and sequencing, skills own execution, the workflow gets structured outputs it can actually route on." Say this out loud: "Skills are to agents what functions are to software: reusable units with explicit inputs and outputs." (~1.5 min)

Missing Deliverables Break Workflows

If the useful artifact stays inside the agent's memory, downstream nodes can only guess from prose.

Workflow example showing an agent feeding an IF/ELSE node

Example: an IF/ELSE node tries to infer state from plain text.

Text is not state

Checking whether the agent happened to say "success" is brittle and hard to maintain.

Files disappear

Raw tables, reports, or generated artifacts can stay buried in memory while the next node only sees the summary.

Agents cannot relay work

The next agent cannot reliably see what the previous one actually delivered.
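A tiny example shows why text is not state. Both functions below are hypothetical branch conditions written for illustration: one matches on prose, the other reads an explicit field.

```python
def brittle_branch(agent_text: str) -> bool:
    """Guess the outcome from prose. Matches the word, not the meaning."""
    return "success" in agent_text.lower()


def structured_branch(fields: dict) -> bool:
    """Read an explicit status field that the upstream node committed to."""
    return fields.get("status") == "success"


agent_text = "The upload was not a success; a retry is needed."
fields = {"status": "failed", "retryable": True}

print(brittle_branch(agent_text))   # True, although the step failed
print(structured_branch(fields))    # False
```

The prose check passes on a failed run because the word appears in the sentence, while the structured check depends only on a value the upstream node set deliberately.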

"This is the most common failure mode in production agent workflows — I see it regardless of team skill level." Point to the workflow image: "An agent runs, calls tools, does real work. But the downstream IF/ELSE node can only see text output. So teams write conditions like: 'if the output contains the word success, continue.' That is not state management — that is hoping." Walk the three failure modes: "Text is not state — string matching is brittle and hard to maintain. Files disappear into memory — the next node only sees the summary. Agents cannot relay work — the next agent doesn't reliably know what the previous one actually delivered." Land hard: "If downstream cannot consume it, it is not a real deliverable. It's a side effect you're praying works." (~1 min)

What a Node Should Hand Off

A production workflow needs more than a polished answer.

Text Answer

The human-facing explanation or final response.

Files

Reports, tables, images, and other artifacts downstream steps can keep using.

Structured Fields

Status, decisions, IDs, and parameters that branches or tools can read directly.

Memory Snapshot

Reusable context that later nodes can extract facts, parameters, or files from.
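One way to picture the four handoff parts is as a single record that downstream nodes can load without parsing prose. This is a minimal sketch; the `Handoff` class and its field names are assumptions for illustration, not a Dify schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Handoff:
    """Hypothetical container for what one node passes to the next."""
    text: str                                             # human-facing answer
    files: list = field(default_factory=list)             # artifact paths
    fields: dict = field(default_factory=dict)            # status, IDs, parameters
    memory_snapshot: dict = field(default_factory=dict)   # reusable context

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


handoff = Handoff(
    text="Weekly report generated.",
    files=["reports/weekly.csv"],
    fields={"status": "success", "row_count": 42},
    memory_snapshot={"client_id": "C-001"},
)

# A downstream node can rebuild the record and branch on a field directly.
restored = Handoff(**json.loads(handoff.to_json()))
print(restored.fields["status"])
```

Serializing the whole record keeps the text answer, but branches and tools read `fields` and `files` rather than guessing from the wording.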

What Is a Skill?

A reusable execution unit that bundles SOP context, runtime behavior, and a reliable handoff contract.

SOP-Backed

The “how to do this well” playbook lives with the skill instead of being copied across nodes.

Reusable

Publish once, then invoke it from different agents and workflows.

Testable

Run the skill with fixture inputs without triggering the full workflow.

Version-Pinned

Agents can pin a stable version instead of breaking whenever a shared skill changes.

Typical Input Sources

conversation context · prior node outputs · files · memory extraction
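The four properties can be sketched with a small in-memory registry. This is an illustrative model only: `Skill`, `SkillRegistry`, the skill name `ticket-classification`, and both classifier functions are hypothetical, and real Skills will have their own packaging and versioning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    name: str
    version: str
    sop: str                     # the playbook text that travels with the skill
    run: Callable[[dict], dict]  # explicit inputs in, explicit outputs out


class SkillRegistry:
    def __init__(self) -> None:
        self._skills = {}

    def publish(self, skill: Skill) -> None:
        self._skills[(skill.name, skill.version)] = skill

    def get(self, name: str, version: str) -> Skill:
        return self._skills[(name, version)]


def classify_v1(inputs: dict) -> dict:
    text = inputs["ticket"].lower()
    return {"category": "returns" if "refund" in text else "other"}


def classify_v2(inputs: dict) -> dict:
    text = inputs["ticket"].lower()
    if "refund" in text:
        return {"category": "returns"}
    if "tracking" in text:
        return {"category": "shipping"}
    return {"category": "other"}


registry = SkillRegistry()
sop = "Classify the ticket before drafting a reply."
registry.publish(Skill("ticket-classification", "1.0.0", sop, classify_v1))
registry.publish(Skill("ticket-classification", "2.0.0", sop, classify_v2))

# Fixture input: exercises the skill without running a whole workflow.
fixture = {"ticket": "Where is my tracking number?"}
pinned = registry.get("ticket-classification", "1.0.0")
latest = registry.get("ticket-classification", "2.0.0")
print(pinned.run(fixture), latest.run(fixture))
```

An agent pinned to 1.0.0 keeps its old behavior after 2.0.0 is published, and the fixture call doubles as a regression test for either version.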

"A Skill is the unit of reuse in this model. Think of it as a function for agent behavior — explicit inputs, explicit outputs, testable in isolation." Walk the four properties: "SOP-backed — the 'how to do this well' playbook lives with the skill, not copied into every prompt; update once and it takes effect everywhere. Reusable — publish once, invoke from any agent or workflow. Testable — run it with fixture inputs without triggering the whole workflow; regression testing becomes simple. Version-pinned — agents can pin a stable version instead of breaking when a shared skill changes." Point to the bottom row: "Skills take input from all the sources we just covered: conversation context, prior node outputs, files, and memory extraction." (~1 min)

One SOP Library, Many Entrypoints

Context engineering needs a shared home, not repeated prompt snippets.

Today — SOPs buried in nodes

same SOP pasted again · hard to review · best practices drift

Better — Shared /sops workspace

write once · different entry files · version with workflow

In Practice: From Scattered Prompts to a Shared Skill Library

An e-commerce ops team consolidated duplicated customer-service SOPs from 5 separate workflows into one shared Skill, cutting maintenance by 4×.

Before — 5 workflows, each with its own copy

returns flow · shipping inquiry · complaint handling · order exception · VIP service

Each workflow contained a near-identical "customer-service script SOP" and "ticket classification logic." Updating one meant updating all five.

After — 1 Skill, 5 entrypoints

shared SOP: service script · shared SOP: ticket classification

Each workflow only defines its own entrypoint and unique logic. Shared knowledge is maintained and versioned in one Skill.

"That's the theory. Here's what it looks like in practice." "An e-commerce ops team had five customer service workflows: returns, shipping, complaints, order exceptions, VIP. Each had a near-identical customer-service SOP and ticket classification logic." "Every time policy changed — new returns window, new escalation threshold — someone opened five workflows. Usually missed one. Inconsistency crept in." "After: one Skill with two shared SOPs. Each workflow only defines its own entrypoint and unique logic." Read the quote as if you heard it directly. "That's a 4× maintenance reduction — and it compounds as the team adds more workflows." (~1 min)

Skill + SOP Agent Architecture

Reasoning stays thin. Execution happens in a workspace built around files, commands, and reusable outputs.

Inputs

user request · prior node outputs · uploaded files

Agent Layer

choose SOP · assemble context · call skills · decide next step

Workspace

/sops · commands · files · versioned skills

Handoffs

text · files · fields · memory · HITL

Memory Extraction Makes Context Reusable

Memory stops being an implementation detail and becomes a reusable workflow artifact.

LLM Node A → Memory Store → Extraction LLM → Downstream Node B

LLM Node A: runs, produces context · Memory Store: full context preserved · Extraction LLM: reads & extracts params/files · Downstream Node B: receives structured values

Cost & Latency

The extraction LLM call is lightweight: it reads a bounded context window and outputs structured fields. Typical overhead is <1s and <500 tokens.

Fallback on Failure

If extraction fails, the node falls back to the upstream agent's raw text output, so the workflow never silently breaks.

How It Differs from RAG

RAG retrieves from an external corpus; Memory Extraction pulls from the same run's working context. No vector DB needed — this is intra-workflow state, not cross-session retrieval.
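The extract-then-fall-back behavior described above can be sketched as follows. The extractor is passed in as a plain callable so the example runs without any model; `extract_fields`, `fake_extractor`, and `broken_extractor` are hypothetical stand-ins, and the JSON contract is an assumption made for this illustration.

```python
import json
from typing import Callable


def extract_fields(memory_text: str, extractor: Callable[[str], str],
                   required: tuple) -> dict:
    """Try to turn run context into structured fields; fall back to raw text."""
    try:
        data = json.loads(extractor(memory_text))
        if not isinstance(data, dict):
            raise ValueError("extractor did not return a JSON object")
        missing = [key for key in required if key not in data]
        if missing:
            raise KeyError(missing)
        return {"ok": True, "fields": data}
    except (ValueError, KeyError, TypeError):
        # Fallback: hand the upstream text through so the run does not break silently.
        return {"ok": False, "raw_text": memory_text}


def fake_extractor(memory_text: str) -> str:
    """Stand-in for the extraction LLM call."""
    return json.dumps({"office": "Shanghai"} if "Shanghai" in memory_text else {})


def broken_extractor(memory_text: str) -> str:
    return "sorry, I could not parse that"


memory = "The user said they are based in the Shanghai office."
print(extract_fields(memory, fake_extractor, ("office",)))
print(extract_fields(memory, broken_extractor, ("office",)))
```

The `ok` flag lets the downstream node tell typed values from fallback text, so a failed extraction is visible rather than silently treated as success.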

"This is one of the most novel ideas today — take a moment with it." Lead with the analogy: "Think of RAG as going to the library. Memory Extraction is reading the notes you already wrote during this run — no library trip, no vector DB, no cross-session retrieval. This is intra-workflow state." Walk the pipeline: "Node A runs and produces context. That context goes into a Memory Store. An Extraction LLM — lightweight, typically under 1 second and 500 tokens — reads the bounded window and pulls out structured fields. Node B gets typed values it can actually route on." Address cost proactively: "Teams ask: doesn't that add latency? The extraction call is lightweight by design. And if it fails, the node falls back to the upstream agent's raw text output — the workflow never silently breaks." RAG difference: "RAG retrieves from an external corpus across sessions. Memory Extraction pulls from the same run's working context. Different problems entirely." → AUDIENCE CHECK-IN: "Any questions before I move on? This is where most 'aha' moments happen — and most confusion lives." (brief pause) (~2 min — the densest slide in the deck)

Part 4

Sandboxed Runtime & Collaboration

Once agents work over SOPs, files, and explicit deliverables, the runtime has to feel useful and safe at the same time.

Command Node: A Small but Powerful Primitive

One command line in, stdout out, files left behind for the next step.

Example

report --input ./turnsheet.csv --format json

command line in · stdout out · files stay in runtime

Natural for LLMs

Models already understand commands, pipes, and file paths from pretraining.

Smaller Product Surface

You do not need a custom UI for every tiny transformation or helper tool.

Better Handoffs

Bigger artifacts can stay as files and move forward explicitly instead of being squeezed into prompts.
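The contract "command line in, stdout out, files left behind" can be approximated with the standard library. This is a sketch, not the Command node's implementation: `run_command` is a hypothetical helper, and a one-line Python script stands in for the `report` command so the example runs anywhere Python does.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_command(argv: list, workdir: Path, timeout: float = 10.0) -> dict:
    """Run one command in workdir; return stdout plus the files it left behind."""
    proc = subprocess.run(argv, cwd=workdir, capture_output=True,
                          text=True, timeout=timeout)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "files": sorted(p.name for p in workdir.iterdir() if p.is_file()),
    }


# Stand-in for a real command: writes an artifact and prints a short status.
script = (
    "import json; "
    "open('report.json', 'w').write(json.dumps({'rows': 3})); "
    "print('report written')"
)

with tempfile.TemporaryDirectory() as tmp:
    result = run_command([sys.executable, "-c", script], Path(tmp))

print(result["stdout"].strip(), result["files"])
```

The stdout stays short while the larger artifact remains a file in the working directory, which is what lets the next step pick it up explicitly.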

From Tool Lists to a POSIX Workspace

Stop modeling every capability as a bespoke tool card. Let the runtime expose commands, files, and stdout.

Before — Tool-first orchestration

step1: A = google_search(query="Dify", max_size=30)
step2: B = summary(query=A)

hidden conversion · outputs stay in memory · every tool needs UI

After — POSIX-style execution

summary --query "$(google_search --query dify --max_size 30)"

string interface · shell composition · inspect with ls /bin
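The same composition can be reproduced from Python to show that nothing but strings crosses the boundary. In this sketch, two one-line Python scripts stand in for `google_search` and `summary`; both stand-ins and the `sh` helper are hypothetical and exist only to make the example self-contained.

```python
import subprocess
import sys


def sh(argv: list) -> str:
    """Run a command and return its stdout as a plain string."""
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout.strip()


# Stand-ins for the two commands on the slide.
search_cmd = [sys.executable, "-c", "print('result one; result two')"]
summary_code = "import sys; print('summary of: ' + sys.argv[1])"

# Equivalent in spirit to: summary --query "$(google_search ...)"
search_output = sh(search_cmd)
summary_output = sh([sys.executable, "-c", summary_code, search_output])
print(summary_output)
```

The first command's stdout becomes the second command's argument with no typed schema in between, which is the uniform string interface the slide refers to.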

Left side: "Old model — two tool calls, each with its own typed schema. The output of step 1 has to be stored in LLM memory and re-described for step 2. Type conversion is hidden in the framework, artifacts sit in context, every new tool needs its own UI." Right side: "New model — one command using shell composition. summary gets the output of google_search directly via $() substitution. Uniform string interface, native shell composition, and the agent can discover available tools by running ls /bin — no documentation needed." "The runtime becomes simpler for humans AND more legible to LLMs. That's a rare double win — engineering-friendly and model-friendly usually pull in opposite directions." (~1 min)

Sandboxed Code Execution

Agents need a real execution surface — just not the host machine. A sandbox makes both possible at once.

Host System Access

Without isolation, code can read local credentials, environment variables, and files.

No Resource Limits

A bad loop or memory spike can block workers and hurt everyone sharing the runtime.

Supply Chain Risk

Imported packages can quietly exfiltrate workflow data unless the environment is controlled.

Safety Boundary

No host filesystem access
Network restricted by allowlist
CPU and memory limits per run
Timeout configurable per node

Usable Runtime Surface

ls /bin · stdin/stdout I/O · files as handoff · Python 3.11+ · JavaScript (Node 20) · external file storage

How to Enable

Cloud: on by default · Self-hosted: set SANDBOX=true
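Two of the boundaries above, the network allowlist and the per-node timeout, can be illustrated with the standard library. This sketch is not a sandbox and is not how Dify implements isolation: `host_allowed`, `run_limited`, and the allowlisted host are hypothetical, the example assumes a POSIX host, and real filesystem, CPU, and memory limits need OS-level or container-level enforcement.

```python
import os
import subprocess
import sys
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.example.com"}  # hypothetical allowlist


def host_allowed(url: str, allowlist: set = ALLOWED_HOSTS) -> bool:
    """Allow a request only if its hostname is explicitly listed."""
    return urlparse(url).hostname in allowlist


def run_limited(code: str, timeout_s: float = 2.0) -> dict:
    """Run code in a child process with a scrubbed environment and a timeout."""
    env = {"PATH": os.defpath}  # parent secrets are not inherited
    try:
        proc = subprocess.run([sys.executable, "-I", "-c", code], env=env,
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout}


os.environ["DEMO_SECRET"] = "do-not-leak"
leak = run_limited("import os; print(os.environ.get('DEMO_SECRET'))")
loop = run_limited("while True: pass", timeout_s=1.0)

print(leak["stdout"].strip())  # the child cannot see the parent's variable
print(loop["status"])          # the runaway loop is stopped by the timeout
print(host_allowed("https://api.example.com/v1"), host_allowed("https://evil.example.net/x"))
```

The child process starts with an almost empty environment, so credentials held by the parent are not visible to it, and a runaway loop is killed when the timeout expires.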

"Let's put the 'why' and 'what' of the sandbox on one slide." Walk the three risks at the top: "Host system access — without isolation, code can read local credentials and environment variables. No resource limits — a bad loop can block every worker on the shared runtime. Supply chain risk — imported packages from an untrusted source can quietly exfiltrate workflow data." Then pivot to the solution below: "The sandbox addresses all three. Left side: four hard guarantees — no host filesystem access, network restricted by allowlist, CPU and memory limits per run, configurable timeout per node. Right side: everything you still get — ls /bin, stdin/stdout I/O, files as handoff artifacts, Python 3.11+, Node 20, and external file storage." Read the banner: "The goal is safe capability — a real runtime inside hard boundaries." "This is what makes command execution safe enough for production, not just experiments." (~1.5 min)

Observability: Make Every Step Traceable

A production system must not only run — it must be diagnosable when things go wrong and measurable day to day.

Node-Level Tracing

Every node's input, output, latency, and token usage is traced independently, so failures can be pinpointed to the exact step.

Cost Tracking

Token costs are broken down by workflow, node, and model, so teams know where the money is going.

Latency Analysis

Is the bottleneck in inference, tool calls, or file I/O? Latency distribution charts make optimization evidence-based.

Error Replay

Failed runs can be replayed with full context — no guessing, no reproduction steps needed.
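Node-level tracing and cost breakdown can be sketched with a wrapper that records one span per node run. All names here are hypothetical, including `Span`, `run_traced`, `cost_by_node`, the model names, and the per-1k-token prices; node functions are assumed to return a dict with a `tokens` count.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Span:
    node: str
    model: str
    inputs: dict            # kept so a failed run can be replayed
    output: Optional[dict]
    latency_s: float
    tokens: int
    error: Optional[str]


def run_traced(trace: list, node: str, model: str,
               fn: Callable[[dict], dict], inputs: dict) -> dict:
    """Run one node and record a span whether it succeeds or fails."""
    start = time.perf_counter()
    output, error = None, None
    try:
        output = fn(inputs)
        return output
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        trace.append(Span(node, model, inputs, output,
                          time.perf_counter() - start,
                          (output or {}).get("tokens", 0), error))


def cost_by_node(trace: list, usd_per_1k_tokens: dict) -> dict:
    totals = defaultdict(float)
    for span in trace:
        totals[span.node] += span.tokens / 1000 * usd_per_1k_tokens[span.model]
    return dict(totals)


trace = []
run_traced(trace, "draft", "small-model",
           lambda i: {"text": "draft", "tokens": 1000}, {"q": "hi"})
try:
    run_traced(trace, "review", "large-model", lambda i: 1 / 0, {"q": "hi"})
except ZeroDivisionError:
    pass  # the failure is already captured in the trace

costs = cost_by_node(trace, {"small-model": 0.002, "large-model": 0.01})
print(costs, trace[1].error)
```

Because each span keeps its inputs and error, a failed node can be re-run from the recorded inputs instead of reproducing the whole chain.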

"'Production-grade' is not just feature completeness — it is diagnosability when things break. The sandbox lets code run; observability tells you what happened." Walk the four cards: "Node-level tracing: every node's inputs, outputs, latency, and token usage are traced independently — failures can be pinpointed to the exact step, no guessing across the whole chain. Cost tracking: broken down by workflow, node, and model — the team knows where money goes and optimization has a target. Latency analysis: is the bottleneck inference, tool calls, or file I/O? Distribution charts beat intuition. Error replay: failed runs replay with full context — no guessing, no manual reproduction." "If you can't explain why a run failed, you can't improve it. Observability closes that loop." (~1 min)

Collaborative Workflow Development

The workflow itself becomes a shared product surface for the team.

Role-Based Access

Different people can draft, review, or publish without stepping on each other.

Version History

Every publish creates a snapshot, so teams can compare and roll back quickly.

Draft → Review → Publish

The lifecycle becomes visible and repeatable instead of living in screenshots and chat messages.

Shared SOP Library

Best practices stop living as private prompt snippets and become team assets.

Simple team flow

One person drafts the workflow, another reviews the SOPs, and a lead publishes the approved version with history still intact.
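The draft, review, publish lifecycle with snapshots can be modeled as a small state machine. This is a sketch under assumptions: the `Workflow` class, the state names, and the allowed transitions are hypothetical, and role checks are omitted for brevity.

```python
import copy

# Allowed lifecycle moves (an assumption for this example).
TRANSITIONS = {
    "draft": {"review"},
    "review": {"draft", "published"},
    "published": {"draft"},
}


class Workflow:
    def __init__(self, definition: dict) -> None:
        self.state = "draft"
        self.definition = definition
        self.history = []  # one snapshot per publish

    def move_to(self, new_state: str) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        if new_state == "published":
            self.history.append(copy.deepcopy(self.definition))
        self.state = new_state

    def rollback(self, version: int) -> None:
        """Restore the definition from a published snapshot (1-based)."""
        self.definition = copy.deepcopy(self.history[version - 1])


wf = Workflow({"nodes": ["agent"]})
wf.move_to("review")
wf.move_to("published")                 # snapshot 1
wf.move_to("draft")
wf.definition["nodes"].append("hitl")
wf.move_to("review")
wf.move_to("published")                 # snapshot 2
wf.rollback(1)
print(wf.definition, len(wf.history))
```

Snapshots are deep copies, so later edits to the draft never change what was published, and rolling back leaves the history intact.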

"The sandbox and observability solve the runtime side. This slide solves the other question: who owns the workflow?" "Once a workflow is an operational asset — not a personal experiment — collaboration isn't optional. It's required." Walk the four cards: "Role-based access: draft, review, publish — each person has the right surface without stepping on each other. Version history: every publish creates a snapshot, rollback is one click, no fear of breaking things. Draft → Review → Publish: the lifecycle becomes visible and repeatable instead of living in screenshots and chat messages. Shared SOP library: best practices stop being private prompt snippets and become shared team property." Read the scenario at the bottom: "One person drafts the workflow, another reviews the SOPs, a lead publishes the approved version with history still intact." "This is the difference between a team that builds workflows and a team that operates them." (~1 min)

Putting It All Together

A production agent system where reasoning, runtime execution, and human review all share explicit deliverables.

Input → Agent Reasoning Layer

user query / files / scheduled trigger · choose SOP · call skills · assemble context

Execution (Sandbox) + HITL Gate

command node · skills · files + stdio · code sandbox · pause → review → resume

Deliverables + Observability

text / files / structured fields · memory snapshot · trace log · cost tracking

"This is the architecture diagram worth sharing with engineering leads when proposing adoption." Walk the three tiers: "Input flows into the Agent Reasoning layer — it selects SOPs, calls skills, and assembles context. That feeds the Execution layer — command node, skills, files, sandbox — with HITL gates where human judgment is needed. Everything produces Deliverables plus an observability trace." ★ SAY THIS OUT LOUD — it's the most important line in the talk: "You don't need all three on day one. HITL alone unlocks regulated deployments. Skills alone reduces maintenance debt. Start with whichever pain point is costing you the most right now." Read the banner: "Production-grade means every step leaves behind artifacts the next step can actually use."

Global Community

Open-source ecosystem · GitHub Top 100 project

GitHub Top 100 · Open Source LLMOps

1M+

130K+ GitHub Stars

150+ Countries

1,000+ Contributors

60+ Industries

550M+ Total Downloads

Next Steps

You do not need all three on day one — pick the pain point that hurts most and start today.

Try the HITL Node

Drop a Human Input node into any workflow in the latest Dify release and add your first human checkpoint.

Available now

Explore Agent Skills

Extract your most-copied SOP into your first Skill and experience the efficiency of reuse and versioning.

Coming soon

Join the Community

Star the repo, join Discord, and help shape the future of agent systems with developers worldwide.

langgenius/dify

Thank You

Questions, feedback, or want to explore a specific feature deeper?

GitHub

langgenius/dify

banana@dify.ai

Discord Community

Scan to join and keep the conversation going around Dify and agent systems.

crazywoola (Banana)

Developer Relations @ Dify

Transition to Q&A: "That's the talk." Then: "Questions? Anything you want to go deeper on? I'm also happy to discuss your specific production setup; come find me after." Keep energy up: the best conversations happen in the 10 minutes after the talk ends.

Common questions to be ready for:
- "How does HITL handle timeouts if a reviewer is unavailable?" → Configurable timeout per node; routes to a fallback branch automatically.
- "Is Memory Extraction a separate LLM call I pay for?" → Yes, but lightweight by design (<500 tokens, <1s). Can be disabled with fallback to raw text.
- "When are Skills generally available?" → Coming soon; follow the GitHub roadmap.
- "Can I use my own sandbox environment?" → Self-hosted lets you configure the sandbox; Cloud provides it managed.
