Production AI Agents with LangGraph and MCP: A Practical Architecture Guide
Agent demos look magical. Production agents look like distributed systems with an LLM in the loop — and they fail in all the ways distributed systems fail, plus a few new ones.
After shipping agents into BFSI, healthcare, and manufacturing workflows, our default stack is LangGraph for orchestration, MCP for tools, Langfuse for observability, and explicit guardrails at every boundary. Here's the playbook.
Single-Agent vs Multi-Agent: Pick the Boring One First
The first decision is the one teams get wrong most often.
Use a single agent when:
- The task fits in one mental model (research, ticket triage, document Q&A)
- Tools are fewer than 15 and roles don't overlap
- Latency budget is under ~10 seconds end-to-end
- You can describe the workflow as a flowchart on one page
Use multi-agent when:
- You have genuinely separable specialties (planner / coder / reviewer)
- Context windows blow up if one agent owns everything
- Sub-tasks can run in parallel and merge cleanly
- You need different models per role (cheap router → expensive reasoner)
In practice, 80% of "we need multi-agent" turns out to be "we need a better state machine." Multi-agent buys flexibility at the cost of token spend, latency, and entire new failure modes (agents looping, agents disagreeing, agents hallucinating each other's outputs). Start with one agent and a tight graph.
LangGraph: Treat the Agent as a State Machine
LangGraph's value is that it forces you to draw the graph. Nodes are deterministic Python functions. Edges are conditional. State is explicit and typed. The LLM is just one node among many.
The snippet below is a minimal runnable demo — copy it into agent.py, pip install langgraph langchain-openai, set OPENAI_API_KEY, and run python agent.py. The vector store, checkpointer, and handoff are stubbed so you can swap in real implementations one at a time.
# agent.py — minimal runnable LangGraph demo
import os
from typing import Annotated, Literal, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, AIMessage, HumanMessage
# --- Stubs you would replace in production -------------------------------
class FakeVectorStore:
def similarity_search(self, query: str, k: int = 5):
return [
{"id": "doc-1", "content": f"Example context for: {query}"},
{"id": "doc-2", "content": "Additional supporting passage."},
][:k]
vector_store = FakeVectorStore()
checkpointer = MemorySaver() # swap for PostgresSaver / RedisSaver in prod
def human_handoff(state: "AgentState") -> "AgentState":
return {"messages": [AIMessage(content="[Routed to a human reviewer.]")]}
# -------------------------------------------------------------------------
class AgentState(TypedDict):
messages: Annotated[list, add_messages]
retrieved_docs: list[dict]
tool_calls_made: int
needs_human: bool
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def retrieve(state: AgentState) -> AgentState:
"""Pull supporting context before the LLM ever sees the question."""
query = state["messages"][-1].content
docs = vector_store.similarity_search(query, k=5)
return {"retrieved_docs": docs}
def reason(state: AgentState) -> AgentState:
context = "\n\n".join(d["content"] for d in state["retrieved_docs"])
system = SystemMessage(
content=f"Answer using ONLY this context. If unsure, say so.\n\n{context}"
)
response = llm.invoke([system, *state["messages"]])
return {
"messages": [response],
"tool_calls_made": state["tool_calls_made"] + 1,
}
def tool_executor(state: AgentState) -> AgentState:
# Stand-in: a real executor dispatches state["messages"][-1].tool_calls
return {"messages": [AIMessage(content="[tool result placeholder]")]}
def route(state: AgentState) -> Literal["tools", "handoff", "end"]:
last = state["messages"][-1]
if state["tool_calls_made"] >= 6:
return "handoff" # hard ceiling on agent loops
if state.get("needs_human"):
return "handoff"
if isinstance(last, AIMessage) and getattr(last, "tool_calls", None):
return "tools"
return "end"
graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("reason", reason)
graph.add_node("tools", tool_executor)
graph.add_node("handoff", human_handoff)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "reason")
graph.add_conditional_edges(
"reason", route,
{"tools": "tools", "handoff": "handoff", "end": END},
)
graph.add_edge("tools", "reason")
graph.add_edge("handoff", END)
app = graph.compile(checkpointer=checkpointer)
if __name__ == "__main__":
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY first"
result = app.invoke(
{
"messages": [HumanMessage(content="Summarize the retrieved context.")],
"retrieved_docs": [],
"tool_calls_made": 0,
"needs_human": False,
},
config={"configurable": {"thread_id": "demo-1"}},
)
print(result["messages"][-1].content)
A few things to notice:
- State is a typed dict, not a free-form scratchpad. You can serialize, replay, and audit every transition.
- There is a hard cap on tool calls (
tool_calls_made >= 6). Without this, your agent will happily burn $40 in tokens deciding it needs to read one more document. - Human handoff is a first-class node, not an exception path. Regulated workflows need this.
- The checkpointer (Postgres or Redis) means runs are resumable. If a tool fails, you don't replay the LLM from scratch.
MCP: Stop Wrapping Tools by Hand
Every team we've worked with has the same pattern: a tools/ folder full of bespoke @tool decorators wrapping the same APIs as last quarter's project. Model Context Protocol (MCP) kills this. Tools live behind a standard server, and any MCP-aware client (Claude Desktop, Cursor, your LangGraph agent) can call them.
The win is operational: tools get versioned, permissioned, and observed in one place instead of duplicated across five agent codebases.
# mcp_server.py — exposes internal tools over MCP
from mcp.server.fastmcp import FastMCP
import httpx
mcp = FastMCP("plexibit-internal")
@mcp.tool()
async def lookup_customer(customer_id: str) -> dict:
"""Fetch customer record from CRM. Read-only."""
async with httpx.AsyncClient() as client:
r = await client.get(f"https://crm.internal/api/customers/{customer_id}")
r.raise_for_status()
return r.json()
@mcp.tool()
async def create_ticket(subject: str, body: str, priority: str = "normal") -> dict:
"""Create a support ticket. Requires human approval for priority='high'."""
if priority == "high":
raise PermissionError("High-priority tickets require human approval")
async with httpx.AsyncClient() as client:
r = await client.post(
"https://helpdesk.internal/api/tickets",
json={"subject": subject, "body": body, "priority": priority},
)
return r.json()
if __name__ == "__main__":
mcp.run(transport="stdio")
Wiring it into LangGraph (runnable as python mcp_agent.py after pip install langgraph langchain-openai langchain-mcp-adapters and starting mcp_server.py on PATH):
# mcp_agent.py — minimal runnable MCP + LangGraph demo
import asyncio
import os
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
from langchain_mcp_adapters.client import MultiServerMCPClient
class AgentState(TypedDict):
messages: Annotated[list, add_messages]
needs_human: bool
async def build_app():
mcp_client = MultiServerMCPClient({
"internal": {
"command": "python",
"args": ["mcp_server.py"],
"transport": "stdio",
},
# "search": {"url": "https://search.internal/mcp",
# "transport": "streamable_http"},
})
tools = await mcp_client.get_tools()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0).bind_tools(tools)
async def reason(state: AgentState) -> AgentState:
response = await llm.ainvoke(state["messages"])
return {"messages": [response]}
async def tool_executor(state: AgentState) -> AgentState:
last = state["messages"][-1]
results: list = []
for call in getattr(last, "tool_calls", []) or []:
tool = next(t for t in tools if t.name == call["name"])
try:
result = await tool.ainvoke(call["args"])
except PermissionError as e:
return {
"needs_human": True,
"messages": [AIMessage(content=str(e))],
}
results.append(ToolMessage(content=str(result), tool_call_id=call["id"]))
return {"messages": results}
def route(state: AgentState):
last = state["messages"][-1]
if state.get("needs_human"):
return END
if isinstance(last, AIMessage) and getattr(last, "tool_calls", None):
return "tools"
return END
graph = StateGraph(AgentState)
graph.add_node("reason", reason)
graph.add_node("tools", tool_executor)
graph.add_edge(START, "reason")
graph.add_conditional_edges("reason", route, {"tools": "tools", END: END})
graph.add_edge("tools", "reason")
return graph.compile()
async def main():
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY first"
app = await build_app()
result = await app.ainvoke({
"messages": [HumanMessage(content="Look up customer 42 and summarize.")],
"needs_human": False,
})
print(result["messages"][-1].content)
if __name__ == "__main__":
asyncio.run(main())
Practical tips:
- Run one MCP server per trust boundary (read-only data, write-capable systems, external APIs). Don't mix.
- Tools that mutate state should refuse to run without an approval token in their args.
- Cache tool schemas at startup. Re-fetching on every request is a real latency cost.
Observability: If You Can't Trace It, You Can't Ship It
Agents fail silently. A node returns wrong context, the LLM confidently summarizes it, the user gets garbage. Without traces, you find out from a support ticket three weeks later.
Langfuse (self-hosted or cloud) gives per-step traces, token cost, latency, and user feedback in one place, and it integrates cleanly with LangGraph via the callback handler.
# observed_agent.py — drop-in observability wrapper for the demo above
import asyncio
import os
from langchain_core.messages import HumanMessage
from langfuse.callback import CallbackHandler
from mcp_agent import build_app # from the previous snippet
async def main():
langfuse_handler = CallbackHandler(
public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
secret_key=os.environ["LANGFUSE_SECRET_KEY"],
host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)
app = await build_app()
user_query = "Look up customer 42 and summarize."
session_id, user_id, tenant_id = "demo-session", "user-1", "tenant-acme"
result = await app.ainvoke(
{"messages": [HumanMessage(content=user_query)], "needs_human": False},
config={
"callbacks": [langfuse_handler],
"configurable": {
"thread_id": session_id,
"user_id": user_id,
"metadata": {"tenant": tenant_id, "agent_version": "v1.4.2"},
},
},
)
print(result["messages"][-1].content)
if __name__ == "__main__":
asyncio.run(main())
What we actually look at in Langfuse weekly:
- Tool-call distribution — if one tool dominates, the prompt is broken
- Loops per run — anything above the median is a candidate for graph changes
- Cost per resolved task, not cost per call — the only metric finance cares about
- Thumbs-down traces — the fastest path to a better eval set
Pair this with LangSmith or Langfuse evals running on every PR. Treat regressions in agent behavior the way you'd treat a failing unit test.
Guardrails: Defense in Depth, Not a Single Filter
Guardrails are not "add a profanity filter and call it done." We layer them at four points:
- Input — schema validation, prompt-injection detection (e.g. Llama Guard, Prompt Guard, or a small classifier), PII redaction before the LLM ever sees the message
- Tool boundary — every tool re-validates its args; destructive tools require an explicit approval token; rate limits per user and per tool
- Model output — JSON schema enforcement (Pydantic), policy checks, citation requirements ("every claim must reference a retrieved doc id")
- Action boundary — for anything that mutates state, dry-run first, log the diff, require human approval over a threshold
# guardrail.py — runnable demo of a citation-grounded output validator
from typing import TypedDict
from pydantic import BaseModel, Field, ValidationError
from langchain_core.messages import AIMessage
class AgentState(TypedDict, total=False):
messages: list
retrieved_docs: list[dict]
needs_human: bool
class AgentResponse(BaseModel):
answer: str = Field(min_length=1, max_length=2000)
citations: list[str] = Field(min_length=1)
confidence: float = Field(ge=0.0, le=1.0)
requires_human_review: bool
def validate_output(state: AgentState) -> AgentState:
raw = state["messages"][-1].content
try:
parsed = AgentResponse.model_validate_json(raw)
except ValidationError:
return {"needs_human": True}
valid_ids = {d["id"] for d in state.get("retrieved_docs", [])}
if not set(parsed.citations).issubset(valid_ids):
return {"needs_human": True} # hallucinated a citation
if parsed.confidence < 0.6:
return {"needs_human": True}
return {"messages": [AIMessage(content=parsed.answer)]}
if __name__ == "__main__":
sample = AIMessage(content='{"answer":"Bronchitis maps to J20.9.",'
'"citations":["doc-1"],"confidence":0.92,'
'"requires_human_review":false}')
state: AgentState = {
"messages": [sample],
"retrieved_docs": [{"id": "doc-1", "content": "ICD-10 J20.9 ..."}],
"needs_human": False,
}
print(validate_output(state))
The cheapest, most effective guardrail we deploy: require structured output with citations to retrieved doc IDs. Hallucinations drop sharply because the model has to ground every claim in something it was actually shown.
A Production Checklist
Before an agent goes live, we walk through this list:
- [ ] Graph drawn on one page, every node and edge labelled
- [ ] Hard ceiling on tool calls, loops, and total tokens per run
- [ ] State is typed and persisted to a checkpointer (Postgres/Redis)
- [ ] Tools live behind MCP servers split by trust boundary
- [ ] Mutating tools require approval tokens; rate-limited per user
- [ ] Langfuse (or equivalent) traces every node with cost and latency
- [ ] Eval set with ≥50 representative cases runs on every PR
- [ ] Output validated against a Pydantic schema with required citations
- [ ] Human-handoff path is a first-class node, not an exception
- [ ] Runbook for: agent loops, tool outage, model regression, cost spike
What This Buys You
The teams that ship reliable agents share one habit: they treat the LLM as the smallest, most replaceable part of the system. The graph, the tools, the traces, and the guardrails are where the engineering happens. The model is a swappable component you'll upgrade three times this year anyway.
If you're past prototypes and trying to put agents in front of real users — especially in regulated industries — let's talk. We've made these mistakes already so you don't have to.