Production AI Agents with LangGraph and MCP: A Practical Architecture Guide

5/10/2026

Agent demos look magical. Production agents look like distributed systems with an LLM in the loop — and they fail in all the ways distributed systems fail, plus a few new ones.

After shipping agents into BFSI, healthcare, and manufacturing workflows, our default stack is LangGraph for orchestration, MCP for tools, Langfuse for observability, and explicit guardrails at every boundary. Here's the playbook.

Single-Agent vs Multi-Agent: Pick the Boring One First

The first decision is the one teams get wrong most often.

Use a single agent when:

  - The task is one workflow with a bounded tool set and a clear done-state
  - You can draw the control flow as a graph before writing any prompt
  - Latency and token budgets are tight

Use multi-agent when:

  - Subtasks genuinely need different contexts, prompts, or permissions
  - Work can proceed in parallel and be merged deterministically
  - A single context window cannot hold the task

In practice, 80% of "we need multi-agent" turns out to be "we need a better state machine." Multi-agent buys flexibility at the cost of token spend, latency, and entirely new failure modes (agents looping, agents disagreeing, agents hallucinating each other's outputs). Start with one agent and a tight graph.
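A hedged sketch of what "a better state machine" often means in practice: a deterministic intent-to-node routing table over one agent's graph, instead of a second agent. The intent labels and node names below are hypothetical, not from any library:

```python
# Hypothetical illustration: many "multi-agent" designs collapse into a
# deterministic routing table. Each "specialist" is a plain function (a
# graph node), not another agent with its own LLM loop.
from typing import Callable

def billing_node(query: str) -> str:
    return f"[billing flow] {query}"

def refund_node(query: str) -> str:
    return f"[refund flow] {query}"

def fallback_node(query: str) -> str:
    return f"[general flow] {query}"

ROUTES: dict[str, Callable[[str], str]] = {
    "billing": billing_node,
    "refund": refund_node,
}

def route(intent: str, query: str) -> str:
    # One cheap classification (rules or a small model) picks the node;
    # no inter-agent chatter, no second token budget.
    return ROUTES.get(intent, fallback_node)(query)
```

The point of the sketch: the dispatch is auditable and testable without an LLM in the loop, which is most of what "multi-agent" was supposed to buy.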

LangGraph: Treat the Agent as a State Machine

LangGraph's value is that it forces you to draw the graph. Nodes are deterministic Python functions. Edges are conditional. State is explicit and typed. The LLM is just one node among many.

The snippet below is a minimal runnable demo — copy it into agent.py, pip install langgraph langchain-openai, set OPENAI_API_KEY, and run python agent.py. The vector store, checkpointer, and handoff are stubbed so you can swap in real implementations one at a time.

# agent.py — minimal runnable LangGraph demo
import os
from typing import Annotated, Literal, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, AIMessage, HumanMessage

# --- Stubs you would replace in production -------------------------------
class FakeVectorStore:
    def similarity_search(self, query: str, k: int = 5):
        return [
            {"id": "doc-1", "content": f"Example context for: {query}"},
            {"id": "doc-2", "content": "Additional supporting passage."},
        ][:k]

vector_store = FakeVectorStore()
checkpointer = MemorySaver()   # swap for PostgresSaver / RedisSaver in prod

def human_handoff(state: "AgentState") -> "AgentState":
    return {"messages": [AIMessage(content="[Routed to a human reviewer.]")]}
# -------------------------------------------------------------------------

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    retrieved_docs: list[dict]
    tool_calls_made: int
    needs_human: bool

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def retrieve(state: AgentState) -> AgentState:
    """Pull supporting context before the LLM ever sees the question."""
    query = state["messages"][-1].content
    docs = vector_store.similarity_search(query, k=5)
    return {"retrieved_docs": docs}

def reason(state: AgentState) -> AgentState:
    context = "\n\n".join(d["content"] for d in state["retrieved_docs"])
    system = SystemMessage(
        content=f"Answer using ONLY this context. If unsure, say so.\n\n{context}"
    )
    response = llm.invoke([system, *state["messages"]])
    return {
        "messages": [response],
        "tool_calls_made": state["tool_calls_made"] + 1,
    }

def tool_executor(state: AgentState) -> AgentState:
    # Stand-in: a real executor dispatches state["messages"][-1].tool_calls
    return {"messages": [AIMessage(content="[tool result placeholder]")]}

def route(state: AgentState) -> Literal["tools", "handoff", "end"]:
    last = state["messages"][-1]
    if state["tool_calls_made"] >= 6:
        return "handoff"            # hard ceiling on agent loops
    if state.get("needs_human"):
        return "handoff"
    if isinstance(last, AIMessage) and getattr(last, "tool_calls", None):
        return "tools"
    return "end"

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("reason", reason)
graph.add_node("tools", tool_executor)
graph.add_node("handoff", human_handoff)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "reason")
graph.add_conditional_edges(
    "reason", route,
    {"tools": "tools", "handoff": "handoff", "end": END},
)
graph.add_edge("tools", "reason")
graph.add_edge("handoff", END)

app = graph.compile(checkpointer=checkpointer)

if __name__ == "__main__":
    assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY first"
    result = app.invoke(
        {
            "messages": [HumanMessage(content="Summarize the retrieved context.")],
            "retrieved_docs": [],
            "tool_calls_made": 0,
            "needs_human": False,
        },
        config={"configurable": {"thread_id": "demo-1"}},
    )
    print(result["messages"][-1].content)

A few things to notice:

  - The LLM is one node among four; retrieval runs before the model ever sees the question
  - route() enforces a hard ceiling (tool_calls_made >= 6), so the agent cannot loop forever
  - needs_human is a first-class state field: escalation is an edge in the graph, not an afterthought
  - The checkpointer plus thread_id makes every conversation resumable and inspectable

MCP: Stop Wrapping Tools by Hand

Every team we've worked with has the same pattern: a tools/ folder full of bespoke @tool decorators wrapping the same APIs as last quarter's project. Model Context Protocol (MCP) kills this. Tools live behind a standard server, and any MCP-aware client (Claude Desktop, Cursor, your LangGraph agent) can call them.

The win is operational: tools get versioned, permissioned, and observed in one place instead of duplicated across five agent codebases.

# mcp_server.py — exposes internal tools over MCP
from mcp.server.fastmcp import FastMCP
import httpx

mcp = FastMCP("plexibit-internal")

@mcp.tool()
async def lookup_customer(customer_id: str) -> dict:
    """Fetch customer record from CRM. Read-only."""
    async with httpx.AsyncClient() as client:
        r = await client.get(f"https://crm.internal/api/customers/{customer_id}")
        r.raise_for_status()
        return r.json()

@mcp.tool()
async def create_ticket(subject: str, body: str, priority: str = "normal") -> dict:
    """Create a support ticket. Requires human approval for priority='high'."""
    if priority == "high":
        raise PermissionError("High-priority tickets require human approval")
    async with httpx.AsyncClient() as client:
        r = await client.post(
            "https://helpdesk.internal/api/tickets",
            json={"subject": subject, "body": body, "priority": priority},
        )
        return r.json()

if __name__ == "__main__":
    mcp.run(transport="stdio")

Wiring it into LangGraph (runnable as python mcp_agent.py after pip install langgraph langchain-openai langchain-mcp-adapters, with mcp_server.py in the same directory; the client spawns the server over stdio, so you don't start it separately):

# mcp_agent.py — minimal runnable MCP + LangGraph demo
import asyncio
import os
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
from langchain_mcp_adapters.client import MultiServerMCPClient


class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    needs_human: bool


async def build_app():
    mcp_client = MultiServerMCPClient({
        "internal": {
            "command": "python",
            "args": ["mcp_server.py"],
            "transport": "stdio",
        },
        # "search": {"url": "https://search.internal/mcp",
        #            "transport": "streamable_http"},
    })
    tools = await mcp_client.get_tools()
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0).bind_tools(tools)

    async def reason(state: AgentState) -> AgentState:
        response = await llm.ainvoke(state["messages"])
        return {"messages": [response]}

    async def tool_executor(state: AgentState) -> AgentState:
        last = state["messages"][-1]
        results: list = []
        for call in getattr(last, "tool_calls", []) or []:
            tool = next(t for t in tools if t.name == call["name"])
            try:
                result = await tool.ainvoke(call["args"])
            except PermissionError as e:
                return {
                    "needs_human": True,
                    "messages": [AIMessage(content=str(e))],
                }
            results.append(ToolMessage(content=str(result), tool_call_id=call["id"]))
        return {"messages": results}

    def route(state: AgentState):
        last = state["messages"][-1]
        if state.get("needs_human"):
            return END
        if isinstance(last, AIMessage) and getattr(last, "tool_calls", None):
            return "tools"
        return END

    graph = StateGraph(AgentState)
    graph.add_node("reason", reason)
    graph.add_node("tools", tool_executor)
    graph.add_edge(START, "reason")
    graph.add_conditional_edges("reason", route, {"tools": "tools", END: END})
    graph.add_edge("tools", "reason")
    return graph.compile()


async def main():
    assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY first"
    app = await build_app()
    result = await app.ainvoke({
        "messages": [HumanMessage(content="Look up customer 42 and summarize.")],
        "needs_human": False,
    })
    print(result["messages"][-1].content)


if __name__ == "__main__":
    asyncio.run(main())

Practical tips:

  - Keep read-only and mutating tools in separate servers (or at least separate namespaces) so permissions stay coarse and auditable
  - Raise PermissionError (or equivalent) from the tool for anything needing approval, and let the graph route it to a human, as tool_executor does above
  - Version tool schemas like any other API; agents are brittle consumers of argument names
  - Put timeouts and rate limits at the server, not in each agent codebase
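On timeouts specifically: a slow or flaky tool server should never stall the whole graph. A minimal sketch of a per-call timeout with bounded retries; the helper name and retry policy are our own, not part of the MCP SDK (tool_fn stands in for a tool's ainvoke):

```python
import asyncio

async def call_with_retry(tool_fn, args: dict, *, timeout: float = 10.0,
                          retries: int = 2) -> object:
    """Invoke an async tool with a per-call timeout and bounded retries.

    Hypothetical helper: swap the naive exponential sleep for real
    backoff-with-jitter in production.
    """
    last_exc: Exception | None = None
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(tool_fn(args), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as e:
            last_exc = e
            await asyncio.sleep(0.1 * (2 ** attempt))  # simple backoff
    raise RuntimeError(f"tool failed after {retries + 1} attempts") from last_exc
```

Wrap the call site in tool_executor with this, and a dead CRM endpoint becomes a routed error instead of a hung conversation.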

Observability: If You Can't Trace It, You Can't Ship It

Agents fail silently. A node returns wrong context, the LLM confidently summarizes it, the user gets garbage. Without traces, you find out from a support ticket three weeks later.

Langfuse (self-hosted or cloud) gives per-step traces, token cost, latency, and user feedback in one place, and it integrates cleanly with LangGraph via the callback handler.

# observed_agent.py — drop-in observability wrapper for the demo above
import asyncio
import os
from langchain_core.messages import HumanMessage
from langfuse.callback import CallbackHandler

from mcp_agent import build_app   # from the previous snippet


async def main():
    langfuse_handler = CallbackHandler(
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    )

    app = await build_app()
    user_query = "Look up customer 42 and summarize."

    result = await app.ainvoke(
        {"messages": [HumanMessage(content=user_query)], "needs_human": False},
        # Session, user, and tenant ride along as LangChain run metadata;
        # the Langfuse handler reads the langfuse_session_id / langfuse_user_id
        # keys from there, not from "configurable".
        config={
            "callbacks": [langfuse_handler],
            "metadata": {
                "langfuse_session_id": "demo-session",
                "langfuse_user_id": "user-1",
                "tenant": "tenant-acme",
                "agent_version": "v1.4.2",
            },
        },
    )
    print(result["messages"][-1].content)


if __name__ == "__main__":
    asyncio.run(main())

What we actually look at in Langfuse weekly:

  - Token cost per session, sliced by tenant, to catch runaway loops early
  - Per-node latency, since one slow tool usually dominates the p95
  - Traces flagged by negative user feedback, read end to end
  - Handoff rate: how often the agent escalates to a human, and why

Pair this with LangSmith or Langfuse evals running on every PR. Treat regressions in agent behavior the way you'd treat a failing unit test.
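A minimal sketch of what "evals on every PR" can look like with no framework at all: golden cases asserting output invariants, runnable under pytest. The cases and helper below are illustrative; in practice the golden set grows from real past failures:

```python
# Hypothetical PR-gate eval: golden cases asserting one invariant we care
# about, citation grounding. In CI this runs under pytest on every PR.
import json

GOLDEN_CASES = [
    # (raw model output, retrieved doc ids, should_pass)
    ('{"answer": "ok", "citations": ["doc-1"]}', {"doc-1"}, True),
    ('{"answer": "ok", "citations": ["doc-9"]}', {"doc-1"}, False),  # hallucinated id
    ('not json at all', {"doc-1"}, False),
]

def passes_grounding(raw: str, valid_ids: set[str]) -> bool:
    """True iff the output parses and every citation is a retrieved doc id."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    cites = parsed.get("citations", [])
    return bool(cites) and set(cites).issubset(valid_ids)

def test_grounding_regressions():
    for raw, ids, expected in GOLDEN_CASES:
        assert passes_grounding(raw, ids) == expected
```

The habit matters more than the harness: a behavior regression should block a merge exactly like a failing unit test.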

Guardrails: Defense in Depth, Not a Single Filter

Guardrails are not "add a profanity filter and call it done." We layer them at four points:

  1. Input — schema validation, prompt-injection detection (e.g. Llama Guard, Prompt Guard, or a small classifier), PII redaction before the LLM ever sees the message
  2. Tool boundary — every tool re-validates its args; destructive tools require an explicit approval token; rate limits per user and per tool
  3. Model output — JSON schema enforcement (Pydantic), policy checks, citation requirements ("every claim must reference a retrieved doc id")
  4. Action boundary — for anything that mutates state, dry-run first, log the diff, require human approval over a threshold

# guardrail.py — runnable demo of a citation-grounded output validator
from typing import TypedDict
from pydantic import BaseModel, Field, ValidationError
from langchain_core.messages import AIMessage


class AgentState(TypedDict, total=False):
    messages: list
    retrieved_docs: list[dict]
    needs_human: bool


class AgentResponse(BaseModel):
    answer: str = Field(min_length=1, max_length=2000)
    citations: list[str] = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    requires_human_review: bool


def validate_output(state: AgentState) -> AgentState:
    raw = state["messages"][-1].content
    try:
        parsed = AgentResponse.model_validate_json(raw)
    except ValidationError:
        return {"needs_human": True}

    valid_ids = {d["id"] for d in state.get("retrieved_docs", [])}
    if not set(parsed.citations).issubset(valid_ids):
        return {"needs_human": True}        # hallucinated a citation
    if parsed.confidence < 0.6:
        return {"needs_human": True}
    return {"messages": [AIMessage(content=parsed.answer)]}


if __name__ == "__main__":
    sample = AIMessage(content='{"answer":"Bronchitis maps to J20.9.",'
                               '"citations":["doc-1"],"confidence":0.92,'
                               '"requires_human_review":false}')
    state: AgentState = {
        "messages": [sample],
        "retrieved_docs": [{"id": "doc-1", "content": "ICD-10 J20.9 ..."}],
        "needs_human": False,
    }
    print(validate_output(state))

The cheapest, most effective guardrail we deploy: require structured output with citations to retrieved doc IDs. Hallucinations drop sharply because the model has to ground every claim in something it was actually shown.
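Layer 4, the action boundary, deserves its own sketch: compute the diff before any mutating call, log it, and gate large changes behind approval. The helper name and threshold below are illustrative, not a library API:

```python
# Hypothetical action-boundary guard: dry-run the mutation, log the diff,
# and refuse large unapproved changes.
def apply_update(current: dict, proposed: dict, *, approved: bool = False,
                 max_unapproved_changes: int = 2) -> dict:
    """Return the updated record, or raise if the diff needs approval."""
    diff = {k: (current.get(k), v) for k, v in proposed.items()
            if current.get(k) != v}
    print(f"[dry-run] {len(diff)} field(s) would change: {sorted(diff)}")
    if len(diff) > max_unapproved_changes and not approved:
        raise PermissionError(
            f"{len(diff)} changes exceed threshold; needs human approval")
    return {**current, **proposed}
```

The PermissionError is the same escalation signal the MCP tools raise above, so the graph's existing human-handoff edge handles it for free.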

A Production Checklist

Before an agent goes live, we walk through this list:

  - Every conversation is traced end to end, with cost and latency per node
  - The graph has a hard iteration ceiling and a human-handoff path that actually reaches someone
  - All four guardrail layers (input, tool boundary, model output, action boundary) are in place
  - Evals run on every PR, and a behavior regression blocks the merge
  - Destructive tools require approval; everything else is rate-limited per user and per tool
  - There is a rollback plan for the prompt, the graph, and the model version

What This Buys You

The teams that ship reliable agents share one habit: they treat the LLM as the smallest, most replaceable part of the system. The graph, the tools, the traces, and the guardrails are where the engineering happens. The model is a swappable component you'll upgrade three times this year anyway.

If you're past prototypes and trying to put agents in front of real users — especially in regulated industries — let's talk. We've made these mistakes already so you don't have to.