Northwind Capital: DPO‑Aligned Research Agents for Buy‑Side Analysts
DPO‑fine‑tuned Llama 3.1 70B served on vLLM, orchestrated by a LangGraph multi‑agent workflow — 41% accuracy lift on analyst Q&A, p95 latency under 2.4s, and 63% lower inference cost vs. the prior GPT‑4 baseline.
- Analyst Q&A accuracy lift: +41%
- p95 end‑to‑end latency: 2.4s
- Inference cost reduction: 63%
- Throughput per GPU: 3.2×
Challenge
Northwind Capital, a mid‑market asset manager, needed an internal research copilot that could synthesize 10‑Ks, earnings transcripts, broker notes, and proprietary models on demand. Their first‑generation RAG assistant — a single GPT‑4 call over a vector store — was costly, slow during earnings season, and frequently produced answers that disagreed with the firm's house view on accounting treatments and risk framing. Compliance also required every answer to carry a verifiable citation trail and pass a documented evaluation suite before release.
Solution
We rebuilt the system as an aligned, agentic platform deployed inside Northwind's private VPC.
- Preference data + DPO fine‑tuning. We worked with eight senior analysts to label 14k pairwise preferences over draft answers (house‑view conformity, citation discipline, hedging language). Using TRL's DPO trainer on Llama 3.1 70B Instruct with QLoRA adapters, we shifted the base model toward Northwind's analytical voice without retraining from scratch. A second DPO pass targeted refusal behavior on regulated topics (MNPI, personalized investment advice).
- LangGraph multi‑agent orchestration. A supervisor graph routes each query across specialist nodes — Filings Retriever, Transcript Analyst, Quant Lookup, Risk Reviewer, Citation Auditor — with explicit state, retries, and human‑in‑the‑loop checkpoints for trades‑desk‑adjacent questions. Tool calls go through MCP servers wrapping the internal data warehouse, Bloomberg, and the firm's factor library, so every tool boundary is auditable.
- Hybrid retrieval. BM25 over OpenSearch fused with dense embeddings in Qdrant using reciprocal rank fusion; a cross‑encoder reranker (bge‑reranker‑v2) trims to the top 8 chunks per node. Filings are chunked with a structure‑aware splitter that keeps tables intact.
- vLLM serving. The aligned 70B model runs on vLLM with tensor parallelism across 4× H100, paged attention, prefix caching for system prompts, and speculative decoding using a 7B draft model. Two replicas behind a token‑aware router give us headroom through earnings windows.
- Continuous evaluation. A Ragas suite (faithfulness, answer relevancy, context precision/recall) plus a 320‑question golden set graded by an LLM‑as‑judge runs on every model and prompt change. Langfuse captures every trace in production with PII scrubbing; weekly drift reports are reviewed by the compliance lead.
- Guardrails. Input/output filters block MNPI patterns and personalized advice; the Citation Auditor node fails closed if any quantitative claim lacks a resolvable source span.
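The pairwise preference labels described above map onto the prompt/chosen/rejected record format that TRL's DPO trainer consumes. A minimal sketch with an illustrative record and a validation helper; the field contents are invented for illustration, not real labeled data:

```python
def validate_preference_record(record: dict) -> bool:
    """Check that a labeled pair carries the three non-empty string fields
    ("prompt", "chosen", "rejected") that TRL's DPO data format expects."""
    required = ("prompt", "chosen", "rejected")
    return all(isinstance(record.get(k), str) and record[k].strip() for k in required)

# Illustrative record: the "chosen" answer follows house-view framing and
# cites a source span; the "rejected" answer editorializes without a citation.
example = {
    "prompt": "How should we treat the one-time gain in Q3 operating income?",
    "chosen": "Per the house view, exclude the one-time gain from run-rate "
              "operating income [10-K 2024, MD&A].",
    "rejected": "Operating income grew strongly, so the stock looks attractive.",
}

assert validate_preference_record(example)
assert not validate_preference_record({"prompt": "x", "chosen": "y"})
```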
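The supervisor's routing decision can be sketched as a plain function. The keyword rules below are illustrative stand-ins for the LangGraph conditional edges used in production, and the node identifiers are assumed names:

```python
def route(query: str) -> list[str]:
    """Toy supervisor routing sketch: pick specialist nodes by keyword,
    then always append review and audit stages. Real routing is a
    LangGraph supervisor graph, not keyword matching."""
    q = query.lower()
    nodes = []
    if "10-k" in q or "filing" in q:
        nodes.append("filings_retriever")
    if "transcript" in q or "earnings call" in q:
        nodes.append("transcript_analyst")
    if "factor" in q or "exposure" in q:
        nodes.append("quant_lookup")
    # Every answer passes risk review and citation audit before release.
    nodes.append("risk_reviewer")
    nodes.append("citation_auditor")
    return nodes
```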
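Reciprocal rank fusion, used above to merge the BM25 and dense result lists, scores each document as the sum of 1/(k + rank) across rankers. A self-contained sketch with the conventional k = 60:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1/(k + rank),
    where rank is the 1-based position of d in each list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["f1", "f2", "f3"]   # lexical ranking
dense_hits = ["f2", "f4", "f1"]  # embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# f2 wins: it ranks high in both lists.
```

In production the fused list would then go to the cross-encoder reranker, which trims to the top 8 chunks.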
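The serving setup above roughly corresponds to a `vllm serve` launch. This is a config sketch only: the model id is a hypothetical placeholder, and the speculative-decoding flags are omitted because their names vary across vLLM versions:

```python
# Config sketch of the serving launch; not the production deployment.
aligned_model = "northwind/llama-3.1-70b-dpo"  # hypothetical model id

serve_cmd = [
    "vllm", "serve", aligned_model,
    "--tensor-parallel-size", "4",  # shard the 70B across 4x H100
    "--enable-prefix-caching",      # reuse KV cache for shared system prompts
]
```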
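The fail-closed citation audit can be sketched as a check that every sentence containing a number carries a bracketed citation resolving to a retrieved span. The `[source]` convention and helper below are assumptions for illustration, not the production implementation:

```python
import re

def audit_citations(answer: str, resolvable_spans: set[str]) -> bool:
    """Fail closed: every sentence containing a digit must cite at least one
    bracketed span, and every cited span must resolve against the evidence."""
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        if re.search(r"\d", sentence):
            cited = re.findall(r"\[([^\]]+)\]", sentence)
            if not cited or not all(c in resolvable_spans for c in cited):
                return False  # block the answer rather than release it
    return True

spans = {"10-K-2024:p47"}
assert audit_citations("Revenue rose 12% year over year [10-K-2024:p47].", spans)
assert not audit_citations("Revenue rose 12% year over year.", spans)
```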
Results
- 41% lift in analyst Q&A accuracy on the 320‑question golden set vs. the prior GPT‑4 RAG baseline; Ragas faithfulness rose from 78.6% to 92.4%.
- p95 latency 2.4s end‑to‑end for single‑hop questions; 6.1s for multi‑agent deep‑dives — down from 9–14s previously.
- 63% reduction in per‑query inference cost after moving to self‑hosted vLLM.
- 3.2× throughput per GPU after enabling prefix caching and speculative decoding.
- Zero compliance escalations across the first 90 days of production use; 100% of answers carried passing citation audits.
- Adopted by 74 analysts and PMs; median session now replaces ~35 minutes of manual filing review.
Technical Highlights
- TRL DPO with QLoRA on Llama 3.1 70B; β = 0.1, two‑stage curriculum (style, then safety).
- LangGraph supervisor with typed state, durable checkpointing, and per‑node retry policies.
- MCP servers expose warehouse, Bloomberg, and factor‑library tools with per‑tool authz.
- vLLM with tensor parallel = 4, prefix caching, and 7B draft model for speculative decoding.
- Ragas + golden‑set evals wired into CI; promotion gates on faithfulness ≥ 0.90 and context precision ≥ 0.85.
- Langfuse traces with PII scrubbing; weekly drift and refusal‑rate reports.
- Deployed in Northwind's private VPC with VPC‑only egress and KMS‑encrypted adapters.
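The promotion gates above reduce to a simple threshold check in CI. A minimal sketch, with metric names following Ragas and thresholds taken from the highlights above; the function name is illustrative:

```python
def promotion_gate(metrics: dict[str, float]) -> bool:
    """CI gate sketch: block model/prompt promotion unless every eval
    metric clears its floor (faithfulness >= 0.90, context precision >= 0.85).
    A missing metric counts as a failure."""
    thresholds = {"faithfulness": 0.90, "context_precision": 0.85}
    return all(metrics.get(name, 0.0) >= floor for name, floor in thresholds.items())

assert promotion_gate({"faithfulness": 0.92, "context_precision": 0.88})
assert not promotion_gate({"faithfulness": 0.89, "context_precision": 0.90})
```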
Building aligned agents for a regulated workflow?
We design DPO‑aligned, evaluated, and audited agent platforms that ship inside your VPC.