Harmony Health · Healthcare
Harmony: Automated Questionnaire Data Extraction
Structured extraction pipeline with schema validation and human‑in‑the‑loop review hit 98% key‑entity accuracy and cut manual entry time by 95%.
- Entity extraction accuracy: 98%
- Reduction in manual entry time: 95%
- Questionnaires processed per hour: 1,000+
Challenge
Harmony needed to process large volumes of questionnaires and turn them into clean, queryable datasets. Manual data entry was slow, error‑prone, and couldn't keep up with growing document volumes or the variety of layouts coming in from different programs.
Solution
We built a structured extraction pipeline that combines layout‑aware parsing, LLM extraction with schema validation, and a lightweight review queue for low‑confidence cases.
- Layout‑aware ingestion. Documents are parsed with a structure‑aware pipeline that preserves question groupings, response options, and demographic blocks across PDF, scanned, and web‑form sources.
- Schema‑constrained extraction. A typed Pydantic schema defines respondent metadata, question IDs, response categories, and free‑text fields. The LLM extractor returns JSON validated against the schema; failures route to a repair step before retry.
- Question classification. A multi‑label classifier tags each question (multiple choice, open‑ended, Likert, demographic, etc.) so downstream analytics can pivot consistently across programs.
- Evaluation and human review. A golden set of fully labeled questionnaires powers regression tests on every model or prompt change. Per‑field confidence scores route low‑confidence extractions to a reviewer UI; corrections flow back into the eval set.
- Observability. Run‑level traces capture inputs, intermediate parses, and final outputs so any field in any record can be audited back to its source span.
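As an illustration of the validate, repair, retry loop described under schema‑constrained extraction, here is a minimal sketch. It is not the production code: a standard‑library dataclass and a hand‑written validator stand in for the Pydantic schema so the example has no dependencies, and the field names, the `ALLOWED_TYPES` set, and the `toy_repair` function (which would be an LLM repair prompt in practice) are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical question types; the real schema is not shown in this case study.
ALLOWED_TYPES = {"multiple_choice", "open_ended", "likert", "demographic"}


@dataclass(frozen=True)
class Answer:
    question_id: str
    question_type: str
    response: str


class SchemaError(ValueError):
    """Raised when extractor output does not match the expected shape."""


def parse_answer(raw: str) -> Answer:
    """Parse one JSON object and check it against the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")
    for key in ("question_id", "question_type", "response"):
        if not isinstance(data.get(key), str):
            raise SchemaError(f"field {key!r} must be a string")
    if data["question_type"] not in ALLOWED_TYPES:
        raise SchemaError(f"unknown question_type {data['question_type']!r}")
    return Answer(data["question_id"], data["question_type"], data["response"])


def extract_with_repair(
    raw: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Answer:
    """Validate raw output; on failure, pass it to `repair` and try again."""
    for attempts_left in range(max_attempts, 0, -1):
        try:
            return parse_answer(raw)
        except SchemaError as exc:
            if attempts_left == 1:
                raise  # out of attempts: surface the last validation error
            raw = repair(raw, str(exc))
    raise ValueError("max_attempts must be at least 1")


def toy_repair(raw: str, error: str) -> str:
    """Stand-in for an LLM repair step: only fixes single-quoted JSON."""
    return raw.replace("'", '"')


bad_output = "{'question_id': 'Q7', 'question_type': 'likert', 'response': '4'}"
answer = extract_with_repair(bad_output, toy_repair)
print(answer)
```

The repair callable receives both the failing output and the validation message, so a real repair prompt could quote the exact error back to the model. Capping the number of attempts keeps a persistently malformed record from looping; in a pipeline like the one described, such a record would be a natural candidate for the review queue.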
Results
- 95% reduction in manual data entry time.
- 98% extraction accuracy for key entities on the held‑out evaluation set.
- Throughput of 1,000+ questionnaires per hour vs. roughly 10 per hour manually.
- Removed manual transcription as a source of errors and enabled near real‑time analysis of incoming questionnaires.
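For readers who want to see how an accuracy figure like the one above can be computed against a golden set, here is a minimal sketch. It assumes accuracy means exact match per key field, averaged over all key fields in all golden documents; the case study does not state the exact definition, and the function name, field names, and sample records are hypothetical.

```python
from typing import Mapping, Sequence

Record = Mapping[str, str]


def key_field_accuracy(
    predicted: Mapping[str, Record],
    golden: Mapping[str, Record],
    key_fields: Sequence[str],
) -> float:
    """Fraction of golden key fields whose predicted value matches exactly."""
    correct = 0
    total = 0
    for doc_id, gold_record in golden.items():
        pred_record = predicted.get(doc_id, {})
        for field in key_fields:
            if field not in gold_record:
                continue  # field not labeled for this document
            total += 1
            if pred_record.get(field) == gold_record[field]:
                correct += 1
    return correct / total if total else 0.0


# Hypothetical golden labels and extractor output for two documents.
golden = {
    "doc-1": {"respondent_id": "R001", "program": "A"},
    "doc-2": {"respondent_id": "R002", "program": "B"},
}
predicted = {
    "doc-1": {"respondent_id": "R001", "program": "A"},
    "doc-2": {"respondent_id": "R002", "program": "C"},  # one wrong field
}

accuracy = key_field_accuracy(predicted, golden, ["respondent_id", "program"])
print(f"{accuracy:.2%}")
```

A regression check in CI could compare this number against the last accepted baseline and fail the build when it drops, which is one way to run the golden set on every model or prompt change.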
Technical Highlights
- Layout‑aware document parsing with table and grouping preservation.
- Schema‑constrained LLM extraction with automatic repair on validation failure.
- Multi‑label question‑type classifier for consistent downstream analytics.
- Golden‑set evaluations in CI; confidence‑based routing to a human review queue.
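Confidence‑based routing can be sketched in a few lines. This is an illustrative example under stated assumptions, not the deployed logic: the `FieldExtraction` type, the field names, and the 0.85 threshold are hypothetical, and the case study does not say how confidence scores are produced or where the cut‑off sits.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off; a real one would be tuned on the golden set


@dataclass(frozen=True)
class FieldExtraction:
    name: str
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def route_fields(
    fields: Sequence[FieldExtraction],
    threshold: float = REVIEW_THRESHOLD,
) -> Tuple[List[FieldExtraction], List[FieldExtraction]]:
    """Split fields into (auto_accepted, needs_review) by confidence."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


extracted = [
    FieldExtraction("respondent_id", "R001", 0.99),
    FieldExtraction("q12_free_text", "prefers evening visits", 0.62),
    FieldExtraction("q3_likert", "4", 0.91),
]

accepted, needs_review = route_fields(extracted)
print([f.name for f in accepted])
print([f.name for f in needs_review])
```

Routing per field rather than per document means a reviewer only sees the uncertain values, which fits the lightweight review queue described above.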