Harmony: Automated Questionnaire Data Extraction • Plexibit

Challenge

Harmony needed to process large volumes of questionnaires and turn them into clean, queryable datasets. Manual data entry was slow, error‑prone, and couldn't keep up with growing document volumes or the variety of layouts coming in from different programs.

Solution

We built a structured extraction pipeline that combines layout‑aware parsing, LLM extraction with schema validation, and a lightweight review queue for low‑confidence cases.

Layout‑aware ingestion. Documents are parsed with a structure‑aware pipeline that preserves question groupings, response options, and demographic blocks across native PDFs, scanned documents, and web forms.
Schema‑constrained extraction. A typed Pydantic schema defines respondent metadata, question IDs, response categories, and free‑text fields. The LLM extractor returns JSON validated against the schema; failures route to a repair step before retry.
Question classification. A multi‑label classifier tags each question (multiple choice, open‑ended, Likert, demographic, etc.) so downstream analytics can pivot consistently across programs.
Evaluation and human review. A golden set of fully labeled questionnaires powers regression tests on every model or prompt change. Per‑field confidence scores route low‑confidence extractions to a reviewer UI; corrections flow back into the eval set.
Observability. Run‑level traces capture inputs, intermediate parses, and final outputs so any field in any record can be audited back to its source span.
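
To make the schema‑constrained extraction step concrete, here is a minimal sketch of a validate‑then‑repair loop. It is an illustration under stated assumptions, not the production code: a plain dataclass with hand‑written checks stands in for the Pydantic schema so the example has no dependencies, and the names `Answer`, `parse_answer`, `extract_with_repair`, the field names, and the `repair` callback (which in practice might be a second LLM call that receives the validation error) are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical category names; the real schema's values are not given in the text.
ALLOWED_TYPES = {"multiple_choice", "open_ended", "likert", "demographic"}


class SchemaError(ValueError):
    """Raised when extractor output does not match the expected schema."""


@dataclass(frozen=True)
class Answer:
    question_id: str
    question_type: str
    response: str


def parse_answer(raw: str) -> Answer:
    """Validate one JSON object from the extractor against the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")
    for key in ("question_id", "question_type", "response"):
        if not isinstance(data.get(key), str):
            raise SchemaError(f"field {key!r} missing or not a string")
    if data["question_type"] not in ALLOWED_TYPES:
        raise SchemaError(f"unknown question_type {data['question_type']!r}")
    return Answer(data["question_id"], data["question_type"], data["response"])


def extract_with_repair(
    raw: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 2,
) -> Answer:
    """Validate; on failure, hand the output and error to `repair` and retry."""
    for attempt in range(max_attempts + 1):
        try:
            return parse_answer(raw)
        except SchemaError as exc:
            if attempt == max_attempts:
                raise  # repair budget exhausted; the caller can route to review
            raw = repair(raw, str(exc))
    raise AssertionError("unreachable")


def toy_repair(raw: str, error: str) -> str:
    """Stand-in for an LLM repair call: normalizes the question_type casing."""
    data = json.loads(raw)
    data["question_type"] = str(data.get("question_type", "")).lower()
    return json.dumps(data)


bad_output = '{"question_id": "Q7", "question_type": "LIKERT", "response": "4"}'
answer = extract_with_repair(bad_output, toy_repair)
```

Passing the validation error message to the repair step, rather than simply re‑running the extractor, gives the repair call something specific to fix. Capping the number of attempts keeps a persistently malformed record from looping and lets it fall through to human review instead.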

Results

95% reduction in manual data entry time.
98% extraction accuracy for key entities on the held‑out evaluation set.
Throughput of 1,000+ questionnaires per hour vs. roughly 10 per hour manually.
Eliminated transcription errors and enabled near real‑time analysis of incoming questionnaires.

Technical Highlights

Layout‑aware document parsing with table and grouping preservation.
Schema‑constrained LLM extraction with automatic repair on validation failure.
Multi‑label question‑type classifier for consistent downstream analytics.
Golden‑set evaluations in CI; confidence‑based routing to a human review queue.
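
The last highlight pairs two mechanisms, confidence‑based routing and a golden‑set regression check, which can be sketched as below. This is a simplified illustration: the `0.85` threshold, the `FieldResult` shape, the exact‑match accuracy metric, and the function names are assumptions for the example, not details of the deployed system.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off; the real value is not stated


@dataclass(frozen=True)
class FieldResult:
    record_id: str
    field: str
    value: str
    confidence: float


def route(results, threshold=REVIEW_THRESHOLD):
    """Split extractions into auto-accepted and needs-human-review lists."""
    accepted, review = [], []
    for result in results:
        (accepted if result.confidence >= threshold else review).append(result)
    return accepted, review


def field_accuracy(predicted: dict, golden: dict) -> float:
    """Fraction of golden fields whose predicted value matches exactly."""
    if not golden:
        raise ValueError("golden set is empty")
    correct = sum(1 for key, value in golden.items() if predicted.get(key) == value)
    return correct / len(golden)


results = [
    FieldResult("r1", "age_band", "25-34", 0.97),
    FieldResult("r1", "q3_response", "Agree", 0.62),
]
accepted, review = route(results)

# Keys are (record_id, field) pairs; values are the labeled ground truth.
golden = {
    ("r1", "age_band"): "25-34",
    ("r1", "q3_response"): "Agree",
    ("r2", "age_band"): "35-44",
    ("r2", "q3_response"): "Neutral",
}
predicted = dict(golden)
predicted[("r2", "q3_response")] = "Agree"  # one simulated extraction error
accuracy = field_accuracy(predicted, golden)
```

In a CI job, a check such as `assert accuracy >= baseline` would fail the build when a model or prompt change regresses the golden set, where `baseline` is whatever accuracy the previous release achieved.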

