Fine-Tuning LLMs for Domain Expertise: When and How to Do It Right
The question we hear most: "Should we fine-tune or use RAG?"
The answer: Start with RAG. Fine-tune when RAG isn't enough.
When Fine-Tuning Makes Sense
Fine-tuning is worth it when:
- Domain-specific language - Medical abbreviations, legal jargon, industry acronyms
- Consistent formatting - Always output JSON, follow specific templates
- Behavioral adaptation - More concise, formal tone for healthcare vs casual for retail
- Proprietary knowledge - Internal processes, company-specific workflows
Don't fine-tune if:
- You have fewer than 1,000 high-quality examples
- Your knowledge changes weekly (use RAG)
- Prompt engineering + RAG gets you to 90%+ accuracy
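The three "don't fine-tune" conditions above can be written down as a quick pre-flight check. This is only a sketch: `ProjectProfile` and `fine_tuning_blockers` are hypothetical names introduced here, and the thresholds are simply the ones from the list.

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    """Hypothetical summary of a project, used only for this illustration."""
    num_examples: int               # high-quality labeled examples available
    knowledge_changes_weekly: bool  # does the underlying knowledge churn weekly?
    rag_accuracy: float             # accuracy of prompt engineering + RAG, 0.0-1.0


def fine_tuning_blockers(profile: ProjectProfile) -> list[str]:
    """Return the reasons (if any) to hold off on fine-tuning."""
    blockers = []
    if profile.num_examples < 1000:
        blockers.append("fewer than 1,000 high-quality examples")
    if profile.knowledge_changes_weekly:
        blockers.append("knowledge changes weekly (use RAG)")
    if profile.rag_accuracy >= 0.90:
        blockers.append("prompt engineering + RAG already reaches 90%+")
    return blockers


profile = ProjectProfile(num_examples=2500, knowledge_changes_weekly=False, rag_accuracy=0.84)
print(fine_tuning_blockers(profile) or "No blockers: fine-tuning is worth evaluating")
```

An empty list does not mean fine-tuning will pay off, only that none of the listed reasons to skip it applies.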
The Fine-Tuning Pipeline
Step 1: Prepare Training Data
Quality > quantity. Format examples carefully:
# For instruction fine-tuning (OpenAI, Llama, Mistral)
import json

training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant..."},
            {"role": "user", "content": "Patient presents with acute bronchitis..."},
            {"role": "assistant", "content": "ICD-10: J20.9, CPT: 99213..."}
        ]
    },
    # ... 1000+ more examples
]

# Save as JSONL (one JSON object per line)
with open("training.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
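Before training, it can help to read the JSONL file back and set aside a held-out slice for evaluation. The sketch below assumes the file layout from Step 1; `load_jsonl` and `train_eval_split` are hypothetical helper names, and the 10% hold-out is an arbitrary illustrative choice.

```python
import json
import os
import random
import tempfile


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def train_eval_split(examples, eval_fraction=0.1, seed=42):
    """Shuffle a copy deterministically and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Demo with 20 synthetic examples written to a temporary file
examples = [
    {"messages": [
        {"role": "user", "content": f"question {i}"},
        {"role": "assistant", "content": f"answer {i}"},
    ]}
    for i in range(20)
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    loaded = load_jsonl(path)

train, held_out = train_eval_split(loaded)
print(len(train), len(held_out))
```

A fixed seed keeps the split reproducible, so later runs are evaluated on the same held-out examples.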
Step 2: Data Quality Checklist
def validate_training_data(examples):
    """Ensure high-quality fine-tuning data"""
    issues = []
    for i, ex in enumerate(examples):
        # Look messages up by role so examples without a system prompt still work
        # (for multi-turn examples this keeps the last message of each role)
        by_role = {m["role"]: m["content"] for m in ex["messages"]}
        user_msg = by_role.get("user", "")
        assistant_msg = by_role.get("assistant", "")
        # Check length in characters (not too short, not too long)
        if len(user_msg) < 10:
            issues.append(f"Example {i}: User message too short")
        if len(assistant_msg) < 10:
            issues.append(f"Example {i}: Assistant message too short")
        if len(user_msg) > 4000:
            issues.append(f"Example {i}: User message too long")
    # Not implemented here, but worth adding:
    # Check diversity (no duplicate inputs)
    # Check formatting consistency
    # Check that assistant doesn't refuse tasks
    return issues
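The three checks that the function above leaves as comments (diversity, formatting consistency, refusals) could be filled in along these lines. This is a rough sketch, not a complete validator: `extra_quality_checks`, `first_content` and the `REFUSAL_MARKERS` phrases are all assumptions made for illustration, and a real refusal detector would need a broader phrase list or a classifier.

```python
from collections import Counter

# Hypothetical phrases that often signal a refusal; extend for your own data
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm sorry, but", "as an ai")


def first_content(example, role):
    """Return the content of the first message with the given role, or ''."""
    for msg in example["messages"]:
        if msg["role"] == role:
            return msg["content"]
    return ""


def extra_quality_checks(examples):
    issues = []

    # Diversity: flag user inputs that repeat after case/whitespace normalization
    seen = {}
    for i, ex in enumerate(examples):
        key = " ".join(first_content(ex, "user").lower().split())
        if key in seen:
            issues.append(f"Example {i}: duplicate of example {seen[key]}")
        else:
            seen[key] = i

    # Formatting consistency: every example should share the majority role sequence
    sequences = [tuple(m["role"] for m in ex["messages"]) for ex in examples]
    if sequences:
        expected = Counter(sequences).most_common(1)[0][0]
        for i, seq in enumerate(sequences):
            if seq != expected:
                issues.append(f"Example {i}: role sequence differs from the majority")

    # Refusals: the assistant turn should demonstrate the task, not decline it
    for i, ex in enumerate(examples):
        reply = first_content(ex, "assistant").lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            issues.append(f"Example {i}: assistant reply looks like a refusal")

    return issues


system = {"role": "system", "content": "You are a medical coding assistant."}
sample = [
    {"messages": [system,
                  {"role": "user", "content": "Patient presents with acute bronchitis"},
                  {"role": "assistant", "content": "ICD-10: J20.9"}]},
    {"messages": [system,
                  {"role": "user", "content": "patient presents with  acute bronchitis"},
                  {"role": "assistant", "content": "ICD-10: J20.9"}]},
    {"messages": [{"role": "user", "content": "Code a sprained ankle"},
                  {"role": "assistant", "content": "I'm sorry, but I cannot help with that."}]},
]
for issue in extra_quality_checks(sample):
    print(issue)
```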
Pro tips:
- Mix 80% domain examples + 20% general examples (prevents forgetting)
- Include edge cases and error handling
- Validate formatting consistency across all examples
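One way to get the 80/20 mix from the first tip is to keep every domain example and sample just enough general examples to make up 20% of the final set. The sketch below assumes both sets are plain Python lists; `mix_examples` is a hypothetical helper, not part of any library.

```python
import random


def mix_examples(domain, general, general_fraction=0.2, seed=0):
    """Keep all domain examples and add general ones up to the target share."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve g / (d + g) = general_fraction for g
    n_general = round(len(domain) * general_fraction / (1 - general_fraction))
    if n_general > len(general):
        raise ValueError(f"need {n_general} general examples, have {len(general)}")
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)  # avoid long runs of a single source during training
    return mixed


domain = [f"domain-{i}" for i in range(800)]
general = [f"general-{i}" for i in range(500)]
mixed = mix_examples(domain, general)
share = sum(x.startswith("general") for x in mixed) / len(mixed)
print(len(mixed), f"{share:.0%} general")
```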
Step 3: Choose Your Approach
Option A: Managed Fine-Tuning API (OpenAI)
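import time
from openai import OpenAI

# openai>=1.0 client; reads OPENAI_API_KEY from the environment
client = OpenAI()

# Upload training file
with open("training.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 1.8
    }
)

# Monitor progress until the job reaches a terminal state
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")

if job.status != "succeeded":
    raise RuntimeError(f"Fine-tuning ended with status: {job.status}")
model_name = job.fine_tuned_model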
Option B: LoRA Fine-Tuning (Llama, Mistral, OSS models)
More efficient for larger models:
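from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load base model in 8-bit (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # this tokenizer ships without a pad token

# Prepare the quantized model for training before attaching adapters
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor (update is scaled by lora_alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapter weights should train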
# Train (using Hugging Face Trainer)
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./medical-mistral-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 x 4 = 16 per device
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    eval_strategy="epoch"  # called evaluation_strategy in older transformers releases
)

# train_dataset / eval_dataset are assumed to be tokenized datasets with labels,
# prepared beforehand (not shown here)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
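To see why LoRA is cheaper, it helps to count what the configuration above actually trains. Each adapted weight of shape (d_in, d_out) gains two small matrices with r × d_in and d_out × r entries. The sketch below does that arithmetic for r=16 on the four attention projections; the layer shapes (hidden size 4096, key/value width 1024, 32 layers) are my assumptions about Mistral-7B's architecture, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, module_shapes, num_layers):
    """Count LoRA parameters: r * (d_in + d_out) per adapted weight, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


# Assumed Mistral-7B attention shapes (grouped-query attention narrows K and V)
mistral_attention = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

trainable = lora_param_count(r=16, module_shapes=mistral_attention, num_layers=32)
print(f"{trainable:,} trainable LoRA parameters")  # 13,631,488

# Scaling applied to the adapter output, and the effective batch size per device
scaling = 32 / 16        # lora_alpha / r
effective_batch = 4 * 4  # per_device_train_batch_size * gradient_accumulation_steps
print(scaling, effective_batch)  # 2.0 16
```

Under these assumptions the adapters amount to roughly 0.2% of a 7B-parameter model, which is where the memory savings come from.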
Option C: Modern Alignment (DPO, ORPO, RLHF)
After SFT, alignment teaches the model which of two candidate responses is preferred. DPO (Direct Preference Optimization) has become a common substitute for classic PPO-based RLHF in production pipelines: it is simpler, tends to train more stably, and doesn't need a separate reward model. ORPO goes further by combining SFT and preference alignment into a single training stage.
# DPO with TRL (simpler and more stable than RLHF)
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset

# Preference dataset: each row has prompt, chosen, rejected
preference_data = Dataset.from_dict({
    "prompt": ["Patient with acute bronchitis, no fever..."],
    "chosen": ["ICD-10: J20.9 (acute bronchitis, unspecified). CPT: 99213..."],
    "rejected": ["The patient has bronchitis. Code it appropriately."],
})

# Start from the SFT checkpoint produced in the previous stage
model = AutoModelForCausalLM.from_pretrained("./medical-mistral-sft")
tokenizer = AutoTokenizer.from_pretrained("./medical-mistral-sft")

dpo_config = DPOConfig(
    output_dir="./medical-mistral-dpo",
    beta=0.1,  # KL penalty: lower = more deviation allowed
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-7,
    loss_type="sigmoid",  # standard DPO loss; "ipo" and "hinge" are alternatives
)

trainer = DPOTrainer(
    model=model,  # with no ref_model given, TRL uses a frozen copy as the reference
    args=dpo_config,
    train_dataset=preference_data,
    processing_class=tokenizer,  # older TRL releases take tokenizer=tokenizer instead
)
trainer.train()
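For intuition about what `beta` and the "sigmoid" loss do, here is the per-example DPO objective written out in plain Python. It is a sketch of the published formula, not TRL's implementation: the inputs are assumed to be summed log-probabilities of each full response under the policy and the frozen reference model, and `dpo_loss` is a hypothetical function name.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet, loss = ln(2)
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has moved toward the chosen answer and away from the rejected one
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(round(start, 4), round(improved, 4))
```

A larger `beta` makes the same log-probability shift count for more, so the loss drops faster but the policy is also pulled harder by each preference pair.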
Choosing an alignment method:
| Method | When to use | Notes |
|--------|-------------|-------|
| SFT only | You have demonstrations, no preferences | Fast, simple, baseline |
| DPO | You have pairwise preferences (chosen/rejected) | Default choice: stable, no reward model |
| ORPO | You want to skip the SFT step | Combines SFT + preference into one pass |
| KTO | You only have binary feedback (good/bad) | No need for paired data |
| RLHF (PPO) | You need a reward model for online generation | Most complex; use when DPO underperforms |
| RLAIF / Constitutional AI | You can't get human labels at scale | LLM-as-judge generates the preferences |
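The practical difference between the DPO and KTO rows is the shape of the data. As a sketch, pairwise rows can be flattened into per-response rows with a good/bad label; the `prompt` / `completion` / `label` field names follow the unpaired layout I understand TRL's KTO trainer to expect, so check them against the TRL version you use. `pairs_to_unpaired` is a hypothetical helper.

```python
def pairs_to_unpaired(preference_rows):
    """Turn {prompt, chosen, rejected} rows into two labeled rows each."""
    unpaired = []
    for row in preference_rows:
        unpaired.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        unpaired.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return unpaired


pairs = [{
    "prompt": "Patient with acute bronchitis, no fever...",
    "chosen": "ICD-10: J20.9 (acute bronchitis, unspecified). CPT: 99213...",
    "rejected": "The patient has bronchitis. Code it appropriately.",
}]
for row in pairs_to_unpaired(pairs):
    print(row["label"], "|", row["completion"])
```

Going the other way is not possible in general, since thumbs-up/thumbs-down logs rarely contain two responses to the same prompt; that is the situation the KTO row in the table addresses.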
Modern toolchain we recommend:
- TRL (Hugging Face): SFT, DPO, ORPO, KTO, PPO trainers in one library
- Axolotl: YAML-driven config for production fine-tuning runs
- Unsloth: advertises 2-5× faster training with lower VRAM (great for QLoRA)
- LLaMA-Factory: broad model support, web UI for experimentation
- DeepSpeed / FSDP: multi-GPU sharding for larger models
Step 4: Evaluation & Testing
Never skip this. Fine-tuned models can regress on general tasks:
from openai import OpenAI

client = OpenAI()  # openai>=1.0 client

eval_prompts = {
    "domain_task": "Translate this diagnosis to ICD-10...",
    "general_reasoning": "Explain why the sky is blue...",
    "math": "What is 15% of 240?",
    "formatting": "Output valid JSON for..."
}

def query_model(model, prompt):
    """Helper to query any OpenAI chat model"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # reduce run-to-run variation when comparing models
    )
    return response.choices[0].message.content

# Test both base and fine-tuned model
for task, prompt in eval_prompts.items():
    base_response = query_model("gpt-4o-mini", prompt)
    finetuned_response = query_model(model_name, prompt)  # model_name from training
    print(f"{task}:")
    print(f"  Base: {base_response}")
    print(f"  Fine-tuned: {finetuned_response}")
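Eyeballing side-by-side outputs does not scale past a handful of prompts. A minimal scoring sketch follows, assuming each test suite is a list of (prompt, expected answer) pairs and that exact string match is an acceptable metric for the task. `exact_match_accuracy`, `regression_report` and the 2-point tolerance are hypothetical choices, and the two `predict` functions are stubs standing in for real model calls.

```python
def exact_match_accuracy(predict, labeled_examples):
    """Share of prompts whose prediction equals the expected answer exactly."""
    correct = sum(
        1 for prompt, expected in labeled_examples
        if predict(prompt).strip() == expected.strip()
    )
    return correct / len(labeled_examples)


def regression_report(base_predict, tuned_predict, suites, tolerance=0.02):
    """Compare both models per suite and flag drops larger than the tolerance."""
    report = {}
    for name, examples in suites.items():
        base = exact_match_accuracy(base_predict, examples)
        tuned = exact_match_accuracy(tuned_predict, examples)
        report[name] = {"base": base, "tuned": tuned, "regressed": tuned < base - tolerance}
    return report


# Stub predictors standing in for real model calls
base_answers = {"Code acute bronchitis": "J20", "What is 15% of 240?": "36"}
tuned_answers = {"Code acute bronchitis": "J20.9", "What is 15% of 240?": "35"}

suites = {
    "domain": [("Code acute bronchitis", "J20.9")],
    "math": [("What is 15% of 240?", "36")],
}
report = regression_report(base_answers.get, tuned_answers.get, suites)
for name, row in report.items():
    print(name, row)
```

In this toy run the fine-tuned stub wins on the domain suite but is flagged as regressed on math, which is the pattern the general-task prompts are there to catch.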
Hybrid: Fine-Tuning + RAG
The strongest results often come from combining both: fine-tuning for format and domain language, RAG for facts that change:
from openai import OpenAI

client = OpenAI()  # openai>=1.0 client

def retrieve_top_k(question, k=3):
    """Retrieve relevant documents using vector search"""
    # Get query embedding
    embedding = client.embeddings.create(
        input=question,
        model="text-embedding-ada-002"
    ).data[0].embedding
    # Query vector database (example with hypothetical DB)
    # results = vector_db.search(embedding, limit=k)
    # return "\n\n".join([doc.content for doc in results])
    # Placeholder for demo
    return "Relevant medical coding guidelines from knowledge base..."

def hybrid_query(question, finetuned_model_id):
    """Fine-tuned model + RAG context"""
    # 1. Retrieve relevant docs (RAG)
    context = retrieve_top_k(question, k=3)
    # 2. Query fine-tuned model with context
    prompt = f"""Context:
{context}

Question: {question}

Provide a detailed answer using both the context and your medical coding expertise."""
    response = client.chat.completions.create(
        model=finetuned_model_id,  # e.g., "ft:gpt-4o-mini-2024-07-18:org::jobid"
        messages=[
            {"role": "system", "content": "You are a certified medical coder..."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
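The vector-database call in `retrieve_top_k` is left as a placeholder above. For small document sets, the search step can be approximated in memory with cosine similarity, as in the sketch below. The three-dimensional vectors are toy values standing in for real embeddings, and `cosine_similarity` / `top_k` are hypothetical helpers rather than any vector database's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding, documents, k=3):
    """documents: list of (text, embedding) pairs. Returns the k closest texts."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy embeddings; real ones would come from the embeddings endpoint
documents = [
    ("Bronchitis coding guideline", [0.9, 0.1, 0.0]),
    ("Fracture coding guideline", [0.0, 0.2, 0.9]),
    ("Respiratory E/M visit rules", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]
context = "\n\n".join(top_k(query, documents, k=2))
print(context)
```

A linear scan like this is fine for a few thousand documents; beyond that, an approximate nearest-neighbor index is the usual replacement.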
Cost-Benefit Analysis
Fine-tuning costs (OpenAI GPT-4o-mini):
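- Training: ~$0.003 per 1K tokens ($3 per 1M) at launch list pricing; check the current pricing page before budgeting
- Inference: roughly 2× the base model's per-token price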
When it pays off:
- High-volume applications (>100K queries/month)
- Reduced prompt length (no need for huge few-shot examples)
- Improved accuracy on domain tasks (fewer retries)
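Whether the volume threshold is met comes down to simple arithmetic: a one-off training cost against a per-query saving from shorter prompts. The sketch below shows the calculation. Every number in it is a placeholder assumption (token counts, prices, prompt lengths), and `training_cost` / `break_even_queries` are hypothetical helpers; substitute your own measurements and the provider's current prices.

```python
import math


def training_cost(tokens_per_epoch, epochs, price_per_1k_tokens):
    """One-off cost of a fine-tuning run, billed per trained token."""
    return tokens_per_epoch * epochs / 1000 * price_per_1k_tokens


def break_even_queries(train_cost, cost_per_query_before, cost_per_query_after):
    """Queries needed before per-query savings cover the training cost."""
    saving = cost_per_query_before - cost_per_query_after
    if saving <= 0:
        return None  # fine-tuned queries are not cheaper, so cost never breaks even
    return math.ceil(train_cost / saving)


# Placeholder assumptions, not quoted prices
TRAIN_PRICE_PER_1K = 0.003       # dollars per 1K training tokens
BASE_INPUT_PER_1K = 0.00015      # dollars per 1K input tokens, base model
TUNED_INPUT_PER_1K = 0.0003      # dollars per 1K input tokens, fine-tuned model

cost = training_cost(tokens_per_epoch=500_000, epochs=3, price_per_1k_tokens=TRAIN_PRICE_PER_1K)
before = 3000 / 1000 * BASE_INPUT_PER_1K   # long few-shot prompt on the base model
after = 500 / 1000 * TUNED_INPUT_PER_1K    # short prompt on the fine-tuned model

print(f"Training cost: ${cost:.2f}")
print(f"Break-even after about {break_even_queries(cost, before, after):,} queries")
```

This only counts input tokens; savings from fewer retries, or extra costs from data preparation and evaluation, would shift the break-even point in either direction.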
Real-World Results
One client case study (healthcare):
- Before: RAG with GPT-4o (89% accuracy on medical coding)
- After: Fine-tuned GPT-4o-mini + RAG (96% accuracy)
- Bonus: 4× cost reduction by moving to mini model
- ROI: Paid back training costs in 2 weeks
Quick Start Checklist
- [ ] Collect 1000+ high-quality examples
- [ ] Validate data formatting and diversity
- [ ] Mix domain + general examples (80/20)
- [ ] Start with small model (GPT-4o-mini or Llama-7B)
- [ ] Evaluate on held-out test set
- [ ] Test for catastrophic forgetting
- [ ] Monitor production metrics
Need help fine-tuning for your domain? Let's talk.