Fine-Tuning LLMs for Domain Expertise: When and How to Do It Right
12/10/2024
The question we hear most: "Should we fine-tune or use RAG?"
The answer: Start with RAG. Fine-tune when RAG isn't enough.
When Fine-Tuning Makes Sense
Fine-tuning is worth it when:
- Domain-specific language - Medical abbreviations, legal jargon, industry acronyms
- Consistent formatting - Always output JSON, follow specific templates
- Behavioral adaptation - e.g., a concise, formal tone for healthcare vs. a casual one for retail
- Proprietary knowledge - Internal processes, company-specific workflows
Don't fine-tune if:
- You have fewer than 1,000 high-quality examples
- Your knowledge changes weekly (use RAG)
- Prompt engineering + RAG gets you to 90%+ accuracy
The Fine-Tuning Pipeline
Step 1: Prepare Training Data
Quality > quantity. Format examples carefully:
# For instruction fine-tuning (OpenAI, Llama, Mistral)
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant..."},
            {"role": "user", "content": "Patient presents with acute bronchitis..."},
            {"role": "assistant", "content": "ICD-10: J20.9, CPT: 99213..."}
        ]
    },
    # ... 1000+ more examples
]

# Save as JSONL (one JSON object per line)
import json

with open("training.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
Step 2: Data Quality Checklist
def validate_training_data(examples):
    """Ensure high-quality fine-tuning data."""
    issues = []
    seen_inputs = set()
    for i, ex in enumerate(examples):
        user_msg = ex["messages"][1]["content"]
        assistant_msg = ex["messages"][2]["content"]
        # Check length (not too short, not too long)
        if len(user_msg) < 10:
            issues.append(f"Example {i}: User message too short")
        if len(assistant_msg) < 10:
            issues.append(f"Example {i}: Assistant message too short")
        if len(user_msg) > 4000:
            issues.append(f"Example {i}: User message too long")
        # Check diversity (no duplicate inputs)
        if user_msg in seen_inputs:
            issues.append(f"Example {i}: Duplicate user message")
        seen_inputs.add(user_msg)
        # Check that the assistant doesn't refuse the task
        if assistant_msg.lower().startswith(("i can't", "i cannot", "i'm sorry")):
            issues.append(f"Example {i}: Assistant refuses the task")
    # Formatting consistency is a whole-set property; see the pro tips below
    return issues
Pro tips:
- Mix 80% domain examples + 20% general examples to prevent catastrophic forgetting (a mixing sketch follows this list)
- Include edge cases and error handling
- Validate formatting consistency across all examples
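A minimal sketch of that 80/20 mix, assuming `domain_examples` and `general_examples` are lists in the message format from Step 1 (both names are placeholders):

import random

def mix_datasets(domain_examples, general_examples, domain_ratio=0.8, seed=42):
    """Combine domain and general examples at a fixed ratio, then shuffle."""
    rng = random.Random(seed)
    # For an 80/20 split, take 1 general example for every 4 domain examples
    n_general = int(len(domain_examples) * (1 - domain_ratio) / domain_ratio)
    mixed = domain_examples + rng.sample(general_examples, min(n_general, len(general_examples)))
    rng.shuffle(mixed)
    return mixed

The fixed seed keeps the mix reproducible, so different training runs can be compared fairly.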
Step 3: Choose Your Approach
Option A: Managed Fine-Tuning (OpenAI, or Anthropic via Amazon Bedrock)
import time
from openai import OpenAI

client = OpenAI()

# Upload training file
file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune"
)

# Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 1.8
    }
)

# Poll until the job finishes (don't loop forever on a failed job)
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")

model_name = job.fine_tuned_model  # e.g., "ft:gpt-4o-mini-2024-07-18:org::jobid"
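Once the job succeeds, the returned model ID works anywhere a model name does. A quick smoke test, reusing the `client` from above:

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a medical coding assistant..."},
        {"role": "user", "content": "Patient presents with acute bronchitis..."}
    ]
)
print(response.choices[0].message.content)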
Option B: LoRA Fine-Tuning (Llama, Mistral, OSS models)
LoRA trains small low-rank adapter matrices instead of all the weights, which makes it far cheaper for larger open-weights models:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)

# Load base model in 8-bit (quantization_config replaces the deprecated load_in_8bit flag)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = prepare_model_for_kbit_training(model)  # stabilizes quantized training

# Configure LoRA
lora_config = LoraConfig(
    r=16,              # rank of the low-rank update matrices
    lora_alpha=32,     # scaling factor (effective scale = alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Train with the Hugging Face Trainer
# (train_dataset / eval_dataset are your tokenized splits)
training_args = TrainingArguments(
    output_dir="./medical-mistral-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    eval_strategy="epoch"  # "evaluation_strategy" in transformers < 4.41
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
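After training, only the small adapter weights need to be saved. For deployment you can merge them back into an unquantized copy of the base model; a sketch using the peft API:

# Save just the LoRA adapter (tens of MB, not the full 7B weights)
trainer.save_model("./medical-mistral-lora/final")

# Later: load the adapter onto the base model and merge for plain inference
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto"
)
merged = PeftModel.from_pretrained(base, "./medical-mistral-lora/final").merge_and_unload()
merged.save_pretrained("./medical-mistral-merged")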
Step 4: Evaluation & Testing
Never skip this. Fine-tuned models can regress on general tasks:
from openai import OpenAI

client = OpenAI()

eval_prompts = {
    "domain_task": "Translate this diagnosis to ICD-10...",
    "general_reasoning": "Explain why the sky is blue...",
    "math": "What is 15% of 240?",
    "formatting": "Output valid JSON for..."
}

def query_model(model, prompt):
    """Helper to query any OpenAI model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Compare base and fine-tuned responses side by side
for task, prompt in eval_prompts.items():
    base_response = query_model("gpt-4o-mini", prompt)
    finetuned_response = query_model(model_name, prompt)  # model_name from training
    print(f"{task}:")
    print(f"  Base: {base_response}")
    print(f"  Fine-tuned: {finetuned_response}")
Hybrid: Fine-Tuning + RAG
The best approach often combines both: the fine-tuned model supplies domain behavior and output format, while retrieval supplies fresh knowledge:
from openai import OpenAI

client = OpenAI()

def retrieve_top_k(question, k=3):
    """Retrieve relevant documents using vector search."""
    # Get query embedding
    embedding = client.embeddings.create(
        input=question,
        model="text-embedding-ada-002"
    ).data[0].embedding
    # Query vector database (example with a hypothetical DB)
    # results = vector_db.search(embedding, limit=k)
    # return "\n\n".join([doc.content for doc in results])
    # Placeholder for demo
    return "Relevant medical coding guidelines from knowledge base..."

def hybrid_query(question, finetuned_model_id):
    """Fine-tuned model + RAG context."""
    # 1. Retrieve relevant docs (RAG)
    context = retrieve_top_k(question, k=3)
    # 2. Query the fine-tuned model with the retrieved context
    prompt = f"""Context:
{context}

Question: {question}

Provide a detailed answer using both the context and your medical coding expertise."""
    response = client.chat.completions.create(
        model=finetuned_model_id,  # e.g., "ft:gpt-4o-mini-2024-07-18:org::jobid"
        messages=[
            {"role": "system", "content": "You are a certified medical coder..."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
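Calling it, with the model ID returned by the fine-tuning job:

answer = hybrid_query(
    "Which ICD-10 code applies to acute bronchitis, unspecified organism?",
    finetuned_model_id=model_name
)
print(answer)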
Cost-Benefit Analysis
Fine-tuning costs (OpenAI GPT-4o-mini, list prices as of late 2024; check the current pricing page):
- Training: ~$3 per 1M training tokens
- Inference: roughly 2× the base model's per-token rate
When it pays off (a back-of-the-envelope check follows this list):
- High-volume applications (>100K queries/month)
- Reduced prompt length (no need for huge few-shot examples)
- Improved accuracy on domain tasks (fewer retries)
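A sketch of that check. Every number below is an illustrative assumption (late-2024 list prices, guessed token counts); substitute your own:

# All numbers are illustrative assumptions; substitute your own
queries_per_month = 100_000
base_rate = 0.15 / 1e6      # $/input token, base GPT-4o-mini (late-2024 list price)
ft_rate = 0.30 / 1e6        # $/input token, fine-tuned GPT-4o-mini (~2x base)

base_prompt_tokens = 2_000  # instructions + few-shot examples on every query
ft_prompt_tokens = 300      # fine-tuned model no longer needs the few-shot block

base_monthly = queries_per_month * base_prompt_tokens * base_rate
ft_monthly = queries_per_month * ft_prompt_tokens * ft_rate
training_cost = 2_000_000 * 3.00 / 1e6  # ~2M training tokens at ~$3/1M

print(f"Base prompts:  ${base_monthly:,.2f}/month")
print(f"Fine-tuned:    ${ft_monthly:,.2f}/month")
print(f"Training cost: ${training_cost:,.2f} one-time")
print(f"Payback:       {training_cost / (base_monthly - ft_monthly):.2f} months")

With these assumptions, the shorter prompts alone repay the training run within the first month; output-token savings and fewer retries pull the break-even earlier still.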
Real-World Results
One client case study (healthcare):
- Before: RAG with GPT-4o (89% accuracy on medical coding)
- After: Fine-tuned GPT-4o-mini + RAG (96% accuracy)
- Bonus: 4× cost reduction by moving to mini model
- ROI: Paid back training costs in 2 weeks
Quick Start Checklist
- [ ] Collect 1000+ high-quality examples
- [ ] Validate data formatting and diversity
- [ ] Mix domain + general examples (80/20)
- [ ] Start with small model (GPT-4o-mini or Llama-7B)
- [ ] Evaluate on held-out test set
- [ ] Test for catastrophic forgetting
- [ ] Monitor production metrics
Need help fine-tuning for your domain? Let's talk.