Fine-Tuning LLMs for Domain Expertise: When and How to Do It Right

12/10/2024

The question we hear most: "Should we fine-tune or use RAG?"

The answer: Start with RAG. Fine-tune when RAG isn't enough.

When Fine-Tuning Makes Sense

Fine-tuning is worth it when:

  1. Domain-specific language - Medical abbreviations, legal jargon, industry acronyms
  2. Consistent formatting - Always output JSON, follow specific templates
  3. Behavioral adaptation - A consistent voice, e.g. concise and formal for healthcare vs casual for retail
  4. Proprietary knowledge - Internal processes, company-specific workflows

Don't fine-tune if:

The Fine-Tuning Pipeline

Step 1: Prepare Training Data

Quality > quantity. Format examples carefully:

# For instruction fine-tuning (OpenAI, Llama, Mistral)
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant..."},
            {"role": "user", "content": "Patient presents with acute bronchitis..."},
            {"role": "assistant", "content": "ICD-10: J20.9, CPT: 99213..."}
        ]
    },
    # ... 1000+ more examples
]

# Save as JSONL
import json
with open("training.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
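
Before training, it usually helps to hold out part of the data for evaluation. The sketch below is one way to do that with the standard library only; the helper names (`split_examples`, `write_jsonl`, `read_jsonl`) and the 10% default are illustrative assumptions, not part of any fine-tuning API.

```python
import json
import random


def split_examples(examples, eval_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and return (train, eval) lists."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back with `read_jsonl` after writing is a cheap way to confirm every line parses before uploading it anywhere.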

Step 2: Data Quality Checklist

def validate_training_data(examples):
    """Flag examples that are likely to hurt fine-tuning quality"""
    issues = []

    for i, ex in enumerate(examples):
        # Look messages up by role so examples without a system prompt still work
        # (for multi-turn examples this keeps the last message of each role)
        by_role = {m["role"]: m["content"] for m in ex["messages"]}
        user_msg = by_role.get("user", "")
        assistant_msg = by_role.get("assistant", "")

        # Check length in characters (not too short, not too long)
        if len(user_msg) < 10:
            issues.append(f"Example {i}: User message too short")
        if len(assistant_msg) < 10:
            issues.append(f"Example {i}: Assistant message too short")
        if len(user_msg) > 4000:
            issues.append(f"Example {i}: User message too long")

        # Not implemented here:
        # - Check diversity (no duplicate inputs)
        # - Check formatting consistency
        # - Check that assistant doesn't refuse tasks

    return issues
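
The three checks left as comments above (diversity, formatting consistency, refusals) could be sketched roughly as follows. The function name `extra_quality_checks` and the `REFUSAL_MARKERS` list are hypothetical; the markers in particular are a crude heuristic that would need tuning for real data.

```python
# Hypothetical phrases that suggest the assistant declined the task
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "as an ai")


def extra_quality_checks(examples):
    """Flag malformed conversations, duplicate inputs, and refusal-like replies."""
    issues = []
    seen = {}  # normalized user text -> index of first occurrence

    for i, ex in enumerate(examples):
        messages = ex.get("messages", [])
        roles = [m.get("role") for m in messages]

        # Formatting consistency: need a user turn and a final assistant turn
        if not roles or roles[-1] != "assistant" or "user" not in roles:
            issues.append(f"Example {i}: expected a user turn and a final assistant turn")
            continue

        # Diversity: exact duplicates after lowercasing and collapsing whitespace
        user_text = " ".join(m["content"] for m in messages if m["role"] == "user")
        key = " ".join(user_text.lower().split())
        if key in seen:
            issues.append(f"Example {i}: duplicate of example {seen[key]}")
        else:
            seen[key] = i

        # Refusals: training on these teaches the model to decline the task
        reply = messages[-1]["content"].lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            issues.append(f"Example {i}: assistant reply looks like a refusal")

    return issues
```

This only catches exact duplicates after normalization; near-duplicates would need something like embedding similarity or n-gram overlap.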

Pro tips:

Step 3: Choose Your Approach

Option A: Hosted Fine-Tuning API (OpenAI)

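import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload training file
with open("training.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 1.8
    }
)

# Monitor progress until the job reaches a terminal state
# (looping only on "succeeded" would never exit if the job fails)
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")

if job.status != "succeeded":
    raise RuntimeError(f"Fine-tuning job ended with status: {job.status}")

model_name = job.fine_tuned_model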

Option B: LoRA Fine-Tuning (Llama, Mistral, OSS models)

LoRA freezes the base weights and trains small low-rank adapter matrices instead, which makes it far cheaper for larger models:

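from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)

base_model = "mistralai/Mistral-7B-v0.1"

# Tokenizer is used to build train_dataset / eval_dataset (tokenization not shown)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load base model in 8-bit (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # prepares a quantized model for training

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only the adapters should be trainable

# Train (using Hugging Face Trainer)
training_args = TrainingArguments(
    output_dir="./medical-mistral-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    eval_strategy="epoch"  # named evaluation_strategy in older transformers releases
)

# train_dataset and eval_dataset are tokenized datasets prepared beforehand (not shown)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()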
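
To get a feel for why LoRA is cheap, the arithmetic behind the config above can be worked through directly. The sketch below assumes the published Mistral-7B shapes (hidden size 4096, 32 layers, and 8 key/value heads of dimension 128, so `k_proj` and `v_proj` project to 1024); treat those numbers as assumptions and check them against the model's config before relying on the result.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Mistral-7B shapes (verify against the model config)
HIDDEN = 4096
KV_DIM = 1024      # 8 KV heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32

# Values taken from the LoraConfig / TrainingArguments above
R = 16
LORA_ALPHA = 32
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4

PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in PROJECTIONS.values())
total_trainable = per_layer * NUM_LAYERS

# The adapter update is scaled by lora_alpha / r before being added to the frozen weight
scaling = LORA_ALPHA / R

# Examples seen per optimizer step on one device
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

print(f"Trainable LoRA parameters: {total_trainable:,}")  # 13,631,488
print(f"LoRA scaling factor: {scaling}")                  # 2.0
print(f"Effective batch size per device: {effective_batch}")  # 16
```

Under these assumptions the adapters come to roughly 13.6M parameters, a fraction of a percent of the 7B base model.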

Option C: Modern Alignment (DPO, ORPO, RLHF)

After SFT, alignment teaches the model which of two candidate responses is preferred. DPO (Direct Preference Optimization) is now a common alternative to classic RLHF in production: it is simpler, typically more stable, and doesn't need a separate reward model. ORPO goes further by combining SFT and preference alignment into a single training stage.

# DPO with TRL: simpler and more stable than RLHF
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset

# Preference dataset: each row has prompt, chosen, rejected
# (a single row for illustration; real runs need many pairs)
preference_data = Dataset.from_dict({
    "prompt": ["Patient with acute bronchitis, no fever..."],
    "chosen": ["ICD-10: J20.9 (acute bronchitis, unspecified). CPT: 99213..."],
    "rejected": ["The patient has bronchitis. Code it appropriately."],
})

# Start from the SFT checkpoint; with no ref_model passed, DPOTrainer uses a copy as the reference
model = AutoModelForCausalLM.from_pretrained("./medical-mistral-sft")
tokenizer = AutoTokenizer.from_pretrained("./medical-mistral-sft")

dpo_config = DPOConfig(
    output_dir="./medical-mistral-dpo",
    beta=0.1,                    # KL penalty: lower = more deviation allowed
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-7,
    loss_type="sigmoid",         # standard DPO loss; "ipo" and "hinge" are alternatives
)

trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=preference_data,
    processing_class=tokenizer,  # older TRL releases name this argument `tokenizer`
)
trainer.train()
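
For intuition about what `beta` and the "sigmoid" loss do, here is a minimal pure-Python sketch of the DPO objective for a single preference pair. It takes summed log-probabilities as plain floats; the function name and argument names are illustrative, and a real trainer computes these quantities in batches on tensors.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))) for one pair."""
    # Implicit rewards: how far the policy has moved from the reference on each response
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on sign to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2)
baseline = dpo_sigmoid_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability toward the chosen response: loss drops
improved = dpo_sigmoid_loss(-10.0, -18.0, -12.0, -15.0)

print(baseline, improved)
```

A smaller `beta` shrinks the margin for the same shift in log-probabilities, so the policy has to move further from the reference to reduce the loss by the same amount.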

Choosing an alignment method:

| Method | When to use | Notes |
|--------|-------------|-------|
| SFT only | You have demonstrations, no preferences | Fast, simple, baseline |
| DPO | You have pairwise preferences (chosen/rejected) | Default choice: stable, no reward model |
| ORPO | You want to skip the SFT step | Combines SFT + preference into one pass |
| KTO | You only have binary feedback (good/bad) | No need for paired data |
| RLHF (PPO) | You need a reward model for online generation | Most complex; use when DPO underperforms |
| RLAIF / Constitutional AI | You can't get human labels at scale | LLM-as-judge generates the preferences |

Modern toolchain we recommend:

Step 4: Evaluation & Testing

Never skip this. Fine-tuned models can regress on general tasks:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

eval_prompts = {
    "domain_task": "Translate this diagnosis to ICD-10...",
    "general_reasoning": "Explain why the sky is blue...",
    "math": "What is 15% of 240?",
    "formatting": "Output valid JSON for..."
}

def query_model(model, prompt):
    """Helper to query any OpenAI chat model"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Test both base and fine-tuned model
for task, prompt in eval_prompts.items():
    base_response = query_model("gpt-4o-mini", prompt)
    finetuned_response = query_model(model_name, prompt)  # model_name from training

    print(f"{task}:")
    print(f"  Base: {base_response}")
    print(f"  Fine-tuned: {finetuned_response}")
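
Eyeballing printed responses does not scale, so it can help to attach a simple pass/fail check to the tasks that have a verifiable answer. The sketch below assumes the outputs have already been collected into dicts keyed by task name; the checker functions and `CHECKS` mapping are hypothetical, and open-ended tasks like "general_reasoning" would still need human or LLM-based grading.

```python
import json
import re


def is_valid_json(text):
    """True if the whole response parses as JSON."""
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False


def contains_number(text, expected, tol=1e-6):
    """True if any number in the response equals the expected value."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return any(abs(float(n) - expected) <= tol for n in numbers)


# Only tasks with an objective answer get a check
CHECKS = {
    "math": lambda out: contains_number(out, 36),  # 15% of 240 = 36
    "formatting": is_valid_json,
}


def compare_models(base_outputs, finetuned_outputs, checks=CHECKS):
    """Return {task: (base_passed, finetuned_passed)} for each checked task."""
    return {
        task: (check(base_outputs[task]), check(finetuned_outputs[task]))
        for task, check in checks.items()
    }


def regressions(report):
    """Tasks the base model passed but the fine-tuned model failed."""
    return [task for task, (base_ok, ft_ok) in report.items() if base_ok and not ft_ok]
```

Running `regressions(compare_models(...))` after each training run gives a quick signal when a fine-tune has traded general ability for domain performance.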

Hybrid: Fine-Tuning + RAG

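The two are complementary: fine-tuning shapes how the model answers, while RAG supplies the facts it answers with. A strong setup combines both: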

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def retrieve_top_k(question, k=3):
    """Retrieve relevant documents using vector search"""
    # Get query embedding
    embedding = client.embeddings.create(
        input=question,
        model="text-embedding-ada-002"
    ).data[0].embedding

    # Query vector database (example with hypothetical DB)
    # results = vector_db.search(embedding, limit=k)
    # return "\n\n".join([doc.content for doc in results])

    # Placeholder for demo (the embedding above is unused until a vector DB is wired in)
    return "Relevant medical coding guidelines from knowledge base..."

def hybrid_query(question, finetuned_model_id):
    """Fine-tuned model + RAG context"""

    # 1. Retrieve relevant docs (RAG)
    context = retrieve_top_k(question, k=3)

    # 2. Query fine-tuned model with context
    prompt = f"""Context:
{context}

Question: {question}

Provide a detailed answer using both the context and your medical coding expertise."""

    response = client.chat.completions.create(
        model=finetuned_model_id,  # e.g., "ft:gpt-4o-mini-2024-07-18:org::jobid"
        messages=[
            {"role": "system", "content": "You are a certified medical coder..."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content
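
The vector database step is left as a placeholder above. For prototyping, a small in-memory store with cosine similarity is often enough to stand in for it. The sketch below is one such stand-in; `InMemoryVectorStore` and `build_context` are hypothetical names, and a production system would use a real vector database with approximate nearest-neighbor search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Brute-force store: fine for a few thousand documents, not for millions."""

    def __init__(self):
        self._items = []  # list of (embedding, content) pairs

    def add(self, embedding, content):
        self._items.append((list(embedding), content))

    def search(self, query_embedding, limit=3):
        """Return the contents of the `limit` most similar documents."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_embedding, item[0]),
            reverse=True,
        )
        return [content for _, content in ranked[:limit]]


def build_context(store, query_embedding, k=3):
    """Join the top-k documents into one context string for the prompt."""
    return "\n\n".join(store.search(query_embedding, limit=k))
```

With this in place, the placeholder return in a function like `retrieve_top_k` could become `build_context(store, embedding, k)`, assuming the documents were embedded with the same model as the query.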

Cost-Benefit Analysis

Fine-tuning costs (OpenAI GPT-4o-mini):

When it pays off:

Real-World Results

One client case study (healthcare):

Quick Start Checklist

Need help fine-tuning for your domain? Let's talk.