Model Versioning in Production: Beyond Git for Machine Learning
1/20/2025
Git works well for source code, but it does not capture everything that defines a model. A model checkpoint is the product of specific training data, hyperparameters, and dependencies, and its evaluation metrics determine whether it is ready for production.
The Model Versioning Stack
Here's the four-layer approach we implement for clients:
1. Semantic Versioning for Models
# Not this:
model_v1.pkl
model_v2_final.pkl
model_v2_final_ACTUALLY_FINAL.pkl
# This:
fraud-detector-v2.3.1
# Major: Breaking API changes (input schema, output format)
# Minor: Accuracy improvements, new features
# Patch: Bug fixes, performance tuning
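To make the naming scheme enforceable, identifiers can be parsed rather than compared as strings. The following is a minimal sketch, assuming every identifier follows the `<name>-v<major>.<minor>.<patch>` shape shown above; `ModelVersion`, `parse_model_version`, and `bump_kind` are hypothetical helper names, not part of any library.

```python
import re
from typing import NamedTuple


class ModelVersion(NamedTuple):
    """Parsed form of an identifier such as 'fraud-detector-v2.3.1'."""
    name: str
    major: int
    minor: int
    patch: int


# Assumed pattern: <name>-v<major>.<minor>.<patch>
_VERSION_RE = re.compile(
    r"^(?P<name>.+)-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def parse_model_version(model_id: str) -> ModelVersion:
    """Split a model identifier into its name and numeric version parts."""
    match = _VERSION_RE.match(model_id)
    if match is None:
        raise ValueError(f"not a versioned model id: {model_id!r}")
    return ModelVersion(
        match["name"],
        int(match["major"]),
        int(match["minor"]),
        int(match["patch"]),
    )


def bump_kind(old: ModelVersion, new: ModelVersion) -> str:
    """Classify the change between two versions of the same model."""
    if old.name != new.name:
        raise ValueError("cannot compare versions of different models")
    if new.major != old.major:
        return "major"
    if new.minor != old.minor:
        return "minor"
    if new.patch != old.patch:
        return "patch"
    return "none"


old = parse_model_version("fraud-detector-v2.2.4")
new = parse_model_version("fraud-detector-v2.3.1")
print(bump_kind(old, new))
```

Because the parts are stored as integers, 2.10.0 sorts after 2.9.0, which a plain string comparison would get wrong.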
2. Model Registry with Metadata
Every model version stores:
{
  "model_id": "fraud-detector-v2.3.1",
  "training_data": "s3://data/fraud-2025-01-15.parquet",
  "data_hash": "sha256:a3b2c1...",
  "framework": "pytorch==2.1.0",
  "metrics": {
    "precision": 0.94,
    "recall": 0.89,
    "f1": 0.91,
    "auc_roc": 0.96
  },
  "training_duration": "3h 24m",
  "artifact_uri": "s3://models/fraud-detector/2.3.1/",
  "stage": "production",
  "promoted_at": "2025-01-20T10:00:00Z",
  "promoted_by": "user@company.com"
}
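One way the `data_hash` field might be produced, and a record checked before it is stored, is sketched below. It assumes the training file is available locally (for example, after downloading it from S3); `sha256_of_file`, `missing_fields`, and the choice of required keys are illustrative assumptions, not a fixed schema.

```python
import hashlib
from pathlib import Path

# Assumed minimum set of keys; adjust to whatever your registry requires.
REQUIRED_FIELDS = {
    "model_id",
    "training_data",
    "data_hash",
    "framework",
    "metrics",
    "artifact_uri",
    "stage",
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return a 'sha256:<hex>' string."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def missing_fields(record):
    """Return the required keys absent from a metadata record, sorted."""
    return sorted(REQUIRED_FIELDS - set(record))
```

Rejecting a record when `missing_fields` returns a non-empty list keeps incomplete entries out of the registry.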
3. Lineage Tracking
Connect models to their origins:
# Using MLflow or custom tracking.
# Assumes config, data, train(), metrics, and signature are defined elsewhere.
import mlflow

with mlflow.start_run():
    # Log dataset version
    mlflow.log_param("dataset_version", "2025-01-15")
    mlflow.log_param("feature_engineering_commit", "abc123")

    # Log hyperparameters
    mlflow.log_params(config)

    # Train model
    model = train(data, config)

    # Log metrics
    mlflow.log_metrics(metrics)

    # Log model with signature
    # (for a PyTorch model, mlflow.pytorch.log_model is the matching flavor)
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=signature,
        registered_model_name="fraud-detector"
    )
4. Staged Rollouts & Rollbacks
Never deploy directly to 100% traffic:
# Deployment strategy
stages:
  - name: staging
    traffic: 0%
    purpose: Integration testing
  - name: canary
    traffic: 5%
    duration: 2h
    abort_on:
      - latency_p95 > 200ms
      - error_rate > 0.5%
  - name: production
    traffic: 100%
    rollback_to: v2.2.4  # Previous stable version
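The canary abort conditions above could be evaluated with a check like the following. This is a sketch under stated assumptions: the observed metrics arrive as a dict with the hypothetical keys `latency_p95_ms` and `error_rate` (a fraction, so 0.5% is 0.005), and how those numbers are collected is left to your monitoring stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryThresholds:
    """Limits mirroring the abort_on rules in the deployment strategy."""
    max_latency_p95_ms: float = 200.0
    max_error_rate: float = 0.005  # 0.5% expressed as a fraction


def canary_violations(observed, thresholds=CanaryThresholds()):
    """Return a list of abort reasons; an empty list means the canary may continue."""
    violations = []
    if observed["latency_p95_ms"] > thresholds.max_latency_p95_ms:
        violations.append(
            f"latency_p95 {observed['latency_p95_ms']:.0f}ms exceeds "
            f"{thresholds.max_latency_p95_ms:.0f}ms"
        )
    if observed["error_rate"] > thresholds.max_error_rate:
        violations.append(
            f"error_rate {observed['error_rate']:.2%} exceeds "
            f"{thresholds.max_error_rate:.2%}"
        )
    return violations


# Example: a canary that is too slow but within the error budget.
print(canary_violations({"latency_p95_ms": 250.0, "error_rate": 0.002}))
```

Returning the reasons rather than a bare boolean makes it easier to record why a rollout was aborted.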
Implementation Example
Here's a minimal model registry using DynamoDB:
import json
from datetime import datetime, timezone
from decimal import Decimal

import boto3
from boto3.dynamodb.conditions import Attr


class ModelRegistry:
    def __init__(self):
        self.table = boto3.resource('dynamodb').Table('model-registry')

    def register_model(self, model_id, metadata):
        """Register a new model version in the staging stage"""
        # The boto3 resource API rejects Python floats, so convert them to Decimal
        metadata = json.loads(json.dumps(metadata), parse_float=Decimal)
        item = {
            **metadata,
            # Set after **metadata so a new version always starts in staging
            'model_id': model_id,
            'registered_at': datetime.now(timezone.utc).isoformat(),
            'stage': 'staging',
        }
        self.table.put_item(Item=item)
        return model_id

    def _scan_stage(self, stage):
        """Return every item in the given stage, following scan pagination"""
        kwargs = {'FilterExpression': Attr('stage').eq(stage)}
        items = []
        while True:
            response = self.table.scan(**kwargs)
            items.extend(response['Items'])
            if 'LastEvaluatedKey' not in response:
                return items
            kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

    def get_production_model(self):
        """Get currently deployed production model"""
        items = self._scan_stage('production')
        return items[0] if items else None

    def promote_to_production(self, model_id):
        """Promote model to production stage"""
        timestamp = datetime.now(timezone.utc).isoformat()

        # Get current production model
        current = self.get_production_model()

        # Promote new model first. The condition makes the call fail for an
        # unknown model_id instead of silently creating a new record, and a
        # failure here leaves the current production model untouched.
        self.table.update_item(
            Key={'model_id': model_id},
            UpdateExpression='SET #s = :prod, promoted_at = :time',
            ConditionExpression='attribute_exists(model_id)',
            ExpressionAttributeNames={'#s': 'stage'},
            ExpressionAttributeValues={
                ':prod': 'production',
                ':time': timestamp
            }
        )

        # Demote previous model to archived (preserve promoted_at for rollback)
        if current and current['model_id'] != model_id:
            self.table.update_item(
                Key={'model_id': current['model_id']},
                UpdateExpression='SET #s = :archived, archived_at = :time',
                ExpressionAttributeNames={'#s': 'stage'},
                ExpressionAttributeValues={
                    ':archived': 'archived',
                    ':time': timestamp
                }
            )

    def rollback(self, to_version=None):
        """Rollback to previous or specified version"""
        if to_version:
            self.promote_to_production(to_version)
            return

        # Pick the archived version that was most recently in production
        archived = self._scan_stage('archived')
        if archived:
            latest = max(archived, key=lambda x: x.get('promoted_at', ''))
            self.promote_to_production(latest['model_id'])
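The registry's promote_to_production call promotes unconditionally. A gate in front of it might look like the sketch below, which compares a candidate's stored metrics against minimum floors and against the current production record. The function name `promotion_blockers`, the floor values, and the 0.01 regression tolerance are all hypothetical choices for illustration.

```python
def promotion_blockers(candidate, production=None, min_metrics=None,
                       max_regression=0.01):
    """Return reasons the candidate should not be promoted.

    candidate and production are registry records with a 'metrics' dict.
    An empty list means every automated check passed.
    """
    min_metrics = min_metrics or {}
    candidate_metrics = candidate.get("metrics", {})
    blockers = []

    # Absolute floors: each listed metric must exist and meet its minimum.
    for name, floor in min_metrics.items():
        value = candidate_metrics.get(name)
        if value is None:
            blockers.append(f"missing metric: {name}")
        elif float(value) < floor:
            blockers.append(f"{name} {float(value):.3f} is below floor {floor:.3f}")

    # Relative check: no metric may drop too far below production.
    # float() also accepts the Decimal values DynamoDB returns.
    if production:
        for name, prod_value in production.get("metrics", {}).items():
            value = candidate_metrics.get(name)
            if value is None:
                continue
            drop = float(prod_value) - float(value)
            if drop > max_regression:
                blockers.append(f"{name} regressed by {drop:.3f} vs production")

    return blockers


candidate = {"metrics": {"precision": 0.94, "recall": 0.89}}
production = {"metrics": {"precision": 0.93, "recall": 0.92}}
print(promotion_blockers(candidate, production, min_metrics={"precision": 0.90}))
```

Calling promote_to_production only when this list is empty, and after a human has signed off, combines the automated checks with manual approval.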
Key Takeaways
- Version everything: Code, data, configs, dependencies
- Store metadata: Metrics, lineage, timestamps, authors
- Gate promotion: Automated checks plus a manual approval step
- Plan for rollback: One-command revert to last known good
- Audit trail: Who promoted what, when, and why
For regulated industries (banking, financial services, insurance, healthcare), this isn't optional; it's a compliance requirement.
Want help setting up model versioning for your team? Get in touch.