Zero-Downtime Prompt Migrations: Evolving AI Templates Without Breaking Production
You’ve just crafted the perfect prompt. It’s clearer, more focused, and handles edge cases better. You’re excited to deploy it.
Then reality hits: You’re changing the brain of a production AI system. One bad prompt could cascade into failed tasks, confused agents, and broken workflows.
How do you evolve your AI templates without breaking production?
The Production Prompt Problem
Unlike traditional code changes, prompt modifications are uniquely challenging:
// Traditional code: static, predictable behavior
function authenticate(user) {
  return validateCredentials(user);
}

// AI prompt: dynamic, emergent behavior
"You are a security master. Analyze this authentication flow..."
The difference?
- Code changes: Testable with unit tests, predictable outputs
- Prompt changes: Emergent behavior, context-dependent outputs, harder to validate
Real consequences we’ve seen:
- V1 → V2 security master: New prompt emphasized speed over thoroughness, missed 3 CVEs
- Worker template update: Changed phrasing caused 40% of agents to skip test execution
- Coordinator refinement: Better routing accuracy (85% → 92%) but 2x slower decision time
You need a migration strategy that’s as rigorous as database migrations.
Migration Strategy: Champion/Challenger/Shadow
Cortex uses a three-tier deployment pattern inspired by AWS and Google’s SRE practices:
┌─────────────────────────────────────────────────────────┐
│ Production Traffic │
└────────────┬──────────────┬────────────────┬────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌─────────┐
│ Shadow │ │Challenger│ │Champion │
│ v2.0.0 │ │ v1.5.0 │ │ v1.0.0 │
├─────────┤ ├──────────┤ ├─────────┤
│ Monitor │ │ 10% Test │ │90% Prod │
│ Only │ │ Traffic │ │ Traffic │
│ │ │ │ │ │
│ Response│ │ Response │ │ Response│
│ Logged │ │ Used │ │ Used │
│ Dropped │ │ Metrics │ │ Primary │
└─────────┘ └──────────┘ └─────────┘
Tier 1: Shadow - Zero Risk Monitoring
Purpose: Test new prompts with production traffic, zero production impact
./scripts/promote-master.sh security deploy-shadow v2.0.0
What happens (sketched below):
- New prompt receives copy of all production traffic
- Agent executes task with new prompt
- Response is logged but discarded
- Production uses existing champion version
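A minimal sketch of what shadow dispatch can look like, assuming hypothetical helpers `run_agent` (executes a task with a given prompt tier) and `log_shadow_result`; only the champion's response ever reaches production:

# Hypothetical shadow dispatch: champion output is used, shadow output is
# logged for comparison and then discarded.
dispatch_with_shadow() {
  local task_file="$1"

  # Champion handles the real work; its output is what production sees
  local champion_response
  champion_response=$(run_agent "champion" "$task_file")

  # Shadow runs in the background on a copy of the same task
  (
    shadow_response=$(run_agent "shadow" "$task_file")
    log_shadow_result "$task_file" "$shadow_response"   # logged, never returned
  ) &

  echo "$champion_response"
}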
Metrics to watch:
{"version":"v2.0.0","status":"success","duration_ms":1250,"quality":0.92}
{"version":"v2.0.0","status":"failed","error":"Timeout after 30s"}
{"version":"v2.0.0","status":"success","duration_ms":980,"quality":0.95}
Decision criteria (an automated gate sketch follows this list):
- Error rate < 5% (vs champion’s baseline)
- Average quality score ≥ champion’s score
- Execution time within 150% of champion
- No critical failures (security issues, data loss)
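These criteria are easy to enforce mechanically. A gating sketch, assuming the shadow and champion metrics have been aggregated into two JSON summaries with `error_rate`, `avg_quality`, and `avg_duration_ms` fields (the file names and fields are illustrative, not Cortex's actual schema):

# Hypothetical promotion gate: shadow vs. champion baseline
shadow=coordination/evaluation/results/shadow-summary.json
champion=coordination/evaluation/results/champion-summary.json

s_err=$(jq -r '.error_rate' "$shadow")
s_q=$(jq -r '.avg_quality' "$shadow");        c_q=$(jq -r '.avg_quality' "$champion")
s_dur=$(jq -r '.avg_duration_ms' "$shadow");  c_dur=$(jq -r '.avg_duration_ms' "$champion")

# Error rate under 5%, quality at least matching, latency within 150% of champion
awk -v se="$s_err" -v sq="$s_q" -v cq="$c_q" -v sd="$s_dur" -v cd="$c_dur" 'BEGIN {
  ok = (se < 5) && (sq >= cq) && (sd <= 1.5 * cd)
  print (ok ? "PROMOTE to challenger" : "HOLD in shadow")
  exit !ok
}'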
Real example from Cortex:
# Shadow deployment: security-master v2.0.0
# 100 shadow executions over 4 hours
RESULTS:
Success rate: 94% (vs 96% champion)
Avg quality: 0.91 (vs 0.89 champion)
Avg duration: 45s (vs 38s champion)
Critical failures: 0
DECISION: ✓ Promote to Challenger
REASONING: Quality improved, latency acceptable (<20% slower)
Tier 2: Challenger - Controlled Canary
Purpose: Test with real production traffic, limited blast radius
./scripts/promote-master.sh security promote-to-challenger
Traffic split:
- Challenger (v2.0.0): 10% of production traffic
- Champion (v1.0.0): 90% of production traffic
Why 10%? It balances real-world validation with risk (see the routing sketch after this list):
- Large enough to catch issues (30-50 tasks/hour typical)
- Small enough to limit damage if problems emerge
- Statistically significant for A/B testing
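A minimal sketch of how a 90/10 split can be implemented at dispatch time; the percentage constant and the `route_to` helper are assumptions, not Cortex internals:

# Hypothetical weighted routing between champion and challenger
CHALLENGER_PERCENT=10

route_task() {
  local task_file="$1"
  # RANDOM is 0-32767; modulo 100 gives a rough percentage bucket
  if (( RANDOM % 100 < CHALLENGER_PERCENT )); then
    route_to "challenger" "$task_file"
  else
    route_to "champion" "$task_file"
  fi
}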
Monitoring during canary:
# Real-time metrics comparison
watch -n 5 'cat coordination/evaluation/results/challenger-metrics.json'
{
"window": "last_1_hour",
"challenger_v2.0.0": {
"tasks_assigned": 47,
"tasks_completed": 44,
"completion_rate": 93.6,
"avg_duration_sec": 42.3,
"avg_quality_score": 4.2,
"errors": 3
},
"champion_v1.0.0": {
"tasks_assigned": 423,
"tasks_completed": 406,
"completion_rate": 96.0,
"avg_duration_sec": 38.1,
"avg_quality_score": 4.1,
"errors": 17
},
"statistical_significance": true,
"winner": "challenger",
"confidence": 0.89
}
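If you want the deltas without opening a dashboard, a small jq query over that metrics file surfaces them directly (a convenience sketch using the field names shown above, not part of the Cortex tooling):

jq '{
  completion_delta: (.["challenger_v2.0.0"].completion_rate   - .["champion_v1.0.0"].completion_rate),
  quality_delta:    (.["challenger_v2.0.0"].avg_quality_score - .["champion_v1.0.0"].avg_quality_score),
  latency_delta:    (.["challenger_v2.0.0"].avg_duration_sec  - .["champion_v1.0.0"].avg_duration_sec)
}' coordination/evaluation/results/challenger-metrics.json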
Real migration case: Development Master v1.5.0
We updated the development master prompt to better handle test coverage requirements:
# BEFORE (v1.0.0)
When implementing features, write tests to verify behavior.
# AFTER (v1.5.0)
Before marking implementation complete:
1. Write unit tests for all new functions (target: 80% coverage)
2. Write integration tests for API endpoints
3. Verify tests pass: `npm test`
4. Update test documentation in README
Canary results (48 hours):
Tasks completed: 127
Test coverage improvement: 65% → 84% average
Failures due to test issues: 3 (2.4%) - acceptable
Rollback triggers: 0
DECISION: ✓ Promote to Champion
Tier 3: Champion - Full Production
Purpose: Current production version serving majority of traffic
./scripts/promote-master.sh security promote-to-champion
What happens:
- Previous champion (v1.0.0) → archived
- Previous challenger (v2.0.0) → new champion
- Challenger slot → empty (ready for next canary)
- Version history logged for audit
{
"version_history": [
{
"version": "v2.0.0",
"promoted_at": "2025-12-17T10:30:00Z",
"status": "champion",
"description": "Enhanced security analysis with CVE cross-referencing"
},
{
"version": "v1.0.0",
"promoted_at": "2025-11-15T08:00:00Z",
"status": "archived",
"description": "Original security master baseline"
}
]
}
Backward Compatibility: The Safety Net
Semantic Versioning for Prompts
We treat prompts like APIs with semantic versioning (a promotion-path sketch follows the rules below):
v{MAJOR}.{MINOR}.{PATCH}
v1.0.0 → v1.1.0 ✓ Backward compatible (new capability)
v1.1.0 → v2.0.0 ⚠️ Breaking change (behavior shift)
v2.0.0 → v2.0.1 ✓ Patch (clarification, no behavior change)
MAJOR version (v1.x.x → v2.0.0):
- Changes agent core behavior
- Modifies output format
- Alters decision-making logic
- Requires: Full shadow → challenger → champion migration
MINOR version (v2.0.x → v2.1.0):
- Adds new capabilities
- Enhances existing features
- Backward compatible
- Requires: Shadow validation, can skip challenger for low-risk changes
PATCH version (v2.1.0 → v2.1.1):
- Fixes typos, clarifies language
- No behavioral change
- Requires: Review + deploy, can skip staging
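These rules are mechanical enough to encode. A sketch that derives the required rollout path from the version bump (the function and its output strings are illustrative):

# Hypothetical helper: decide the rollout path from old/new version strings
required_path() {
  local old="$1" new="$2"
  local old_major="${old#v}"; old_major="${old_major%%.*}"
  local new_major="${new#v}"; new_major="${new_major%%.*}"
  local old_minor new_minor
  old_minor=$(echo "$old" | cut -d. -f2)
  new_minor=$(echo "$new" | cut -d. -f2)

  if (( new_major > old_major )); then
    echo "shadow -> challenger -> champion"   # MAJOR: full progressive rollout
  elif (( new_minor > old_minor )); then
    echo "shadow -> champion (challenger optional for low-risk changes)"
  else
    echo "review -> deploy"                   # PATCH: no behavior change
  fi
}

required_path v1.3.0 v2.0.0   # prints: shadow -> challenger -> champion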
Template Variable Compatibility
Cortex prompts use template variables replaced at runtime:
# Worker prompt template
You are {{WORKER_TYPE}} for task {{TASK_ID}}.
Your token budget: {{TOKEN_BUDGET}}
Repository: {{REPOSITORY}}
Migration rule: Never remove template variables in MINOR/PATCH versions.
# ✓ SAFE: Add new optional variable (v1.0.0 → v1.1.0)
Knowledge base: {{KNOWLEDGE_BASE_PATH}}
# ⚠️ BREAKING: Remove existing variable (v1.x.x → v2.0.0)
-Repository: {{REPOSITORY}}
# ✓ SAFE: Rename with fallback (v1.x.x → v1.1.0)
Repository: {{REPOSITORY}}{{REPO}} # Tries both
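A compatibility check can be as simple as verifying that every `{{VARIABLE}}` used by the current champion still appears in the candidate. A grep-based sketch with illustrative file paths:

# Hypothetical check: fail if the new prompt drops a variable the champion uses
old=coordination/prompts/worker/v1.0.0.md
new=coordination/prompts/worker/v1.1.0.md

missing=0
for var in $(grep -o '{{[A-Z_]*}}' "$old" | sort -u); do
  if ! grep -qF "$var" "$new"; then
    echo "BREAKING: $var removed; this requires a MAJOR version bump"
    missing=1
  fi
done
exit $missing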
Fallback Mechanism
Every prompt load includes a fallback chain:
# scripts/lib/prompt-manager.sh
get_prompt() {
  local prompt_type="$1"

  # Try version-specific prompt
  local versioned="coordination/prompts/${prompt_type}/${VERSION}.md"
  if [ -f "$versioned" ]; then
    cat "$versioned"
    return
  fi

  # Fallback to latest
  local latest="coordination/prompts/${prompt_type}/latest.md"
  if [ -f "$latest" ]; then
    cat "$latest"
    return
  fi

  # Fallback to legacy location (backward compatibility)
  local legacy="agents/prompts/${prompt_type}.md"
  if [ -f "$legacy" ]; then
    log_warning "Using legacy prompt path for ${prompt_type}"
    cat "$legacy"
    return
  fi

  log_error "No prompt found for ${prompt_type}"
  exit 1
}
This ensures zero downtime even if version configuration is temporarily inconsistent.
Testing New Templates Before Rollout
Synthetic Evaluation with Golden Dataset
Before deploying to shadow, validate against known-good examples:
# Run evaluation against 50 golden examples
./evaluation/run-evaluation.sh --prompt security-master-v2.0.0
Golden dataset structure:
{
"eval_id": "security-001",
"task_description": "Analyze authentication flow for SQL injection vulnerabilities",
"expected_actions": [
"Scan SQL queries for parameterization",
"Check input validation on login endpoints",
"Review authentication token generation"
],
"expected_findings_count": 3,
"expected_quality_score": 0.90
}
Evaluation output:
╔════════════════════════════════════════════╗
║ Prompt Evaluation: security-master-v2.0.0 ║
╚════════════════════════════════════════════╝
Total Examples: 50
Passed: 46 (92%)
Failed: 4 (8%)
Avg Quality Score: 0.91 (baseline: 0.89)
Avg Execution Time: 42s (baseline: 38s)
VERDICT: ✓ Ready for shadow deployment
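Under the hood, a harness like this can be a plain loop over the golden examples. A sketch assuming a hypothetical `run_prompt_eval` helper that executes one example against the candidate prompt and prints a quality score, plus an assumed dataset location:

# Hypothetical evaluation loop over the golden dataset
PROMPT_VERSION="security-master-v2.0.0"
PASS_THRESHOLD=0.85              # assumed per-example pass bar
passed=0; total=0

for example in evaluation/golden/*.json; do
  total=$((total + 1))
  score=$(run_prompt_eval "$PROMPT_VERSION" "$example")   # prints e.g. 0.91
  if awk -v s="$score" -v t="$PASS_THRESHOLD" 'BEGIN { exit !(s >= t) }'; then
    passed=$((passed + 1))
  fi
done

echo "Passed: $passed/$total"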
LM-as-Judge for Quality Assessment
For complex behavioral changes, use Claude itself to evaluate prompt quality:
# Evaluate routing quality with LM-as-Judge
./llm-mesh/moe-learning/moe-learn.sh eval
Judge prompt (simplified):
You are evaluating a routing decision.
TASK: {{TASK_DESCRIPTION}}
ACTUAL ROUTING:
Expert: {{ACTUAL_EXPERT}}
Confidence: {{ACTUAL_CONFIDENCE}}
IDEAL ROUTING:
Expert: {{IDEAL_EXPERT}}
Confidence: {{IDEAL_CONFIDENCE}}
Score the routing decision:
1. Routing Accuracy (0-100): Did it pick the right expert?
2. Confidence Calibration (0-100): Is confidence appropriate?
Provide actionable improvement suggestions.
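Wiring up a judge like this is mostly template substitution plus JSON parsing. A sketch assuming a hypothetical `call_llm` helper that takes the filled prompt and returns the judge's JSON verdict, and an assumed `judge-prompt-template.md` file:

# Hypothetical judge invocation: fill the template, call the model, extract scores
judge_routing() {
  local task_desc="$1" actual_expert="$2" ideal_expert="$3"

  local prompt
  prompt=$(sed -e "s|{{TASK_DESCRIPTION}}|$task_desc|" \
               -e "s|{{ACTUAL_EXPERT}}|$actual_expert|" \
               -e "s|{{IDEAL_EXPERT}}|$ideal_expert|" \
               judge-prompt-template.md)

  # call_llm stands in for however you invoke the judge model
  call_llm "$prompt" | jq '{accuracy: .judgment.routing_accuracy,
                            calibration: .judgment.confidence_calibration}'
}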
Real evaluation result:
{
"eval_id": "eval-023",
"judgment": {
"routing_accuracy": 88,
"confidence_calibration": 92,
"is_optimal": true,
"analysis": {
"strengths": "Correctly identified security concern, appropriate confidence",
"weaknesses": "Could better explain why development was not chosen",
"learning_insights": [
"Add reasoning transparency to prompt",
"Include confidence justification in output"
]
}
}
}
A/B Testing for Incremental Changes
Compare two prompt versions statistically:
./scripts/lib/prompt-manager.sh start_ab_test \
"security-master" \
"v1.0.0" \
"v2.0.0" \
50 # 50/50 traffic split
Results after 100 tasks per variant:
╔════════════════════════════════════════════════╗
║ A/B Test Results: security-master ║
╚════════════════════════════════════════════════╝
Metric v1.0.0 (Control) v2.0.0 (Variant) Delta
─────────────────────────────────────────────────────────────────────────
Tasks Completed 94 97 +3.2%
Avg Quality Score 4.1 4.4 +7.3%
Avg Duration (sec) 38 42 +10.5%
Critical Failures 2 1 -50.0%
Statistical Significance: ✓ (p < 0.05, n=100 per variant)
Winner: v2.0.0 (higher quality, acceptable latency increase)
Recommendation: Promote to champion
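The significance check behind that verdict can be a standard two-proportion z-test on completion counts; a self-contained awk sketch with placeholder counts (a real script would also compare quality scores, which need a t-test):

# Illustrative two-proportion z-test: is the completion-rate delta real?
awk -v x1=430 -v n1=470 -v x2=455 -v n2=470 'BEGIN {
  p1 = x1 / n1; p2 = x2 / n2
  p  = (x1 + x2) / (n1 + n2)                    # pooled proportion
  se = sqrt(p * (1 - p) * (1/n1 + 1/n2))        # pooled standard error
  z  = (p2 - p1) / se
  printf "z = %.2f (|z| > 1.96 means p < 0.05, two-tailed)\n", z
}'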
Monitoring During Migrations
Real-Time Anomaly Detection
Cortex continuously monitors for deviations during migrations:
# coordination/observability/alert-rules.json
{
"rule_id": "alert-001",
"name": "Critical Success Rate Drop",
"conditions": {
"anomaly_type": "success_rate_drop",
"severity": ["critical", "high"],
"min_deviation": 3.0 # 3 standard deviations
},
"channels": ["dashboard", "log", "webhook"],
"cooldown_minutes": 15,
"escalation": {
"enabled": true,
"escalate_after_minutes": 30,
"escalate_to": "on_call"
}
}
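The `min_deviation` rule boils down to a z-score against a rolling baseline. A sketch of that check with illustrative numbers (the baseline mean and standard deviation would come from recent history):

# Hypothetical deviation check: how many standard deviations below baseline?
current_rate=78.0     # last 15 minutes
baseline_mean=96.0    # rolling 1h average
baseline_std=4.3      # rolling 1h standard deviation
MIN_DEVIATION=3.0

awk -v c="$current_rate" -v m="$baseline_mean" -v s="$baseline_std" -v t="$MIN_DEVIATION" 'BEGIN {
  z = (m - c) / s
  printf "deviation = %.1f sigma\n", z
  if (z >= t) { print "ALERT: success_rate_drop"; exit 1 }
}'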
Trigger example:
⚠️ ALERT: Critical Success Rate Drop
Time: 2025-12-17 14:23:00
Version: security-master v2.0.0 (challenger)
Baseline: 96% success (rolling 1h avg)
Current: 78% success (last 15 min)
Deviation: 4.2σ
ACTIONS TAKEN:
1. Auto-rollback initiated
2. Webhook sent to on-call
3. Detailed logs captured: coordination/observability/incidents/incident-20251217-142300.jsonl
Key Metrics Dashboard
Track these metrics in real-time during migrations:
| Metric | Description | Threshold |
|---|---|---|
| Success Rate | % tasks completed without errors | < 95% → investigate |
| Quality Score | LM-as-Judge evaluation (1-5) | < 4.0 → investigate |
| Execution Time | Median task duration | > 150% baseline → investigate |
| Token Usage | Tokens per task | > 200% baseline → cost concern |
| Error Types | Distribution of failure modes | New error type → investigate |
Dashboard view:
┌─────────────────────────────────────────────────────────┐
│ Migration: security-master v1.0.0 → v2.0.0 (Challenger) │
├─────────────────────────────────────────────────────────┤
│ Success Rate: ██████████████████░░ 94% (-2%) │
│ Quality Score: ███████████████████░ 4.3 (+0.2) │
│ Execution Time: ████████████████░░░░ 42s (+11%) │
│ Token Usage: █████████████░░░░░░░ 8.2K (-5%) │
│ │
│ Tasks (Last 1h): 47 assigned, 44 completed │
│ Errors: 3 timeouts, 0 critical │
│ │
│ Status: ✓ HEALTHY - within acceptable ranges │
└─────────────────────────────────────────────────────────┘
Rollback Procedures When Things Go Wrong
Automatic Rollback Triggers
Cortex auto-rolls back when thresholds are breached:
# Automatic rollback conditions
# ([[ x < y ]] compares strings in bash, so use arithmetic/awk for numeric checks)
if (( critical_failures > 0 )) || \
   awk -v r="$success_rate" -v d="$deviation" \
       'BEGIN { exit !(r < 85 || d > 4.0) }'; then
  log_error "Rollback triggered: success_rate=${success_rate}%"

  # Instant rollback
  ./scripts/promote-master.sh security rollback

  # Alert team
  send_alert "Auto-rollback: security-master v2.0.0 → v1.0.0"

  # Capture diagnostic data
  capture_incident_logs
fi
Manual Rollback
# One-command rollback
./scripts/promote-master.sh security rollback
# What happens:
# 1. Current champion → archived
# 2. Previous champion → restored to champion
# 3. Challenger → cleared
# 4. Version history updated
Rollback output:
Rolling back security-master...
Previous champion: v1.0.0
Current champion: v2.0.0
✓ Archived v2.0.0
✓ Restored v1.0.0 to champion
✓ Updated aliases.json
✓ Logged rollback to version history
SUCCESS: Rolled back to v1.0.0
Reason: Performance degradation detected
Next steps:
1. Review incident logs: coordination/observability/incidents/
2. Analyze v2.0.0 failures: coordination/evaluation/results/
3. Fix issues and re-test before re-deployment
Post-Rollback Analysis
After rolling back, understand what went wrong:
# Extract failed tasks for analysis
grep '"version":"v2.0.0"' coordination/logs/task-outcomes.jsonl | \
grep '"status":"failed"' > failed-tasks-v2.0.0.jsonl
# Aggregate failure patterns with jq
cat failed-tasks-v2.0.0.jsonl | \
jq -s '{
total_failures: length,
error_types: group_by(.error_type) | map({type: .[0].error_type, count: length}),
common_patterns: map(.task_description) | unique
}'
Analysis output:
{
"total_failures": 23,
"error_types": [
{"type": "timeout", "count": 15},
{"type": "validation_error", "count": 5},
{"type": "quality_threshold_not_met", "count": 3}
],
"common_patterns": [
"tasks involving multi-file refactoring",
"tasks with >500 line code changes",
"tasks requiring external API integration"
]
}
Root cause identified:
ISSUE: v2.0.0 prompt included more detailed analysis steps
IMPACT: 40% increase in execution time → more timeouts
FIX: Simplify analysis for large codebases, add timeout handling
VALIDATION: Re-test v2.0.1 in shadow with timeout fixes
Real Migration Case Studies from Cortex
Case Study 1: Coordinator Master - Routing Logic Overhaul
Context: Migrating from keyword-based routing to confidence-calibrated routing with multi-expert consideration.
Version: v1.0.0 → v2.0.0 (MAJOR)
Changes:
# v1.0.0: Simple keyword matching
- Match task keywords to expert specializations
- Pick expert with highest keyword score
- Binary decision: route or escalate
# v2.0.0: Confidence-calibrated routing
+ Score all experts with confidence levels
+ Consider multi-expert coordination for complex tasks
+ Gradual confidence calibration based on task complexity
+ Explain routing decisions with reasoning
Migration timeline:
| Stage | Duration | Result |
|---|---|---|
| Shadow | 72 hours | 94% success, quality +0.05, latency +8% |
| Challenger | 5 days | 96% success, quality +0.12, latency +5% |
| Champion | Ongoing | 97% success, quality stable, latency normalized |
Key learning:
“The 3-day shadow period caught an edge case: tasks with ambiguous descriptions (e.g., ‘update docs’) had 40% lower confidence but same success rate. We adjusted confidence calibration before promoting to challenger.”
Case Study 2: Security Master - CVE Database Integration
Context: Adding real-time CVE cross-referencing to security scans.
Version: v1.2.0 → v1.3.0 (MINOR - new capability)
Changes:
# Added capability
+ Before reporting vulnerabilities, cross-reference with NVD CVE database
+ Include CVE IDs, CVSS scores, and patch availability
+ Link to vendor security advisories
Migration approach: Skipped shadow (backward compatible), straight to 10% challenger.
Challenger results:
Duration: 48 hours
Tasks: 89 (challenger), 801 (champion)
Success rate: 98% (vs 96% champion)
Quality improvement: +0.18 (4.1 → 4.28)
Latency: +12s (external API calls)
DECISION: Accepted latency increase (better quality)
PROMOTED: v1.3.0 → champion after 2 days
Case Study 3: Implementation Worker - Test Coverage Requirements
Context: Enforcing 80% test coverage requirement.
Version: v2.1.0 → v2.2.0 (MINOR)
Changes:
# Before
- Write tests for new features
# After
+ Verify test coverage ≥ 80% before marking complete
+ Run coverage report: `npm run coverage`
+ Update coverage badge in README
+ Document untested edge cases with justification
Migration timeline:
Shadow (4 hours):
- 12 tasks completed
- 3 failures: Tasks didn’t have coverage tools configured
- Issue found: Assumption that all repos have coverage configured
Fix:
# v2.2.1 (PATCH)
+ Check if coverage tools available: `npm run coverage --version`
+ If not available, log warning and continue without coverage check
+ Include setup instructions in failure message
Challenger (24 hours):
- 64 tasks completed
- Success rate: 97%
- Coverage improvement: 65% avg → 83% avg
- Promoted to champion
Rollback event (Day 7):
Incident: Worker kept failing on monorepo packages
Cause: Coverage tool ran against entire monorepo (>500 packages)
Impact: 18 timeouts in 2 hours
Action: Auto-rollback to v2.1.0
Resolution: Added monorepo detection in v2.2.2, re-deployed
Best Practices and Common Pitfalls
Best Practices
1. Always use progressive deployment
# ✓ CORRECT
deploy-shadow → validate → promote-to-challenger → validate → promote-to-champion
# ✗ WRONG
vim prompt.md && deploy-to-production # YOLO deployment
2. Validate with golden dataset before shadow
# Catch obvious issues before production traffic
./evaluation/run-evaluation.sh --prompt new-prompt-v2.0.0
3. Give each task 2-3x its expected duration before judging it
# If average task = 30s, allow challenger tasks 60-90s before counting them as failures
# Catches issues that emerge only in longer-running tasks
4. Document rollback criteria upfront
## Rollback Criteria for v2.0.0 Deployment
- Success rate < 90% (baseline: 96%)
- Any critical security failures
- Quality score < 3.8 (baseline: 4.1)
- >5 tasks timeout in 1-hour window
5. Use semantic versioning strictly
# Changes behavior? Bump MAJOR
git commit -m "feat!: change routing algorithm (v1.x → v2.0)"
# Adds capability? Bump MINOR
git commit -m "feat: add CVE cross-referencing (v1.2 → v1.3)"
# Fixes/clarifies? Bump PATCH
git commit -m "fix: clarify timeout handling instructions (v1.3.0 → v1.3.1)"
Common Pitfalls
Pitfall 1: Skipping shadow deployment
# "It's just a typo fix, I'll deploy directly"
# Result: Typo was in critical conditional logic → 40% failure rate
Learning: Even “trivial” changes go through shadow. Takes 15 minutes, prevents disasters.
Pitfall 2: Insufficient sample size in challenger
# Promoted after 5 challenger tasks (all successful)
# Result: Real issue appeared at task 23 (edge case)
Learning: Wait for 30-50 tasks minimum or 24 hours, whichever is longer.
Pitfall 3: Ignoring latency increases
# "Quality improved by 15%, latency only increased 25%"
# Result: Timeout rate jumped from 2% → 18%
Learning: Set hard latency limits. If task timeout = 60s, max latency = 45s.
Pitfall 4: No automated rollback
# Manual monitoring, noticed issue 2 hours after deployment
# Result: 247 failed tasks before manual rollback
Learning: Automated rollback triggers saved 200+ failures in subsequent deployments.
Pitfall 5: Changing multiple things at once
# v2.0.0: New routing algorithm + new output format + new timeout handling
# Result: Can't isolate which change caused issues
Learning: One major change per version. Multi-change releases → branch testing.
The Migration Workflow: Complete Example
Here’s a real end-to-end migration from Cortex:
Goal: Improve development-master prompt to handle monorepo complexity better.
Step 1: Develop new prompt (v1.8.0 → v1.9.0)
# v1.9.0 additions
+ Detect monorepo structure: lerna.json, pnpm-workspace.yaml
+ Identify affected packages from file changes
+ Run tests only for affected packages
+ Document cross-package dependencies in PR description
Step 2: Validate with golden dataset
./evaluation/run-evaluation.sh --prompt development-master-v1.9.0
# Results: 47/50 passed (94%), ready for shadow
Step 3: Deploy to shadow
./scripts/promote-master.sh development deploy-shadow v1.9.0
# Monitor for 6 hours (2x average task duration)
tail -f coordination/observability/logs/shadow-v1.9.0.jsonl
Shadow results:
Tasks processed: 87
Success rate: 95% (baseline: 94%)
Quality: 4.3 (baseline: 4.2)
Latency: 48s (baseline: 52s)
Issues: 4 monorepo detection false positives
DECISION: Fix false positives before promoting
Step 4: Patch and re-shadow (v1.9.1)
# Fixed detection logic
./scripts/promote-master.sh development deploy-shadow v1.9.1
# 3 hours monitoring
# Results: 98% success, 0 detection issues
Step 5: Promote to challenger
./scripts/promote-master.sh development promote-to-challenger
# 10% traffic for 48 hours
Challenger results:
Tasks: 124 (challenger), 1116 (champion)
Success: 97% (champion: 94%)
Quality: 4.4 (champion: 4.2)
Test coverage: 86% (champion: 78%)
DECISION: ✓ Promote to champion
Step 6: Promote to champion
./scripts/promote-master.sh development promote-to-champion
# Update version history
# Continue monitoring for 7 days
Week 1 results:
Total tasks: 1,847
Success rate: 96% (+2% from v1.8.0)
Monorepo handling: 100% accurate
Rollback events: 0
OUTCOME: ✓ Successful migration
Conclusion: Migration as a Discipline
Prompt migrations aren’t just “editing text files.” They’re behavior changes to a production AI system.
The discipline:
- Test before deploying: Golden dataset validation
- Deploy progressively: Shadow → Challenger → Champion
- Monitor continuously: Real-time anomaly detection
- Rollback automatically: Don’t wait for disasters
- Learn from failures: Post-rollback analysis
The result: Zero-downtime evolution of your AI system.
In Cortex, we’ve run 67 prompt migrations over 8 months:
- 61 successful (91%)
- 6 rolled back (9%)
- 0 production outages
The infrastructure is built. The processes are tested. You can safely evolve your AI.
Tomorrow: Building Confidence Calibration - Teaching AI When It’s Unsure