16 Versioned Prompt Templates: How Cortex Manages AI Prompts at Scale
When you’re running AI agents in production, one question becomes critical: How do you version, test, and roll back your prompts?
After building Cortex—a multi-agent system managing GitHub repositories with 4 master agents and 9 specialized workers—I learned that treating prompts like infrastructure-as-code isn’t just best practice. It’s survival.
Here’s how we built a prompt engineering system that supports 16 versioned templates, A/B testing, and deterministic rollbacks.
The Problem: Prompts in Production Are Code
Early in Cortex development, our prompts lived as hardcoded strings scattered across shell scripts:
# The bad old days
SYSTEM_PROMPT="You are a security scan worker that..."
This approach failed spectacularly when:
- A prompt change broke 3 workers simultaneously (no version control)
- We couldn’t A/B test improvements (no infrastructure for variants)
- Rolling back required Git archaeology (prompts buried in scripts)
- Template variables were inconsistent (copy-paste drift)
We needed prompts to be versioned artifacts with the same rigor as application code.
Architecture: Central Prompt Registry
Cortex’s prompt system consists of three core components:
1. Centralized Template Storage
All prompts live in a single directory structure:
coordination/prompts/
├── README.md (versioning guidelines)
├── masters/
│   ├── coordinator.md
│   ├── development.md
│   ├── security.md
│   └── inventory.md
├── workers/
│   ├── implementation-worker.md
│   ├── scan-worker.md
│   ├── fix-worker.md
│   ├── test-worker.md
│   ├── review-worker.md
│   ├── pr-worker.md
│   ├── documentation-worker.md
│   ├── analysis-worker.md
│   └── catalog-worker.md
└── orchestrator/
    └── task-orchestrator.md
16 total templates: 4 masters + 9 workers + 1 orchestrator + 2 special-purpose templates (the special-purpose pair isn't shown in the tree above).
2. Semantic Versioning for Prompts
Each prompt follows semantic versioning (v{MAJOR}.{MINOR}.{PATCH}):
# Development Master Agent - System Prompt
**Agent Type**: Master Agent
**Version**: v2.1.0
**Last Updated**: 2025-11-27
**Token Budget**: 30,000 tokens
---
## Changelog
### v2.1.0 (2025-11-27)
- Added RAG context retrieval capabilities
- Enhanced worker spawning with context augmentation
- Improved error handling guidelines
### v2.0.0 (2025-11-15)
- BREAKING: Migrated to execution manager architecture
- Worker spawning now via EM for complex tasks
- Updated token budget allocation
### v1.0.0 (2025-11-01)
- Initial release
Version bump rules mirror software engineering (a quick header check is sketched after this list):
- MAJOR: Breaking changes to agent behavior or interface
- MINOR: New features, backward-compatible capabilities
- PATCH: Bug fixes, clarifications, minor improvements
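To keep those headers honest, it's worth failing fast on malformed version lines. A minimal sketch of such a check, assuming a hypothetical check-prompt-headers.sh helper (not part of the published repo):
# check-prompt-headers.sh - reject templates without a semver Version header
status=0
while IFS= read -r f; do
    if ! grep -Eq '^\*\*Version\*\*: v[0-9]+\.[0-9]+\.[0-9]+' "$f"; then
        echo "Missing or malformed version header: $f"
        status=1
    fi
done < <(find coordination/prompts -name '*.md' ! -name 'README.md')
exit $status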
3. Template Variables
Prompts support runtime variable replacement:
Your worker ID is: {{WORKER_ID}}
Your task: {{TASK_DESCRIPTION}}
Token budget: {{TOKEN_BUDGET}}
Repository: {{REPOSITORY}}
Knowledge base: {{KNOWLEDGE_BASE_PATH}}
Scripts perform variable substitution at spawn time:
PROMPT_FILE="coordination/prompts/masters/development.md"
PROMPT_CONTENT=$(cat "$PROMPT_FILE")
# Replace template variables
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{MASTER_ID\}\}/$MASTER_ID}"
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{SESSION_ID\}\}/$SESSION_ID}"
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{TOKEN_BUDGET\}\}/$TOKEN_BUDGET}"
Worker specifications reference the template path:
{
"worker_id": "worker-impl-001",
"prompt_template": "coordination/prompts/workers/implementation-worker.md",
"prompt_variables": {
"WORKER_ID": "worker-impl-001",
"TASK_ID": "task-500",
"TOKEN_BUDGET": "10000",
"REPOSITORY": "ry-ops/cortex"
}
}
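Rather than hardcoding each replacement, the spawn script can loop over whatever prompt_variables the spec declares. A minimal sketch, assuming an illustrative spec file path (the substitution syntax matches the snippet above):
SPEC_FILE="coordination/specs/worker-impl-001.json"  # illustrative spec location
PROMPT_FILE=$(jq -r '.prompt_template' "$SPEC_FILE")
PROMPT_CONTENT=$(cat "$PROMPT_FILE")
# Substitute every {{KEY}} listed in prompt_variables
while IFS=$'\t' read -r key value; do
    PROMPT_CONTENT="${PROMPT_CONTENT//\{\{$key\}\}/$value}"
done < <(jq -r '.prompt_variables | to_entries[] | [.key, .value] | @tsv' "$SPEC_FILE")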
Version Management: The Prompt Registry
The prompt-manager.sh library provides version control APIs:
# Register new prompt version
register_prompt_version \
"implementation-worker" \
"v2.1.0" \
"$PROMPT_CONTENT" \
"Added RAG context retrieval"
# Get production version
production_version=$(get_production_version "implementation-worker")
# Activate a specific version
activate_prompt_version "implementation-worker-v2.1.0"
# Get prompt with A/B testing
prompt=$(get_prompt "scan-worker" --ab-test)
The registry tracks metadata in coordination/prompt-versions/registry.json:
{
"version": "1.0.0",
"prompts": {
"implementation-worker": [
{
"version_id": "v2.1.0",
"description": "Added RAG context retrieval",
"file_path": "coordination/prompt-versions/implementation-worker/v2.1.0.md",
"is_control": false,
"created_at": "2025-11-27T10:00:00Z",
"status": "active",
"metrics": {
"total_uses": 0,
"successes": 0,
"failures": 0,
"success_rate": 0
}
}
]
}
}
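The read path isn't shown above, but get_production_version can be a small jq query over that registry. A sketch, assuming "production" means the most recently created entry whose status is active:
get_production_version() {
    local prompt_type="$1"
    jq -r --arg type "$prompt_type" '
        (.prompts[$type] // [])
        | map(select(.status == "active"))
        | sort_by(.created_at)
        | last
        | .version_id // empty
    ' coordination/prompt-versions/registry.json
}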
A/B Testing Infrastructure
The real power comes from testing prompt variants in production.
Creating an A/B Test
# Test two prompt versions for security scanning
# Control: v2.0.0, variant: v2.1.0, 30% of traffic to the variant
test_id=$(start_ab_test \
    "scan-worker" \
    "v2.0.0" \
    "v2.1.0" \
    30 \
    "RAG context experiment")
# Output: ab-1732723200-a4f2c8
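For context, the traffic-splitting function in the next section reads an A/B config file shaped roughly like this. The prompt_type, status, and traffic_split_percent fields appear in the real jq queries; the remaining field names are illustrative assumptions:
{
  "active_tests": {
    "ab-1732723200-a4f2c8": {
      "prompt_type": "scan-worker",
      "control_version": "v2.0.0",
      "variant_version": "v2.1.0",
      "traffic_split_percent": 30,
      "status": "running",
      "description": "RAG context experiment"
    }
  }
}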
Traffic Splitting
Cortex uses deterministic hashing for traffic assignment:
select_ab_version() {
    local prompt_type="$1"
    local task_id="$2"
    # Find the active test for this prompt type
    local active_test=$(jq -r --arg type "$prompt_type" '
        .active_tests | to_entries[] |
        select(.value.prompt_type == $type and .value.status == "running") |
        .key
    ' "$PROMPT_AB_CONFIG_FILE" | head -1)
    # Pull control, variant, and split for that test
    # (field names as in the config sketch above)
    local test_config=$(jq -c --arg id "$active_test" '.active_tests[$id]' "$PROMPT_AB_CONFIG_FILE")
    local control=$(echo "$test_config" | jq -r '.control_version')
    local variant=$(echo "$test_config" | jq -r '.variant_version')
    local traffic_split=$(echo "$test_config" | jq -r '.traffic_split_percent')
    # Deterministic assignment: hash the task ID into a 0-99 bucket
    # so the same task always lands in the same group
    local bucket=$(( $(printf '%s' "$task_id" | cksum | cut -d' ' -f1) % 100 ))
    if [ "$bucket" -lt "$traffic_split" ]; then
        selected_version="$variant"
        selection_group="variant"
    else
        selected_version="$control"
        selection_group="control"
    fi
    # Return version and metadata
    jq -nc \
        --arg version "$selected_version" \
        --arg test_id "$active_test" \
        --arg group "$selection_group" \
        '{version_id: $version, ab_test_id: $test_id, test_group: $group}'
}
Same task ID always gets the same variant (reproducibility).
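A usage sketch: the spawn script passes the task ID along, so retries of the same task see the same prompt version:
# task-500 always hashes to the same bucket
assignment=$(select_ab_version "scan-worker" "task-500")
version_id=$(echo "$assignment" | jq -r '.version_id')
test_group=$(echo "$assignment" | jq -r '.test_group')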
Recording Outcomes
After each worker completes, we record success/failure:
record_prompt_outcome \
"$version_id" \
"success" \
"$prompt_type" \
"$ab_test_id" \
"$test_group"
Outcomes append to coordination/prompt-versions/outcomes.jsonl:
{"version_id":"v2.1.0","outcome":"success","prompt_type":"scan-worker","ab_test_id":"ab-1732723200-a4f2c8","test_group":"variant","recorded_at":"2025-11-27T15:30:00Z"}
{"version_id":"v2.0.0","outcome":"success","prompt_type":"scan-worker","ab_test_id":"ab-1732723200-a4f2c8","test_group":"control","recorded_at":"2025-11-27T15:31:00Z"}
Analyzing Results
After collecting 30+ samples per variant:
analyze_ab_test "$test_id"
Output:
{
"test_id": "ab-1732723200-a4f2c8",
"control": {
"uses": 45,
"successes": 41,
"success_rate": 0.9111
},
"variant": {
"uses": 38,
"successes": 36,
"success_rate": 0.9474
},
"improvement_percent": 3.98,
"winner": "variant",
"statistically_significant": true
}
Variant wins! 94.7% vs 91.1% success rate.
Auto-Promotion
With auto_promote: true in settings, winners activate automatically:
end_ab_test "$test_id" true
# Promotes variant to production
# Updates registry status
# Deprecates losing version
Real-World Example: Security Scanner Upgrade
In November 2025, we upgraded the security scan-worker prompt to include CVE context retrieval.
Hypothesis: Adding CVE database lookups improves vulnerability detection accuracy.
Implementation:
<!-- v2.0.0: Control -->
Run `npm audit` and report HIGH/CRITICAL vulnerabilities.
<!-- v2.1.0: Variant -->
Run `npm audit` and report HIGH/CRITICAL vulnerabilities.
For each CVE, retrieve detailed information from NIST NVD:
- CVSS score breakdown
- Attack vector details
- Exploitation likelihood
- Recommended remediation
Test Configuration:
- Traffic split: 70% control, 30% variant (conservative)
- Minimum samples: 30 per variant
- Success criteria: Scan completes without errors
- Quality metric: Number of actionable findings
Results after 72 hours:
| Metric | Control v2.0.0 | Variant v2.1.0 | Change |
|---|---|---|---|
| Success rate | 89.2% | 91.5% | +2.3 pts |
| Avg findings | 4.1 | 5.7 | +39% |
| False positives | 1.2 | 0.8 | -33% |
| Avg duration | 42s | 58s | +38% |
Decision: Promote v2.1.0 despite slower execution—quality improvement outweighs speed cost.
Rollback Capability
When v2.2.0 of the fix-worker caused workers to hang, rollback was instant:
# Identify problem
head -10 coordination/prompts/workers/fix-worker.md | grep Version
# Version: v2.2.0
# Check git history
git log --oneline coordination/prompts/workers/fix-worker.md
# Revert to v2.1.5
git checkout abc123 -- coordination/prompts/workers/fix-worker.md
# Update changelog
cat >> coordination/prompts/workers/fix-worker.md <<EOF
### v2.2.1 (2025-11-28)
- ROLLBACK: Reverted v2.2.0 due to worker timeouts
- Restored: v2.1.5 behavior (stable)
EOF
# Commit and redeploy
git add coordination/prompts/workers/fix-worker.md
git commit -m "rollback(prompts): fix-worker v2.2.1 - revert timeout issue"
Downtime: 4 minutes from detection to fix.
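Because every version also lives in the prompt registry, the same rollback can be done without git archaeology, assuming v2.1.5 was registered earlier:
# Re-activate the last known-good version in the registry
activate_prompt_version "fix-worker-v2.1.5"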
Best Practices
1. Single Responsibility Principle
Each prompt defines ONE agent role:
# Good
You are a scan-worker specialized in security vulnerability detection.
# Bad
You are a scan-worker that also fixes issues and writes documentation.
2. Explicit Success Criteria
Define measurable outcomes:
**Success Criteria:**
- All files scanned without errors
- Vulnerabilities reported in JSON format
- CVSS scores included for each finding
- Exit status 0 on completion
3. Bounded Scope
Clearly define limits:
**What you DO:**
- Scan dependencies for vulnerabilities
- Generate structured vulnerability report
- Suggest remediation steps
**What you DON'T do:**
- Apply fixes (that's fix-worker's job)
- Make deployment decisions (that's master's job)
- Modify code outside scan scope
4. Comprehensive Changelog
Track every change with context:
### v2.3.0 (2025-12-01)
- Added: Support for Python package scanning via `pip-audit`
- Changed: CVSS threshold from 7.0 to 5.0 (medium severity)
- Fixed: Parsing error with npm audit JSON output
- Deprecated: Legacy XML report format
5. Test Before Deploying
Use draft status for new versions:
# Create as draft
register_prompt_version \
"scan-worker" \
"v2.3.0" \
"$PROMPT_CONTENT" \
"Added Python scanning" \
false # Not control
# Test with specific version
get_prompt "scan-worker" "v2.3.0"
# Activate after validation
activate_prompt_version "scan-worker-v2.3.0"
Implementation Guide
Want to build a similar system? Here’s the blueprint:
Step 1: Centralize Prompts
Move all prompts to a single directory:
mkdir -p prompts/{masters,workers,special}
# Extract prompts from scripts
grep -r "SYSTEM_PROMPT=" scripts/ > prompts-to-migrate.txt
Step 2: Add Version Headers
Standardize format:
# {Agent Name} - System Prompt
**Agent Type**: Master | Worker
**Version**: v1.0.0
**Last Updated**: YYYY-MM-DD
**Token Budget**: N tokens
---
## Changelog
### v1.0.0 (YYYY-MM-DD)
- Initial release
Step 3: Build Prompt Registry
Create prompt-manager.sh:
register_prompt_version() {
    local prompt_type="$1"
    local version_id="$2"
    local prompt_content="$3"
    local description="$4"
    # Save prompt to its own versioned file
    mkdir -p "prompts/${prompt_type}"
    echo "$prompt_content" > "prompts/${prompt_type}/${version_id}.md"
    # Append the new version to registry.json
    jq --arg type "$prompt_type" \
       --arg version "$version_id" \
       --arg desc "$description" \
       '.prompts[$type] += [{version_id: $version, description: $desc, status: "active"}]' \
       registry.json > registry.tmp && mv registry.tmp registry.json
}
Step 4: Implement A/B Testing
Add traffic splitting:
select_ab_version() {
    local prompt_type="$1"
    local traffic_split=50
    # Simplified random split; for reproducibility, replace $RANDOM with a
    # hash of the task ID (as in the production select_ab_version above)
    local random=$((RANDOM % 100))
    if [ "$random" -lt "$traffic_split" ]; then
        echo "variant"
    else
        echo "control"
    fi
}
Step 5: Track Outcomes
Record every execution:
record_outcome() {
local version_id="$1"
local outcome="$2"
echo "{\"version\":\"$version_id\",\"outcome\":\"$outcome\",\"timestamp\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" \
>> outcomes.jsonl
}
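Once outcomes accumulate, per-version success rates can be summarized straight from the JSONL file. A minimal jq sketch over the field names written by record_outcome above:
jq -s '
  group_by(.version)
  | map({
      version: .[0].version,
      uses: length,
      successes: (map(select(.outcome == "success")) | length)
    })
  | map(. + {success_rate: (if .uses > 0 then (.successes / .uses) else 0 end)})
' outcomes.jsonl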
Benefits Realized
After 3 months of production use:
Consistency
- 100% template compliance (automated validation)
- Zero variable naming drift (centralized definitions)
- Uniform changelog format (enforced by CI)
Testability
- 12 A/B tests run (8 promoted, 3 rolled back, 1 inconclusive)
- Average test duration: 48 hours
- Confidence threshold: 95% for promotion
Rollback Speed
- Average rollback time: 6 minutes (detection to fix)
- Longest rollback: 18 minutes (required emergency hotfix)
- Rollbacks performed: 5 in 3 months
Deployment Velocity
- Prompt changes per week: 2.3 (up from 0.4 before versioning)
- Failed deployments: 1.2% (down from 8.7%)
- Time to production: 2 hours (down from 2 days)
Key Takeaways
- Treat prompts like infrastructure-as-code—version control, testing, and rollbacks are essential.
- Semantic versioning works for prompts—MAJOR/MINOR/PATCH semantics map perfectly to prompt changes.
- A/B testing de-risks improvements—gradual rollouts catch issues before full deployment.
- Template variables reduce duplication—centralized definitions prevent copy-paste drift.
- Changelogs are documentation—future you will thank past you for detailed change logs.
- Automation enables velocity—manual prompt management doesn’t scale beyond 5 agents.
The full Cortex codebase is on GitHub at ry-ops/cortex. Check coordination/prompts/ for real production examples.
What’s your prompt versioning strategy? Reach out on Twitter/X or open an issue if you’re building something similar.
Next in series: Part 14: Multi-Agent Coordination Patterns - How 4 masters and 9 workers communicate via JSON specs.