
16 Versioned Prompt Templates: How Cortex Manages AI Prompts at Scale

Ryan Dahlberg
December 10, 2025 · 9 min read

When you’re running AI agents in production, one question becomes critical: How do you version, test, and roll back your prompts?

After building Cortex—a multi-agent system managing GitHub repositories with 4 master agents and 9 specialized workers—I learned that treating prompts like infrastructure-as-code isn’t just best practice. It’s survival.

Here’s how we built a prompt engineering system that supports 16 versioned templates, A/B testing, and deterministic rollbacks.

The Problem: Prompts in Production are Code

Early in Cortex development, our prompts lived as hardcoded strings scattered across shell scripts:

# The bad old days
SYSTEM_PROMPT="You are a security scan worker that..."

This approach failed spectacularly when:

  1. A prompt change broke 3 workers simultaneously (no version control)
  2. We couldn’t A/B test improvements (no infrastructure for variants)
  3. Rolling back required Git archaeology (prompts buried in scripts)
  4. Template variables were inconsistent (copy-paste drift)

We needed prompts to be versioned artifacts with the same rigor as application code.

Architecture: Central Prompt Registry

Cortex’s prompt system consists of three core components:

1. Centralized Template Storage

All prompts live in a single directory structure:

coordination/prompts/
├── README.md (versioning guidelines)
├── masters/
│   ├── coordinator.md
│   ├── development.md
│   ├── security.md
│   └── inventory.md
├── workers/
│   ├── implementation-worker.md
│   ├── scan-worker.md
│   ├── fix-worker.md
│   ├── test-worker.md
│   ├── review-worker.md
│   ├── pr-worker.md
│   ├── documentation-worker.md
│   ├── analysis-worker.md
│   └── catalog-worker.md
└── orchestrator/
    └── task-orchestrator.md

16 total templates: 4 masters + 9 workers + 1 orchestrator + 2 special purpose.

2. Semantic Versioning for Prompts

Each prompt follows semantic versioning (v{MAJOR}.{MINOR}.{PATCH}):

# Development Master Agent - System Prompt

**Agent Type**: Master Agent
**Version**: v2.1.0
**Last Updated**: 2025-11-27
**Token Budget**: 30,000 tokens

---

## Changelog

### v2.1.0 (2025-11-27)
- Added RAG context retrieval capabilities
- Enhanced worker spawning with context augmentation
- Improved error handling guidelines

### v2.0.0 (2025-11-15)
- BREAKING: Migrated to execution manager architecture
- Worker spawning now via EM for complex tasks
- Updated token budget allocation

### v1.0.0 (2025-11-01)
- Initial release

Version bump rules mirror software engineering:

  • MAJOR: Breaking changes to agent behavior or interface
  • MINOR: New features, backward-compatible capabilities
  • PATCH: Bug fixes, clarifications, minor improvements
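
These rules are easy to enforce mechanically. Here is a minimal sketch of a pre-commit/CI check that fails when a staged prompt file changes without a Version bump; the header format matches the templates above, but the paths and git plumbing are assumptions about your repo layout, not Cortex's actual hook:

# Pre-commit sketch: reject staged prompt edits that don't bump the Version header
git diff --cached --name-only -- coordination/prompts/ | grep '\.md$' | while read -r file; do
    old_header=$(git show "HEAD:${file}" 2>/dev/null | grep -m1 '^\*\*Version\*\*')
    new_header=$(grep -m1 '^\*\*Version\*\*' "$file")

    if [ -n "$old_header" ] && [ "$old_header" = "$new_header" ]; then
        echo "ERROR: ${file} changed but its Version header did not" >&2
        exit 1
    fi
done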

3. Template Variables

Prompts support runtime variable replacement:

Your worker ID is: {{WORKER_ID}}
Your task: {{TASK_DESCRIPTION}}
Token budget: {{TOKEN_BUDGET}}
Repository: {{REPOSITORY}}
Knowledge base: {{KNOWLEDGE_BASE_PATH}}

Scripts perform variable substitution at spawn time:

PROMPT_FILE="coordination/prompts/masters/development.md"
PROMPT_CONTENT=$(cat "$PROMPT_FILE")

# Replace template variables
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{MASTER_ID\}\}/$MASTER_ID}"
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{SESSION_ID\}\}/$SESSION_ID}"
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{TOKEN_BUDGET\}\}/$TOKEN_BUDGET}"

Worker specifications reference the template path:

{
  "worker_id": "worker-impl-001",
  "prompt_template": "coordination/prompts/workers/implementation-worker.md",
  "prompt_variables": {
    "WORKER_ID": "worker-impl-001",
    "TASK_ID": "task-500",
    "TOKEN_BUDGET": "10000",
    "REPOSITORY": "ry-ops/cortex"
  }
}
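
At spawn time, the same substitution can be driven entirely by the spec, so no per-worker script has to hard-code variable names. A minimal sketch, assuming the spec above is saved as worker-spec.json (the helper is illustrative, not Cortex's actual spawn script):

#!/usr/bin/env bash
# Render a worker prompt from its spec: load the referenced template,
# then substitute every key in prompt_variables as {{KEY}} -> value.
SPEC_FILE="worker-spec.json"

TEMPLATE_PATH=$(jq -r '.prompt_template' "$SPEC_FILE")
PROMPT_CONTENT=$(cat "$TEMPLATE_PATH")

while IFS=$'\t' read -r key value; do
    PROMPT_CONTENT="${PROMPT_CONTENT//\{\{${key}\}\}/${value}}"
done < <(jq -r '.prompt_variables | to_entries[] | [.key, .value] | @tsv' "$SPEC_FILE")

echo "$PROMPT_CONTENT"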

Version Management: The Prompt Registry

The prompt-manager.sh library provides version control APIs:

# Register new prompt version
register_prompt_version \
    "implementation-worker" \
    "v2.1.0" \
    "$PROMPT_CONTENT" \
    "Added RAG context retrieval"

# Get production version
production_version=$(get_production_version "implementation-worker")

# Activate a specific version
activate_prompt_version "implementation-worker-v2.1.0"

# Get prompt with A/B testing
prompt=$(get_prompt "scan-worker" --ab-test)

The registry tracks metadata in coordination/prompt-versions/registry.json:

{
  "version": "1.0.0",
  "prompts": {
    "implementation-worker": [
      {
        "version_id": "v2.1.0",
        "description": "Added RAG context retrieval",
        "file_path": "coordination/prompt-versions/implementation-worker/v2.1.0.md",
        "is_control": false,
        "created_at": "2025-11-27T10:00:00Z",
        "status": "active",
        "metrics": {
          "total_uses": 0,
          "successes": 0,
          "failures": 0,
          "success_rate": 0
        }
      }
    ]
  }
}
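
Reading the registry back out is a one-line jq query. For example, resolving the file path of the latest active version for a prompt type (a sketch against the layout above):

# Latest active version's file for a given prompt type
PROMPT_TYPE="implementation-worker"
jq -r --arg type "$PROMPT_TYPE" '
    .prompts[$type]
    | map(select(.status == "active"))
    | last
    | .file_path
' coordination/prompt-versions/registry.json
# -> coordination/prompt-versions/implementation-worker/v2.1.0.md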

A/B Testing Infrastructure

The real power comes from testing prompt variants in production.

Creating an A/B Test

# Test two prompt versions for security scanning:
# control v2.0.0 vs variant v2.1.0, with 30% of traffic to the variant
test_id=$(start_ab_test \
    "scan-worker" \
    "v2.0.0" \
    "v2.1.0" \
    30 \
    "RAG context experiment")

# Output: ab-1732723200-a4f2c8

Traffic Splitting

Cortex assigns traffic deterministically, hashing each task into a bucket. A simplified excerpt of the selection logic:

select_ab_version() {
    local prompt_type="$1"

    # Find active test
    local active_test=$(jq -r --arg type "$prompt_type" '
        .active_tests | to_entries[] |
        select(.value.prompt_type == $type and .value.status == "running") |
        .key
    ' "$PROMPT_AB_CONFIG_FILE" | head -1)

    # Load the test configuration ($test_config, $control, and $variant come
    # from the active test's entry in the A/B config; loading elided here)
    local traffic_split=$(echo "$test_config" | jq -r '.traffic_split_percent')

    # Pick a 0-99 bucket and compare it to the traffic split
    local random=$((RANDOM % 100))

    if [ "$random" -lt "$traffic_split" ]; then
        selected_version="$variant"
        selection_group="variant"
    else
        selected_version="$control"
        selection_group="control"
    fi

    # Return version and metadata
    jq -nc \
        --arg version "$selected_version" \
        --arg test_id "$active_test" \
        --arg group "$selection_group" \
        '{version_id: $version, ab_test_id: $test_id, test_group: $group}'
}

Same task ID always gets the same variant (reproducibility).
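
The excerpt above uses $RANDOM to keep the listing short; to get that reproducibility, the bucket can be derived from a hash of the task ID rather than a random draw. A minimal sketch, with cksum standing in for whatever hash the real implementation uses:

# Deterministic bucketing: the same task ID always maps to the same group
assign_test_group() {
    local task_id="$1"
    local traffic_split="$2"   # percent of traffic sent to the variant

    # Hash the task ID into a stable 0-99 bucket
    local bucket=$(( $(printf '%s' "$task_id" | cksum | cut -d' ' -f1) % 100 ))

    if [ "$bucket" -lt "$traffic_split" ]; then
        echo "variant"
    else
        echo "control"
    fi
}

assign_test_group "task-500" 30   # same group every time for task-500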

Recording Outcomes

After each worker completes, we record success/failure:

record_prompt_outcome \
    "$version_id" \
    "success" \
    "$prompt_type" \
    "$ab_test_id" \
    "$test_group"

Outcomes append to coordination/prompt-versions/outcomes.jsonl:

{"version_id":"v2.1.0","outcome":"success","prompt_type":"scan-worker","ab_test_id":"ab-1732723200-a4f2c8","test_group":"variant","recorded_at":"2025-11-27T15:30:00Z"}
{"version_id":"v2.0.0","outcome":"success","prompt_type":"scan-worker","ab_test_id":"ab-1732723200-a4f2c8","test_group":"control","recorded_at":"2025-11-27T15:31:00Z"}

Analyzing Results

After collecting 30+ samples per variant:

analyze_ab_test "$test_id"

Output:

{
  "test_id": "ab-1732723200-a4f2c8",
  "control": {
    "uses": 45,
    "successes": 41,
    "success_rate": 0.9111
  },
  "variant": {
    "uses": 38,
    "successes": 36,
    "success_rate": 0.9474
  },
  "improvement_percent": 3.98,
  "winner": "variant",
  "statistically_significant": true
}

Variant wins! 94.7% vs 91.1% success rate.
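
Those per-group numbers fall straight out of outcomes.jsonl. A minimal jq sketch that recomputes them for one test (the field names match the records shown above):

TEST_ID="ab-1732723200-a4f2c8"
jq -s --arg test "$TEST_ID" '
    map(select(.ab_test_id == $test))
    | group_by(.test_group)
    | map({
        group: .[0].test_group,
        uses: length,
        successes: (map(select(.outcome == "success")) | length)
      })
    | map(. + {success_rate: (.successes / .uses)})
' coordination/prompt-versions/outcomes.jsonl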

Auto-Promotion

With auto_promote: true in settings, winners activate automatically:

end_ab_test "$test_id" true

# Promotes variant to production
# Updates registry status
# Deprecates losing version
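
A sketch of what that promotion step can look like, wired up from the analyze/activate calls shown earlier (the jq fields match the analysis output above; the version string is from this example):

# Auto-promotion sketch: activate the variant only on a significant win
analysis=$(analyze_ab_test "$test_id")
winner=$(echo "$analysis" | jq -r '.winner')
significant=$(echo "$analysis" | jq -r '.statistically_significant')

if [ "$significant" = "true" ] && [ "$winner" = "variant" ]; then
    activate_prompt_version "scan-worker-v2.1.0"   # the variant under test
fi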

Real-World Example: Security Scanner Upgrade

In November 2025, we upgraded the security scan-worker prompt to include CVE context retrieval.

Hypothesis: Adding CVE database lookups improves vulnerability detection accuracy.

Implementation:

<!-- v2.0.0: Control -->
Run `npm audit` and report HIGH/CRITICAL vulnerabilities.

<!-- v2.1.0: Variant -->
Run `npm audit` and report HIGH/CRITICAL vulnerabilities.
For each CVE, retrieve detailed information from NIST NVD:
- CVSS score breakdown
- Attack vector details
- Exploitation likelihood
- Recommended remediation

Test Configuration:

  • Traffic split: 70% control, 30% variant (conservative)
  • Minimum samples: 30 per variant
  • Success criteria: Scan completes without errors
  • Quality metric: Number of actionable findings

Results after 72 hours:

Metric             Control v2.0.0    Variant v2.1.0    Change
Success rate       89.2%             91.5%             +2.3%
Avg findings       4.1               5.7               +39%
False positives    1.2               0.8               -33%
Avg duration       42s               58s               +38%

Decision: Promote v2.1.0 despite slower execution—quality improvement outweighs speed cost.

Rollback Capability

When v2.2.0 of the fix-worker caused workers to hang, rollback was instant:

# Identify problem
head -10 coordination/prompts/workers/fix-worker.md | grep Version
# **Version**: v2.2.0

# Check git history
git log --oneline coordination/prompts/workers/fix-worker.md

# Revert to v2.1.5
git checkout abc123 -- coordination/prompts/workers/fix-worker.md

# Update changelog
cat >> coordination/prompts/workers/fix-worker.md <<EOF

### v2.2.1 (2025-11-28)
- ROLLBACK: Reverted v2.2.0 due to worker timeouts
- Restored: v2.1.5 behavior (stable)
EOF

# Commit and redeploy
git add coordination/prompts/workers/fix-worker.md
git commit -m "rollback(prompts): fix-worker v2.2.1 - revert timeout issue"

Downtime: 4 minutes from detection to fix.

Best Practices

1. Single Responsibility Principle

Each prompt defines ONE agent role:

# Good
You are a scan-worker specialized in security vulnerability detection.

# Bad
You are a scan-worker that also fixes issues and writes documentation.

2. Explicit Success Criteria

Define measurable outcomes:

**Success Criteria:**
- All files scanned without errors
- Vulnerabilities reported in JSON format
- CVSS scores included for each finding
- Exit status 0 on completion

3. Bounded Scope

Clearly define limits:

**What you DO:**
- Scan dependencies for vulnerabilities
- Generate structured vulnerability report
- Suggest remediation steps

**What you DON'T do:**
- Apply fixes (that's fix-worker's job)
- Make deployment decisions (that's master's job)
- Modify code outside scan scope

4. Comprehensive Changelog

Track every change with context:

### v2.3.0 (2025-12-01)
- Added: Support for Python package scanning via `pip-audit`
- Changed: CVSS threshold from 7.0 to 5.0 (medium severity)
- Fixed: Parsing error with npm audit JSON output
- Deprecated: Legacy XML report format

5. Test Before Deploying

Use draft status for new versions:

# Create as draft
register_prompt_version \
    "scan-worker" \
    "v2.3.0" \
    "$PROMPT_CONTENT" \
    "Added Python scanning" \
    false  # Not control

# Test with specific version
get_prompt "scan-worker" "v2.3.0"

# Activate after validation
activate_prompt_version "scan-worker-v2.3.0"

Implementation Guide

Want to build a similar system? Here’s the blueprint:

Step 1: Centralize Prompts

Move all prompts to a single directory:

mkdir -p prompts/{masters,workers,special}
# Extract prompts from scripts
grep -r "SYSTEM_PROMPT=" scripts/ > prompts-to-migrate.txt

Step 2: Add Version Headers

Standardize format:

# {Agent Name} - System Prompt

**Agent Type**: Master | Worker
**Version**: v1.0.0
**Last Updated**: YYYY-MM-DD
**Token Budget**: N tokens

---

## Changelog

### v1.0.0 (YYYY-MM-DD)
- Initial release
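
With the header standardized, scripts can read it back directly. A small helper, assuming the exact format above:

# Extract the version string from a prompt file's header
get_prompt_file_version() {
    grep -m1 '^\*\*Version\*\*' "$1" | sed 's/.*: *//'
}

get_prompt_file_version prompts/masters/coordinator.md
# -> v1.0.0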

Step 3: Build Prompt Registry

Create prompt-manager.sh:

register_prompt_version() {
    local prompt_type="$1"
    local version_id="$2"
    local prompt_content="$3"

    # Save prompt (create the per-type directory on first use)
    mkdir -p "prompts/${prompt_type}"
    echo "$prompt_content" > "prompts/${prompt_type}/${version_id}.md"

    # Update registry.json
    jq --arg type "$prompt_type" \
       --arg version "$version_id" \
       '.prompts[$type] += [{version_id: $version, status: "active"}]' \
       registry.json > registry.tmp && mv registry.tmp registry.json
}
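
A matching lookup keeps callers out of the JSON. This sketch treats the most recently registered active entry as production; adapt it to however you flag your production version:

get_production_version() {
    local prompt_type="$1"

    jq -r --arg type "$prompt_type" '
        (.prompts[$type] // [])
        | map(select(.status == "active"))
        | last
        | .version_id // empty
    ' registry.json
}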

Step 4: Implement A/B Testing

Add traffic splitting:

select_ab_version() {
    local prompt_type="$1"
    local traffic_split=50
    local random=$((RANDOM % 100))

    if [ "$random" -lt "$traffic_split" ]; then
        echo "variant"
    else
        echo "control"
    fi
}

Step 5: Track Outcomes

Record every execution:

record_outcome() {
    local version_id="$1"
    local outcome="$2"

    echo "{\"version\":\"$version_id\",\"outcome\":\"$outcome\",\"timestamp\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" \
        >> outcomes.jsonl
}
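
Once outcomes accumulate, success rates per version are one jq query away (same simplified record shape as above):

# Success rate per prompt version from outcomes.jsonl
jq -s '
    group_by(.version)
    | map({
        version: .[0].version,
        uses: length,
        success_rate: ((map(select(.outcome == "success")) | length) / length)
      })
' outcomes.jsonl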

Benefits Realized

After 3 months of production use:

Consistency

  • 100% template compliance (automated validation)
  • Zero variable naming drift (centralized definitions)
  • Uniform changelog format (enforced by CI)

Testability

  • 12 A/B tests run (8 promoted, 3 rolled back, 1 inconclusive)
  • Average test duration: 48 hours
  • Confidence threshold: 95% for promotion

Rollback Speed

  • Average rollback time: 6 minutes (detection to fix)
  • Longest rollback: 18 minutes (required emergency hotfix)
  • Rollbacks performed: 5 in 3 months

Deployment Velocity

  • Prompt changes per week: 2.3 (up from 0.4 before versioning)
  • Failed deployments: 1.2% (down from 8.7%)
  • Time to production: 2 hours (down from 2 days)

Key Takeaways

  1. Treat prompts like infrastructure-as-code—version control, testing, and rollbacks are essential.

  2. Semantic versioning works for prompts—MAJOR/MINOR/PATCH semantics map perfectly to prompt changes.

  3. A/B testing de-risks improvements—gradual rollouts catch issues before full deployment.

  4. Template variables reduce duplication—centralized definitions prevent copy-paste drift.

  5. Changelogs are documentation—future you will thank past you for detailed change logs.

  6. Automation enables velocity—manual prompt management doesn’t scale beyond 5 agents.

The full Cortex codebase is on GitHub at ry-ops/cortex. Check coordination/prompts/ for real production examples.

What’s your prompt versioning strategy? Reach out on Twitter/X or open an issue if you’re building something similar.


Next in series: Part 14: Multi-Agent Coordination Patterns - How 4 masters and 9 workers communicate via JSON specs.

#Cortex #Prompt Engineering #AI #Production