16 Versioned Prompt Templates: How Cortex Manages AI Prompts at Scale
When you’re running AI agents in production, one question becomes critical: How do you version, test, and roll back your prompts?
After building Cortex—a multi-agent system managing GitHub repositories with 4 master agents and 9 specialized workers—I learned that treating prompts like infrastructure-as-code isn’t just best practice. It’s survival.
Here’s how we built a prompt engineering system that supports 16 versioned templates, A/B testing, and deterministic rollbacks.
The Problem: Prompts in Production Are Code
Early in Cortex development, our prompts lived as hardcoded strings scattered across shell scripts:
# The bad old days
SYSTEM_PROMPT="You are a security scan worker that..."
This approach failed spectacularly when:
- A prompt change broke 3 workers simultaneously (no version control)
- We couldn’t A/B test improvements (no infrastructure for variants)
- Rolling back required Git archaeology (prompts buried in scripts)
- Template variables were inconsistent (copy-paste drift)
We needed prompts to be versioned artifacts with the same rigor as application code.
Architecture: Central Prompt Registry
Cortex’s prompt system consists of three core components:
1. Centralized Template Storage
All prompts live in a single directory structure:
coordination/prompts/
├── README.md (versioning guidelines)
├── masters/
│   ├── coordinator.md
│   ├── development.md
│   ├── security.md
│   └── inventory.md
├── workers/
│   ├── implementation-worker.md
│   ├── scan-worker.md
│   ├── fix-worker.md
│   ├── test-worker.md
│   ├── review-worker.md
│   ├── pr-worker.md
│   ├── documentation-worker.md
│   ├── analysis-worker.md
│   └── catalog-worker.md
└── orchestrator/
    └── task-orchestrator.md
16 total templates: 4 masters + 9 workers + 1 orchestrator + 2 special-purpose templates (the special-purpose pair isn't shown in the tree above).
2. Semantic Versioning for Prompts
Each prompt follows semantic versioning (v{MAJOR}.{MINOR}.{PATCH}):
# Development Master Agent - System Prompt
**Agent Type**: Master Agent
**Version**: v2.1.0
**Last Updated**: 2025-11-27
**Token Budget**: 30,000 tokens
---
## Changelog
### v2.1.0 (2025-11-27)
- Added RAG context retrieval capabilities
- Enhanced worker spawning with context augmentation
- Improved error handling guidelines
### v2.0.0 (2025-11-15)
- BREAKING: Migrated to execution manager architecture
- Worker spawning now via EM for complex tasks
- Updated token budget allocation
### v1.0.0 (2025-11-01)
- Initial release
Version bump rules mirror software engineering (a quick header check is sketched after this list):
- MAJOR: Breaking changes to agent behavior or interface
- MINOR: New features, backward-compatible capabilities
- PATCH: Bug fixes, clarifications, minor improvements
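To keep those headers honest, it's worth failing fast on malformed version lines. A minimal sketch of such a check, assuming a hypothetical check-prompt-headers.sh helper (not part of the published repo):
# check-prompt-headers.sh - reject templates without a semver Version header
status=0
while IFS= read -r f; do
    if ! grep -Eq '^\*\*Version\*\*: v[0-9]+\.[0-9]+\.[0-9]+' "$f"; then
        echo "Missing or malformed version header: $f"
        status=1
    fi
done < <(find coordination/prompts -name '*.md' ! -name 'README.md')
exit $status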
3. Template Variables
Prompts support runtime variable replacement:
Your worker ID is: {{WORKER_ID}}
Your task: {{TASK_DESCRIPTION}}
Token budget: {{TOKEN_BUDGET}}
Repository: {{REPOSITORY}}
Knowledge base: {{KNOWLEDGE_BASE_PATH}}
Scripts perform variable substitution at spawn time:
PROMPT_FILE="coordination/prompts/masters/development.md"
PROMPT_CONTENT=$(cat "$PROMPT_FILE")
# Replace template variables
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{MASTER_ID\}\}/$MASTER_ID}"
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{SESSION_ID\}\}/$SESSION_ID}"
PROMPT_CONTENT="${PROMPT_CONTENT//\{\{TOKEN_BUDGET\}\}/$TOKEN_BUDGET}"
Worker specifications reference the template path:
{
"worker_id": "worker-impl-001",
"prompt_template": "coordination/prompts/workers/implementation-worker.md",
"prompt_variables": {
"WORKER_ID": "worker-impl-001",
"TASK_ID": "task-500",
"TOKEN_BUDGET": "10000",
"REPOSITORY": "ry-ops/cortex"
}
}
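Rather than hardcoding each replacement, the spawn script can loop over whatever prompt_variables the spec declares. A minimal sketch, assuming an illustrative spec file path (the substitution syntax matches the snippet above):
SPEC_FILE="coordination/specs/worker-impl-001.json"  # illustrative spec location
PROMPT_FILE=$(jq -r '.prompt_template' "$SPEC_FILE")
PROMPT_CONTENT=$(cat "$PROMPT_FILE")
# Substitute every {{KEY}} listed in prompt_variables
while IFS=$'\t' read -r key value; do
    PROMPT_CONTENT="${PROMPT_CONTENT//\{\{$key\}\}/$value}"
done < <(jq -r '.prompt_variables | to_entries[] | [.key, .value] | @tsv' "$SPEC_FILE")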
Version Management: The Prompt Registry
The prompt-manager.sh library provides version control APIs:
# Register new prompt version
register_prompt_version \
"implementation-worker" \
"v2.1.0" \
"$PROMPT_CONTENT" \
"Added RAG context retrieval"
# Get production version
production_version=$(get_production_version "implementation-worker")
# Activate a specific version
activate_prompt_version "implementation-worker-v2.1.0"
# Get prompt with A/B testing
prompt=$(get_prompt "scan-worker" --ab-test)
The registry tracks metadata in coordination/prompt-versions/registry.json:
{
"version": "1.0.0",
"prompts": {
"implementation-worker": [
{
"version_id": "v2.1.0",
"description": "Added RAG context retrieval",
"file_path": "coordination/prompt-versions/implementation-worker/v2.1.0.md",
"is_control": false,
"created_at": "2025-11-27T10:00:00Z",
"status": "active",
"metrics": {
"total_uses": 0,
"successes": 0,
"failures": 0,
"success_rate": 0
}
}
]
}
}
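The read path isn't shown above, but get_production_version can be a small jq query over that registry. A sketch, assuming "production" means the most recently created entry whose status is active:
get_production_version() {
    local prompt_type="$1"
    jq -r --arg type "$prompt_type" '
        (.prompts[$type] // [])
        | map(select(.status == "active"))
        | sort_by(.created_at)
        | last
        | .version_id // empty
    ' coordination/prompt-versions/registry.json
}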
A/B Testing Infrastructure
The real power comes from testing prompt variants in production.
Creating an A/B Test
# Test two prompt versions for security scanning
# Control: v2.0.0, variant: v2.1.0, 30% of traffic to the variant
test_id=$(start_ab_test \
    "scan-worker" \
    "v2.0.0" \
    "v2.1.0" \
    30 \
    "RAG context experiment")
# Output: ab-1732723200-a4f2c8
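For context, the traffic-splitting function in the next section reads an A/B config file shaped roughly like this. The prompt_type, status, and traffic_split_percent fields appear in the real jq queries; the remaining field names are illustrative assumptions:
{
  "active_tests": {
    "ab-1732723200-a4f2c8": {
      "prompt_type": "scan-worker",
      "control_version": "v2.0.0",
      "variant_version": "v2.1.0",
      "traffic_split_percent": 30,
      "status": "running",
      "description": "RAG context experiment"
    }
  }
}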
Traffic Splitting
Cortex uses deterministic hashing for traffic assignment:
select_ab_version() {
    local prompt_type="$1"
    local task_id="$2"
    # Find the active test for this prompt type
    local active_test=$(jq -r --arg type "$prompt_type" '
        .active_tests | to_entries[] |
        select(.value.prompt_type == $type and .value.status == "running") |
        .key
    ' "$PROMPT_AB_CONFIG_FILE" | head -1)
    # Pull control, variant, and split for that test
    # (field names as in the config sketch above)
    local test_config=$(jq -c --arg id "$active_test" '.active_tests[$id]' "$PROMPT_AB_CONFIG_FILE")
    local control=$(echo "$test_config" | jq -r '.control_version')
    local variant=$(echo "$test_config" | jq -r '.variant_version')
    local traffic_split=$(echo "$test_config" | jq -r '.traffic_split_percent')
    # Deterministic assignment: hash the task ID into a 0-99 bucket
    # so the same task always lands in the same group
    local bucket=$(( $(printf '%s' "$task_id" | cksum | cut -d' ' -f1) % 100 ))
    if [ "$bucket" -lt "$traffic_split" ]; then
        selected_version="$variant"
        selection_group="variant"
    else
        selected_version="$control"
        selection_group="control"
    fi
    # Return version and metadata
    jq -nc \
        --arg version "$selected_version" \
        --arg test_id "$active_test" \
        --arg group "$selection_group" \
        '{version_id: $version, ab_test_id: $test_id, test_group: $group}'
}
Same task ID always gets the same variant (reproducibility).
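A usage sketch: the spawn script passes the task ID along, so retries of the same task see the same prompt version:
# task-500 always hashes to the same bucket
assignment=$(select_ab_version "scan-worker" "task-500")
version_id=$(echo "$assignment" | jq -r '.version_id')
test_group=$(echo "$assignment" | jq -r '.test_group')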
Recording Outcomes
After each worker completes, we record success/failure:
record_prompt_outcome \
"$version_id" \
"success" \
"$prompt_type" \
"$ab_test_id" \
"$test_group"
Outcomes append to coordination/prompt-versions/outcomes.jsonl:
{"version_id":"v2.1.0","outcome":"success","prompt_type":"scan-worker","ab_test_id":"ab-1732723200-a4f2c8","test_group":"variant","recorded_at":"2025-11-27T15:30:00Z"}
{"version_id":"v2.0.0","outcome":"success","prompt_type":"scan-worker","ab_test_id":"ab-1732723200-a4f2c8","test_group":"control","recorded_at":"2025-11-27T15:31:00Z"}
Analyzing Results
After collecting 30+ samples per variant:
analyze_ab_test "$test_id"
Output:
{
"test_id": "ab-1732723200-a4f2c8",
"control": {
"uses": 45,
"successes": 41,
"success_rate": 0.9111
},
"variant": {
"uses": 38,
"successes": 36,
"success_rate": 0.9474
},
"improvement_percent": 3.98,
"winner": "variant",
"statistically_significant": true
}
Variant wins! 94.7% vs 91.1% success rate.
Auto-Promotion
With auto_promote: true in settings, winners activate automatically:
end_ab_test "$test_id" true
# Promotes variant to production
# Updates registry status
# Deprecates losing version
Real-World Example: Security Scanner Upgrade
In November 2025, we upgraded the security scan-worker prompt to include CVE context retrieval.
Hypothesis: Adding CVE database lookups improves vulnerability detection accuracy.
Implementation:
<!-- v2.0.0: Control -->
Run `npm audit` and report HIGH/CRITICAL vulnerabilities.
<!-- v2.1.0: Variant -->
Run `npm audit` and report HIGH/CRITICAL vulnerabilities.
For each CVE, retrieve detailed information from NIST NVD:
- CVSS score breakdown
- Attack vector details
- Exploitation likelihood
- Recommended remediation
Test Configuration:
- Traffic split: 70% control, 30% variant (conservative)
- Minimum samples: 30 per variant
- Success criteria: Scan completes without errors
- Quality metric: Number of actionable findings
Results after 72 hours:
| Metric | Control v2.0.0 | Variant v2.1.0 | Change |
|---|---|---|---|
| Success rate | 89.2% | 91.5% | +2.3 pts |
| Avg findings | 4.1 | 5.7 | +39% |
| False positives | 1.2 | 0.8 | -33% |
| Avg duration | 42s | 58s | +38% |
Decision: Promote v2.1.0 despite slower execution—quality improvement outweighs speed cost.
Rollback Capability
When v2.2.0 of the fix-worker caused workers to hang, rollback was instant:
# Identify problem
head -10 coordination/prompts/workers/fix-worker.md | grep Version
# Version: v2.2.0
# Check git history
git log --oneline coordination/prompts/workers/fix-worker.md
# Revert to v2.1.5
git checkout abc123 -- coordination/prompts/workers/fix-worker.md
# Update changelog
cat >> coordination/prompts/workers/fix-worker.md <<EOF
### v2.2.1 (2025-11-28)
- ROLLBACK: Reverted v2.2.0 due to worker timeouts
- Restored: v2.1.5 behavior (stable)
EOF
# Commit and redeploy
git add coordination/prompts/workers/fix-worker.md
git commit -m "rollback(prompts): fix-worker v2.2.1 - revert timeout issue"
Downtime: 4 minutes from detection to fix.
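Because every version also lives in the prompt registry, the same rollback can be done without git archaeology, assuming v2.1.5 was registered earlier:
# Re-activate the last known-good version in the registry
activate_prompt_version "fix-worker-v2.1.5"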
Best Practices
1. Single Responsibility Principle
Each prompt defines ONE agent role:
# Good
You are a scan-worker specialized in security vulnerability detection.
# Bad
You are a scan-worker that also fixes issues and writes documentation.
2. Explicit Success Criteria
Define measurable outcomes:
**Success Criteria:**
- All files scanned without errors
- Vulnerabilities reported in JSON format
- CVSS scores included for each finding
- Exit status 0 on completion
3. Bounded Scope
Clearly define limits:
**What you DO:**
- Scan dependencies for vulnerabilities
- Generate structured vulnerability report
- Suggest remediation steps
**What you DON'T do:**
- Apply fixes (that's fix-worker's job)
- Make deployment decisions (that's master's job)
- Modify code outside scan scope
4. Comprehensive Changelog
Track every change with context:
### v2.3.0 (2025-12-01)
- Added: Support for Python package scanning via `pip-audit`
- Changed: CVSS threshold from 7.0 to 5.0 (medium severity)
- Fixed: Parsing error with npm audit JSON output
- Deprecated: Legacy XML report format
5. Test Before Deploying
Use draft status for new versions:
# Create as draft
register_prompt_version \
"scan-worker" \
"v2.3.0" \
"$PROMPT_CONTENT" \
"Added Python scanning" \
false # Not control
# Test with specific version
get_prompt "scan-worker" "v2.3.0"
# Activate after validation
activate_prompt_version "scan-worker-v2.3.0"
Implementation Guide
Want to build a similar system? Here’s the blueprint:
Step 1: Centralize Prompts
Move all prompts to a single directory:
mkdir -p prompts/{masters,workers,special}
# Extract prompts from scripts
grep -r "SYSTEM_PROMPT=" scripts/ > prompts-to-migrate.txt
Step 2: Add Version Headers
Standardize format:
# {Agent Name} - System Prompt
**Agent Type**: Master | Worker
**Version**: v1.0.0
**Last Updated**: YYYY-MM-DD
**Token Budget**: N tokens
---
## Changelog
### v1.0.0 (YYYY-MM-DD)
- Initial release
Step 3: Build Prompt Registry
Create prompt-manager.sh:
register_prompt_version() {
    local prompt_type="$1"
    local version_id="$2"
    local prompt_content="$3"
    local description="$4"
    # Save prompt to its own versioned file
    mkdir -p "prompts/${prompt_type}"
    echo "$prompt_content" > "prompts/${prompt_type}/${version_id}.md"
    # Append the new version to registry.json
    jq --arg type "$prompt_type" \
       --arg version "$version_id" \
       --arg desc "$description" \
       '.prompts[$type] += [{version_id: $version, description: $desc, status: "active"}]' \
       registry.json > registry.tmp && mv registry.tmp registry.json
}
Step 4: Implement A/B Testing
Add traffic splitting:
select_ab_version() {
    local prompt_type="$1"
    local traffic_split=50
    # Simplified random split; for reproducibility, replace $RANDOM with a
    # hash of the task ID (as in the production select_ab_version above)
    local random=$((RANDOM % 100))
    if [ "$random" -lt "$traffic_split" ]; then
        echo "variant"
    else
        echo "control"
    fi
}
Step 5: Track Outcomes
Record every execution:
record_outcome() {
local version_id="$1"
local outcome="$2"
echo "{\"version\":\"$version_id\",\"outcome\":\"$outcome\",\"timestamp\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" \
>> outcomes.jsonl
}
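Once outcomes accumulate, per-version success rates can be summarized straight from the JSONL file. A minimal jq sketch over the field names written by record_outcome above:
jq -s '
  group_by(.version)
  | map({
      version: .[0].version,
      uses: length,
      successes: (map(select(.outcome == "success")) | length)
    })
  | map(. + {success_rate: (if .uses > 0 then (.successes / .uses) else 0 end)})
' outcomes.jsonl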
Benefits Realized
After 3 months of production use:
Consistency
- 100% template compliance (automated validation)
- Zero variable naming drift (centralized definitions)
- Uniform changelog format (enforced by CI)
Testability
- 12 A/B tests run (8 promoted, 3 rolled back, 1 inconclusive)
- Average test duration: 48 hours
- Confidence threshold: 95% for promotion
Rollback Speed
- Average rollback time: 6 minutes (detection to fix)
- Longest rollback: 18 minutes (required emergency hotfix)
- Rollbacks performed: 5 in 3 months
Deployment Velocity
- Prompt changes per week: 2.3 (up from 0.4 before versioning)
- Failed deployments: 1.2% (down from 8.7%)
- Time to production: 2 hours (down from 2 days)
Key Takeaways
- Treat prompts like infrastructure-as-code—version control, testing, and rollbacks are essential.
- Semantic versioning works for prompts—MAJOR/MINOR/PATCH semantics map perfectly to prompt changes.
- A/B testing de-risks improvements—gradual rollouts catch issues before full deployment.
- Template variables reduce duplication—centralized definitions prevent copy-paste drift.
- Changelogs are documentation—future you will thank past you for detailed change logs.
- Automation enables velocity—manual prompt management doesn’t scale beyond 5 agents.
The full Cortex codebase is on GitHub at ry-ops/cortex. Check coordination/prompts/ for real production examples.
What’s your prompt versioning strategy? Reach out on Twitter/X or open an issue if you’re building something similar.
Next in series: Part 14: Multi-Agent Coordination Patterns - How 4 masters and 9 workers communicate via JSON specs.