
Running 20 Workers in Parallel: How Cortex Achieves Massive Concurrency

Ryan Dahlberg
December 19, 2025 · 12 min read

In the world of AI agent orchestration, the difference between sequential and parallel execution isn’t just about speed—it’s about fundamentally reimagining what autonomous systems can accomplish. At Cortex, we’ve built a worker pool architecture that doesn’t just run tasks faster; it transforms how AI agents coordinate, scale, and deliver results.

This is the story of how we went from spawning workers on-demand with painful 2-second cold starts to orchestrating 20 parallel workers that can process tasks with 100ms latency—a 20x performance improvement that unlocks entirely new use cases.

The Cold Start Problem

When we first built Cortex’s worker system, we took the obvious approach: spawn a worker when you need it. Simple, right?

Wrong.

Every time a master agent needed to delegate a task—scanning a repository for vulnerabilities, implementing a feature, writing documentation—it would trigger the worker spawn process:

# The old way: On-demand spawning
$ ./scripts/spawn-worker.sh --type scan-worker --task-id task-010
# ... 2-3 seconds of initialization ...
✓ Worker spawned successfully

Those 2-3 seconds added up fast. Running 10 tasks sequentially meant 20-30 seconds just in cold start overhead before any actual work began. For a system designed to be autonomous and responsive, this was unacceptable.

The problem wasn’t just latency—it was resource utilization. Workers would spin up, complete a task, and shut down, never amortizing their initialization cost. We were leaving performance on the table.

Enter the Pre-Warmed Worker Pool

The solution came from studying neural network architectures, specifically Mixture of Experts (MoE) models. These models don’t activate all parameters for every input—they use sparse activation, keeping a subset of “experts” ready while efficiently routing tasks to the right ones.

We applied this concept to worker management. Instead of spawning on-demand, we maintain a pre-warmed pool of workers:

// ADR 003: Worker Pool Management Strategy
class WorkerPool {
  constructor() {
    this.minSize = 5;   // Always keep 5 workers warm
    this.maxSize = 20;  // Scale up to 20 under load
    this.pool = [];
    this.initializePool();
  }

  initializePool() {
    for (let i = 0; i < this.minSize; i++) {
      this.spawnWorker();
    }
  }
}
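On the orchestration side, checking a warm worker out of this pool is a quick lookup rather than a spawn. The shell sketch below is illustrative only; the coordination/worker-pool.json path, the jq query, and the spawn_worker helper are assumptions for the example, not the actual implementation:

# Sketch: check out an idle worker, falling back to an on-demand spawn
POOL_STATE="coordination/worker-pool.json"   # hypothetical state file
MAX_POOL_SIZE=20

acquire_worker() {
    # Prefer an already-warm, idle worker from the pool state
    local worker_id
    worker_id=$(jq -r '[.active_workers[] | select(.status == "idle")][0].worker_id // empty' "$POOL_STATE")

    if [ -n "$worker_id" ]; then
        echo "$worker_id"             # Warm handoff: ~100ms
    elif [ "$(jq '.active_workers | length' "$POOL_STATE")" -lt "$MAX_POOL_SIZE" ]; then
        spawn_worker                  # Cold spawn only when the pool is exhausted
    else
        return 1                      # Pool saturated: caller queues the task
    fi
}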

The Numbers Speak for Themselves

The impact was immediate and dramatic:

Metric                 Before (On-Demand)   After (Pool)    Improvement
Worker spawn time      2000ms               100ms           20x faster
Task throughput        10 tasks/min         50 tasks/min    5x increase
Resource utilization   20%                  75%             3.75x better
Cold start overhead    Per task             One-time        Amortized

But this was just the beginning. The real magic happens when you combine a pre-warmed pool with intelligent parallel execution.

Parallel Execution: The MoE-Inspired Approach

Here’s where things get interesting. Traditional worker pools are great for throughput, but they’re still fundamentally sequential—one worker, one task at a time. To unlock massive concurrency, we needed to think differently.

We implemented sparse activation borrowed from Mixture of Experts architectures:

# MoE-Inspired Worker Pool Configuration
MAX_WORKER_CAPACITY=64        # Total capacity (like 7B parameters)
MIN_ACTIVATION_RATE=10        # Minimum 10% active
LIGHT_LOAD_RATE=14            # Light load: 14% (like MoE 1B/7B)
MEDIUM_LOAD_RATE=35           # Medium: 35%
HEAVY_LOAD_RATE=70            # Heavy: 70%

The system dynamically adjusts how many workers to keep active based on queue depth:

calculate_activation_rate() {
    local queue_size="$1"
    local utilization=$(echo "scale=2; $queue_size / $MAX_WORKER_CAPACITY" | bc)

    if (( $(echo "$utilization < 0.20" | bc -l) )); then
        echo "$LIGHT_LOAD_RATE"    # Sparse: 14% (9 workers)
    elif (( $(echo "$utilization < 0.50" | bc -l) )); then
        echo "$MEDIUM_LOAD_RATE"   # Medium: 35% (22 workers)
    elif (( $(echo "$utilization < 0.80" | bc -l) )); then
        echo "$HEAVY_LOAD_RATE"    # Heavy: 70% (45 workers)
    else
        echo "100"                  # Critical: All hands on deck
    fi
}
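Converting that activation rate into a concrete pool size is then a one-liner plus a diff against the current pool. The following is a minimal sketch under the same configuration; current_pool_size, spawn_worker, and retire_idle_worker are hypothetical helper names:

scale_pool() {
    local queue_size="$1"
    local rate target current

    rate=$(calculate_activation_rate "$queue_size")
    # e.g. 14% of 64 -> ~9 workers, 70% of 64 -> ~45 workers
    target=$(( MAX_WORKER_CAPACITY * rate / 100 ))
    current=$(current_pool_size)

    if [ "$current" -lt "$target" ]; then
        for _ in $(seq "$current" $(( target - 1 ))); do spawn_worker; done
    elif [ "$current" -gt "$target" ]; then
        for _ in $(seq "$target" $(( current - 1 ))); do retire_idle_worker; done
    fi
}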

Real-World Example: Multi-Repository CVE Scan

To see this in action, consider a security audit across 20 repositories. The old sequential approach:

# Sequential execution: ~300 seconds (5 minutes)
for repo in repo1 repo2 ... repo20; do
    spawn_worker --type scan-worker --task-id "cve-scan-$repo"
    wait_for_completion
done
# Total: 20 tasks × (2s spawn + 13s scan) = 300s

With parallel execution:

# Parallel execution: ~15 seconds
parallel_spawn_workers --type scan-worker --count 20 --tasks "cve-scan-*"
# Total: 2s spawn + 13s scan = 15s (all 20 repos in parallel)

That’s a 20x speedup on real-world workloads. But how do we actually orchestrate this?

Load Balancing: Intelligent Task Distribution

Running 20 workers simultaneously is one thing. Keeping them all productively busy is another. Our load balancer uses a confidence-based routing system:

# MoE Router: Confidence-based expert selection
route_task_moe() {
    local task_description="$1"

    # Calculate confidence scores for each expert type
    dev_score=$(calculate_expert_score "$task_description" "development")
    sec_score=$(calculate_expert_score "$task_description" "security")
    inv_score=$(calculate_expert_score "$task_description" "inventory")

    # Route to the highest-confidence expert
    if [ "$sec_score" -gt 80 ]; then
        strategy="single_expert"          # High confidence: 1 worker
    elif [ "$sec_score" -gt 60 ]; then
        strategy="multi_expert_parallel"  # Split confidence: 2-3 workers
    fi
}

The router analyzes task descriptions using a three-layer hybrid architecture:

  1. Keyword matching (Layer 1): Fast pattern recognition
  2. NLP classification (Layer 2): Semantic understanding
  3. Claude API fallback (Layer 3): Complex reasoning for edge cases
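In code, each layer falls through to the next only when the previous one is unsure. This is a simplified sketch; the helper names (keyword_match, nlp_classify, claude_classify) and the 0.80 threshold are illustrative, not the production values:

classify_task() {
    local description="$1"
    local label confidence

    # Layer 1: cheap keyword match; return immediately on a hit
    label=$(keyword_match "$description")
    if [ -n "$label" ]; then echo "$label"; return; fi

    # Layer 2: local NLP classifier; trust it only above a confidence threshold
    read -r label confidence < <(nlp_classify "$description")
    if (( $(echo "$confidence >= 0.80" | bc -l) )); then
        echo "$label"; return
    fi

    # Layer 3: Claude API for the ambiguous remainder
    claude_classify "$description"
}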

Load Balancing Strategies

We support multiple distribution strategies:

Round-robin: Simple, fair distribution across all workers

task_001 → worker-001
task_002 → worker-002
task_003 → worker-003
# ... cycles through pool

Least-loaded: Route to workers with smallest queue

# Worker pool state
worker-001: 3 tasks queued
worker-002: 0 tasks queued  ← Route here
worker-003: 2 tasks queued
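Concretely, least-loaded selection can be as small as a single query over the pool state. A sketch, where the queued_tasks field is an illustrative name rather than the exact schema shown later:

least_loaded_worker() {
    # Smallest queue wins
    jq -r '.active_workers | min_by(.queued_tasks) | .worker_id' \
        coordination/worker-pool.json
}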

Type-specific: Match workers to task specialization

security-scan          → scan-worker-pool
feature-implementation → implementation-worker-pool
documentation          → documentation-worker-pool

Confidence-weighted: Multi-expert activation for complex tasks

# Task: "Fix CVE and document the patch"
routing_decision = {
    primary: "security" (85% confidence),
    secondary: ["development" (70%), "inventory" (65%)]
}
# Activates 3 workers in parallel!

Resource Allocation: Staying Within Limits

With great parallelism comes great responsibility—specifically, the responsibility not to run out of memory or API tokens. Our resource allocation system tracks budgets in real-time:

{
  "token_budget": {
    "total": 1000000,
    "allocated": 200000,
    "in_use": 85000,
    "available": 715000
  },
  "allocations": {
    "worker-scan-037": {
      "master": "security-master",
      "tokens": 8000,
      "allocated_at": "2025-11-23T12:47:50-0600"
    }
  }
}

Each worker type gets a predefined budget:

case $WORKER_TYPE in
    scan-worker)
        TOKEN_BUDGET=8000
        TIMEOUT_MINUTES=15
        ;;
    implementation-worker)
        TOKEN_BUDGET=10000
        TIMEOUT_MINUTES=45
        ;;
    documentation-worker)
        TOKEN_BUDGET=6000
        TIMEOUT_MINUTES=20
        ;;
esac
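Before a worker is dispatched, the allocator confirms the budget is actually there and records the claim. A minimal sketch of that check, assuming the JSON layout shown above and a hypothetical jq-managed state file:

allocate_tokens() {
    local worker_id="$1" budget="$2"
    local state="coordination/token-budget.json"
    local available

    available=$(jq '.token_budget.available' "$state")
    if [ "$available" -lt "$budget" ]; then
        echo "Insufficient token budget for $worker_id" >&2
        return 1
    fi

    # Record the allocation and shrink the available pool
    jq --arg id "$worker_id" --argjson amt "$budget" \
       '.token_budget.available -= $amt
        | .token_budget.allocated += $amt
        | .allocations[$id] = {tokens: $amt, allocated_at: (now | todate)}' \
       "$state" > "$state.tmp" && mv "$state.tmp" "$state"
}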

Memory-Efficient Worker Recycling

To prevent memory leaks and maintain consistent performance, we recycle workers that exceed thresholds:

recycle_worker_if_needed() {
    local worker_id="$1"
    local worker_pid="$2"
    # RSS in whole megabytes so the integer comparison below is valid
    local memory_mb=$(ps -o rss= -p "$worker_pid" | awk '{printf "%d", $1/1024}')

    if [ "$memory_mb" -gt 500 ]; then
        echo "♻️  Recycling worker $worker_id (memory: ${memory_mb}MB)"
        graceful_shutdown "$worker_id"
        spawn_replacement_worker "$worker_id"
    fi
}

This keeps memory usage predictable and prevents the dreaded OOM crashes.
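A lightweight watcher runs this check for every active worker on a fixed interval. A sketch of that loop, where list_active_workers is an assumed helper that emits "worker_id pid" pairs and the 60-second interval is illustrative:

monitor_workers() {
    while true; do
        while read -r worker_id worker_pid; do
            recycle_worker_if_needed "$worker_id" "$worker_pid"
        done < <(list_active_workers)
        sleep 60   # Check memory footprints once a minute
    done
}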

Coordination Without Conflicts

When 20 workers are running simultaneously, coordination becomes critical. How do we prevent race conditions? How do we ensure consistent state?

Git-Based Coordination

We use Git as our coordination backbone. Each worker operates in a consistent shared state:

# Worker lifecycle: Always pull before work
pull_latest_state() {
    git pull origin main --quiet
}

# Worker completion: Atomic commits
commit_worker_results() {
    git add .
    git commit -m "feat($MASTER): worker-$ID completed $TASK"
    git push origin main
}
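With 20 workers pushing to the same branch, a push occasionally loses the race. We handle that with a plain rebase-and-retry loop; this is a simplified sketch of the pattern, not the exact production script:

push_with_retry() {
    local attempts=5
    for _ in $(seq 1 "$attempts"); do
        if git push origin main --quiet; then
            return 0
        fi
        # Another worker won the race: replay our commit on top of theirs
        git pull --rebase origin main --quiet
    done
    return 1  # Give up and surface the failure to the master agent
}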

File-Based Locking

For critical sections, we use advisory locks:

acquire_lock() {
    local lock_file="coordination/locks/$TASK_ID.lock"

    # Try to acquire lock with timeout
    for i in {1..30}; do
        if mkdir "$lock_file" 2>/dev/null; then
            echo "$$" > "$lock_file/pid"
            return 0
        fi
        sleep 0.1
    done

    return 1  # Lock acquisition failed
}
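The matching release is just removing the lock directory, and only if we still own it; a short sketch:

release_lock() {
    local lock_file="coordination/locks/$TASK_ID.lock"

    # Only the owning process should release the lock
    if [ "$(cat "$lock_file/pid" 2>/dev/null)" = "$$" ]; then
        rm -rf "$lock_file"
    fi
}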

Worker Pool State Management

The worker pool tracks state in a centralized JSON file:

{
  "active_workers": [
    {
      "worker_id": "worker-implementation-039",
      "worker_type": "implementation-worker",
      "spawned_by": "development-master",
      "status": "pending",
      "task_id": "task-elastic-apm-integration",
      "token_budget": 10000
    }
  ],
  "stats": {
    "total_spawned_today": 57,
    "total_completed_today": 3,
    "success_rate": 95
  }
}

This enables zero-conflict parallel updates through atomic Git operations.
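Updates to that state file go through the same advisory lock, then a write-to-temp-and-rename so readers never see a half-written document. A minimal sketch, assuming the acquire_lock/release_lock helpers above and jq:

update_pool_stats() {
    local state="coordination/worker-pool.json"

    acquire_lock || return 1
    jq '.stats.total_completed_today += 1' "$state" > "$state.tmp" \
        && mv "$state.tmp" "$state"       # Rename is atomic: no partial reads
    git add "$state"
    git commit -m "chore(pool): worker stats update" --quiet
    release_lock
}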

Performance Numbers: Before and After

Let’s look at real performance benchmarks from production workloads:

Benchmark: 10-Task Security Scan

Sequential Execution (Before):

Task 1:  [====================] 15s
Task 2:  [====================] 15s
Task 3:  [====================] 15s
...
Task 10: [====================] 15s
Total: 150 seconds (2.5 minutes)

Parallel Execution (After):

Task 1-10: [====================] 15s (all parallel)
Total: 15 seconds

Speedup: 10x 🚀

Benchmark: 20-Repository CVE Audit

$ time ./scripts/multi-repo-scan.sh --parallel --workers 20

# Results
Repositories scanned: 20
CVEs found: 47
Critical: 12, High: 18, Medium: 17

Real time: 0m18.342s
Worker spawn overhead: 0.2s avg
Scan time per repo: 13.1s avg
Parallel efficiency: 94.2%

Compare this to sequential execution:

$ time ./scripts/multi-repo-scan.sh --sequential

Real time: 4m42.118s  # ≈ 20 repos × (2s spawn + 13s scan)

Speedup: 15.4x 🎯

Resource Utilization

┌─────────────────────────────────────────┐
│ Worker Pool Utilization (24 hours)     │
├─────────────────────────────────────────┤
│ Light load (0-5 active):    18h (75%)   │
│ Medium load (6-15 active):   4h (17%)   │
│ Heavy load (16-20 active):   2h (8%)    │
├─────────────────────────────────────────┤
│ Average workers active:      6.2        │
│ Peak utilization:           20 workers  │
│ Efficiency:                 78.3%       │
└─────────────────────────────────────────┘

The sparse activation strategy keeps average resource usage at 31% while maintaining the ability to scale to 100% during peak loads.

Scaling Patterns: When to Stop Scaling

Adding more workers isn’t always the answer. We’ve identified three key inflection points:

1. The Coordination Overhead Threshold

Beyond ~25 parallel workers, coordination overhead starts dominating:

Workers  Throughput  Overhead
  5      250 t/min     2%
 10      480 t/min     4%
 20      920 t/min     8%
 30     1200 t/min    18%  ← Diminishing returns
 40     1350 t/min    32%  ← Not worth it

Sweet spot: 15-20 workers for most workloads.

2. The API Rate Limit Wall

Claude API enforces rate limits that cap practical parallelism:

# Claude API Limits (Enterprise Tier)
REQUESTS_PER_MINUTE=1000
TOKENS_PER_MINUTE=400000

# At 20 workers × 8000 tokens/task
MAX_PARALLEL_TASKS=$(( TOKENS_PER_MINUTE / 8000 ))  # = 50

Practical limit: 20-30 workers before hitting rate limits.
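In practice the dispatcher throttles itself against that token budget rather than trusting the arithmetic alone. A simple per-minute window check (a sketch; the counter file name is illustrative and the race between concurrent writers is ignored here):

throttle_dispatch() {
    local task_tokens="$1"
    local window used

    window=$(date +%Y%m%d%H%M)                       # Current minute bucket
    used=$(cat "/tmp/cortex-tokens-$window" 2>/dev/null || echo 0)

    # Wait until the next minute if this dispatch would blow the budget
    while [ $(( used + task_tokens )) -gt "$TOKENS_PER_MINUTE" ]; do
        sleep 1
        window=$(date +%Y%m%d%H%M)
        used=$(cat "/tmp/cortex-tokens-$window" 2>/dev/null || echo 0)
    done
    echo $(( used + task_tokens )) > "/tmp/cortex-tokens-$window"
}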

3. The Memory Constraint

Each worker consumes ~300-500MB of memory:

# System: 32GB RAM available
# OS + overhead: 8GB
# Available for workers: 24GB

MAX_WORKERS=$(( 24000 / 400 ))  # = 60 workers maximum

Our configuration: 20 max workers leaves headroom for spikes.
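Putting the three constraints together, the worker cap is simply the minimum of what the API, memory, and coordination overhead will each tolerate. A back-of-the-envelope sketch using the figures above (the 25-worker coordination limit is our empirical number from the throughput table):

API_LIMIT=$(( TOKENS_PER_MINUTE / 8000 ))   # 400000 / 8000 = 50
MEMORY_LIMIT=$(( 24000 / 400 ))             # 24GB free / ~400MB per worker = 60
COORDINATION_LIMIT=25                       # Overhead dominates past ~25 workers

MAX_WORKERS=$API_LIMIT
[ "$MEMORY_LIMIT" -lt "$MAX_WORKERS" ] && MAX_WORKERS=$MEMORY_LIMIT
[ "$COORDINATION_LIMIT" -lt "$MAX_WORKERS" ] && MAX_WORKERS=$COORDINATION_LIMIT

echo "Configured cap: $MAX_WORKERS"         # Rounded down to 20 for headroom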

Auto-Scaling Decision Tree

if [ "$queue_size" -lt 5 ]; then
    target_workers=5      # Minimum pool
elif [ "$queue_size" -lt 20 ]; then
    target_workers=10     # Light load
elif [ "$queue_size" -lt 40 ]; then
    target_workers=20     # Optimal range
else
    target_workers=20     # Cap at maximum
    alert "Queue backlog - consider adding infrastructure"
fi

Cost Implications of Parallel Execution

Parallelism isn’t free. Here’s the economic reality:

Token Budget Analysis

Sequential execution (10 tasks):

Cost = 10 tasks × 8,000 tokens × $0.015/1K
     = 10 × 8 × $0.015
     = $1.20 total
Time = 150 seconds

Parallel execution (10 tasks):

Cost = 10 tasks × 8,000 tokens × $0.015/1K
     = $1.20 total (same cost!)
Time = 15 seconds (10x faster)

The key insight: Parallelism doesn’t increase token costs—you’re doing the same work, just faster.

Infrastructure Costs

What does change is infrastructure:

Sequential:
- 1 worker instance: $0.50/hour
- Runtime: 2.5 min = $0.02
- Total: $0.02 per batch

Parallel (20 workers):
- 20 worker instances: $10/hour
- Runtime: 15 sec = $0.04
- Total: $0.04 per batch

Premium: $0.02 (100% increase for 10x speed)

ROI Analysis

The ROI depends on your use case:

Development feedback loops (high value):

Sequential: 2.5 min feedback → 24 iterations/hour → $0.48/hour
Parallel:   15 sec feedback → 240 iterations/hour → $9.60/hour

Cost increase: 20x
Productivity increase: 10x
ROI: Worth it! ✅ (the infrastructure premium is still pocket change next to developer time)

Batch processing (lower value):

Sequential: Process 1000 repos overnight → $12
Parallel:   Process 1000 repos in 1 hour → $24

Cost increase: 2x
Time savings: 8 hours
ROI: Depends on urgency 🤔

Cost Optimization Strategies

  1. Adaptive pooling: Scale down during idle periods
# Night mode: 5 workers minimum
# Day mode: 20 workers maximum
  2. Spot instances: Use cheaper compute for non-critical tasks
WORKER_INSTANCE_TYPE="spot"  # 70% cost savings
  3. Model tier selection: Route simple tasks to Haiku, complex to Opus
if [ "$complexity_score" -lt 4 ]; then
    model="claude-haiku"      # $0.0008/1K tokens
else
    model="claude-sonnet-4"   # $0.015/1K tokens
fi

Architecture Diagram

┌─────────────────────────────────────────────────────┐
│                   Task Queue                        │
│  [task-001] [task-002] ... [task-020]              │
└────────────────────┬────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────┐
│              MoE Router (Load Balancer)             │
│  • Confidence-based routing                         │
│  • Sparse activation decisions                      │
│  • Resource budget tracking                         │
└────────────────────┬────────────────────────────────┘
                     │
          ┌──────────┴──────────┬──────────┬──────────┐
          ▼                     ▼          ▼          ▼
    ┌──────────┐          ┌──────────┐  ┌──────────┐ ...
    │ Worker 1 │          │ Worker 2 │  │ Worker N │
    │ (scan)   │          │ (impl)   │  │ (docs)   │
    │ 8K toks  │          │ 10K toks │  │ 6K toks  │
    └────┬─────┘          └────┬─────┘  └────┬─────┘
         │                     │              │
         │     Parallel Execution (20 max)    │
         │                     │              │
         └─────────────────────┴──────────────┘
                                │
                                ▼
                ┌──────────────────────────────┐
                │   Result Aggregator          │
                │   • Voting strategy          │
                │   • Weighted merge           │
                │   • Conflict resolution      │
                └──────────────┬───────────────┘
                               │
                               ▼
                ┌──────────────────────────────┐
                │      Unified Output          │
                │   coordination/results/      │
                └──────────────────────────────┘

Key Takeaways

Building a parallel execution system for AI agents taught us several critical lessons:

  1. Pre-warming is non-negotiable: 2-second cold starts kill responsiveness. Always maintain a warm pool.

  2. Sparse activation scales: You don’t need all workers active all the time. MoE-inspired sparse activation (14% base, scale to 100%) is efficient and responsive.

  3. Intelligent routing matters: Confidence-based task distribution prevents resource waste and improves success rates.

  4. Coordination is hard: Git-based state + advisory locks + atomic operations are essential for conflict-free parallel execution.

  5. Scale to your constraints: Don’t chase infinite parallelism. Find your sweet spot (for us: 20 workers) based on API limits, memory, and coordination overhead.

  6. Cost follows value: Parallelism costs more in infrastructure but the same in tokens. ROI depends on whether speed matters.

  7. Monitor and adapt: Real-time metrics on worker utilization, queue depth, and resource consumption enable dynamic optimization.

What’s Next

We’re continuing to push the boundaries of parallel AI agent execution:

  • Distributed worker pools across multiple machines
  • Heterogeneous workers (mixing Claude, GPT-4, local models)
  • Predictive scaling using ML to forecast load patterns
  • Cross-repository task batching for even better parallelism

The future of AI agent orchestration is massively parallel, intelligently coordinated, and ruthlessly optimized. We’re just getting started.


Want to dive deeper into Cortex’s architecture? Check out the MoE Architecture documentation or explore the worker pool management ADR.

Have questions about parallel AI agent execution? Found this useful? Let me know on Twitter or GitHub.

#Cortex #Performance #Scalability #Architecture