Running 20 Workers in Parallel: How Cortex Achieves Massive Concurrency
In the world of AI agent orchestration, the difference between sequential and parallel execution isn’t just about speed—it’s about fundamentally reimagining what autonomous systems can accomplish. At Cortex, we’ve built a worker pool architecture that doesn’t just run tasks faster; it transforms how AI agents coordinate, scale, and deliver results.
This is the story of how we went from spawning workers on-demand with painful 2-second cold starts to orchestrating 20 parallel workers that can process tasks with 100ms latency—a 20x performance improvement that unlocks entirely new use cases.
The Cold Start Problem
When we first built Cortex’s worker system, we took the obvious approach: spawn a worker when you need it. Simple, right?
Wrong.
Every time a master agent needed to delegate a task—scanning a repository for vulnerabilities, implementing a feature, writing documentation—it would trigger the worker spawn process:
# The old way: On-demand spawning
$ ./scripts/spawn-worker.sh --type scan-worker --task-id task-010
# ... 2-3 seconds of initialization ...
✅ Worker spawned successfully
Those 2-3 seconds added up fast. Running 10 tasks sequentially meant 20-30 seconds just in cold start overhead before any actual work began. For a system designed to be autonomous and responsive, this was unacceptable.
The problem wasn’t just latency—it was resource utilization. Workers would spin up, complete a task, and shut down, never amortizing their initialization cost. We were leaving performance on the table.
Enter the Pre-Warmed Worker Pool
The solution came from studying neural network architectures, specifically Mixture of Experts (MoE) models. These models don’t activate all parameters for every input—they use sparse activation, keeping a subset of “experts” ready while efficiently routing tasks to the right ones.
We applied this concept to worker management. Instead of spawning on-demand, we maintain a pre-warmed pool of workers:
// ADR 003: Worker Pool Management Strategy
class WorkerPool {
  constructor() {
    this.minSize = 5;   // Always keep 5 workers warm
    this.maxSize = 20;  // Scale up to 20 under load
    this.pool = [];
    this.initializePool();
  }

  initializePool() {
    for (let i = 0; i < this.minSize; i++) {
      this.spawnWorker();
    }
  }
}
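The class above elides how a master actually gets a worker out of the pool. Here is a minimal shell sketch of that acquire path, assuming a pool state file at coordination/worker-pool.json, an "idle" status value, and a hypothetical mark_worker_busy helper (none of these are Cortex's exact interface):

# Hypothetical sketch: check a pre-warmed worker out of the pool.
# POOL_STATE, the "idle" status, and mark_worker_busy are illustrative only.
POOL_STATE="coordination/worker-pool.json"

acquire_warm_worker() {
    local task_id="$1"

    # Pick the first idle worker from the shared pool state
    local worker_id
    worker_id=$(jq -r '.active_workers[] | select(.status == "idle") | .worker_id' \
        "$POOL_STATE" | head -n 1)

    if [ -z "$worker_id" ]; then
        # Pool exhausted: fall back to a cold spawn (2-3s penalty)
        ./scripts/spawn-worker.sh --type scan-worker --task-id "$task_id"
        return
    fi

    # ~100ms path: assign the task to an already-running worker
    mark_worker_busy "$worker_id" "$task_id"
    echo "$worker_id"
}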
The Numbers Speak for Themselves
The impact was immediate and dramatic:
| Metric | Before (On-Demand) | After (Pool) | Improvement |
|---|---|---|---|
| Worker spawn time | 2000ms | 100ms | 20x faster |
| Task throughput | 10 tasks/min | 50 tasks/min | 5x increase |
| Resource utilization | 20% | 75% | 3.75x better |
| Cold start overhead | Per task | One-time | Amortized |
But this was just the beginning. The real magic happens when you combine a pre-warmed pool with intelligent parallel execution.
Parallel Execution: The MoE-Inspired Approach
Here’s where things get interesting. Traditional worker pools are great for throughput, but they’re still fundamentally sequential—one worker, one task at a time. To unlock massive concurrency, we needed to think differently.
We implemented sparse activation borrowed from Mixture of Experts architectures:
# MoE-Inspired Worker Pool Configuration
MAX_WORKER_CAPACITY=64 # Total capacity (like 7B parameters)
MIN_ACTIVATION_RATE=10 # Minimum 10% active
LIGHT_LOAD_RATE=14 # Light load: 14% (like MoE 1B/7B)
MEDIUM_LOAD_RATE=35 # Medium: 35%
HEAVY_LOAD_RATE=70 # Heavy: 70%
The system dynamically adjusts how many workers to keep active based on queue depth:
calculate_activation_rate() {
    local queue_size="$1"
    local utilization=$(echo "scale=2; $queue_size / $MAX_WORKER_CAPACITY" | bc)

    if (( $(echo "$utilization < 0.20" | bc -l) )); then
        echo "$LIGHT_LOAD_RATE"    # Sparse: 14% (9 workers)
    elif (( $(echo "$utilization < 0.50" | bc -l) )); then
        echo "$MEDIUM_LOAD_RATE"   # Medium: 35% (22 workers)
    elif (( $(echo "$utilization < 0.80" | bc -l) )); then
        echo "$HEAVY_LOAD_RATE"    # Heavy: 70% (45 workers)
    else
        echo "100"                 # Critical: All hands on deck
    fi
}
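The rate on its own is just a percentage; something still has to turn it into a concrete worker count and hand that to the scaler. A minimal sketch, assuming round-to-nearest and a hard cap at the pool's configured maximum (both details are illustrative choices):

# Sketch: convert the activation rate into a target worker count.
# Round-to-nearest and the 20-worker cap are illustrative, not the exact rules.
MAX_POOL_SIZE=20

target_worker_count() {
    local queue_size="$1"
    local rate
    rate=$(calculate_activation_rate "$queue_size")

    # rate% of total capacity, rounded to the nearest whole worker
    local target=$(( (MAX_WORKER_CAPACITY * rate + 50) / 100 ))
    if [ "$target" -gt "$MAX_POOL_SIZE" ]; then
        target=$MAX_POOL_SIZE
    fi
    echo "$target"
}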
Real-World Example: Multi-Repository CVE Scan
To see this in action, consider a security audit across 20 repositories. The old sequential approach:
# Sequential execution: ~300 seconds (5 minutes)
for repo in repo1 repo2 ... repo20; do
    spawn_worker --type scan-worker --task-id "cve-scan-$repo"
    wait_for_completion
done
# Total: 20 tasks × (2s spawn + 13s scan) = 300s
With parallel execution:
# Parallel execution: ~15 seconds
parallel_spawn_workers --type scan-worker --count 20 --tasks "cve-scan-*"
# Total: 2s spawn + 13s scan = 15s (all 20 tasks run at once)
That’s a 20x speedup on real-world workloads. But how do we actually orchestrate this?
Load Balancing: Intelligent Task Distribution
Running 20 workers simultaneously is one thing. Keeping them all productively busy is another. Our load balancer uses a confidence-based routing system:
# MoE Router: Confidence-based expert selection
route_task_moe() {
    local task_description="$1"

    # Calculate confidence scores for each expert type
    dev_score=$(calculate_expert_score "$task_description" "development")
    sec_score=$(calculate_expert_score "$task_description" "security")
    inv_score=$(calculate_expert_score "$task_description" "inventory")

    # Route to highest-confidence expert
    if [ "$sec_score" -gt 80 ]; then
        strategy="single_expert"            # High confidence: 1 worker
    elif [ "$sec_score" -gt 60 ]; then
        strategy="multi_expert_parallel"    # Split confidence: 2-3 workers
    else
        strategy="fallback_routing"         # Low confidence: defer to the Layer 3 fallback below
    fi
}
The router analyzes task descriptions using a three-layer hybrid architecture:
- Keyword matching (Layer 1): Fast pattern recognition
- NLP classification (Layer 2): Semantic understanding
- Claude API fallback (Layer 3): Complex reasoning for edge cases
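Roughly what that layering looks like in shell: Layer 1 is a cheap keyword match, and anything it cannot classify falls through to the heavier layers. The keyword lists and the classify_with_nlp / classify_with_claude helpers are hypothetical placeholders for Layers 2 and 3, not Cortex's actual code:

# Sketch of the three-layer hybrid router; only Layer 1 is spelled out.
# Keyword lists and the Layer 2/3 helpers are illustrative placeholders.
route_task_layered() {
    local task_description="$1"

    # Layer 1: keyword matching (fast path)
    case "$task_description" in
        *CVE*|*vulnerability*|*security*)  echo "security"; return ;;
        *implement*|*feature*|*refactor*)  echo "development"; return ;;
        *document*|*README*|*changelog*)   echo "documentation"; return ;;
    esac

    # Layer 2: NLP classification (semantic understanding)
    local label
    if label=$(classify_with_nlp "$task_description"); then
        echo "$label"
        return
    fi

    # Layer 3: Claude API fallback for ambiguous edge cases
    classify_with_claude "$task_description"
}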
Load Balancing Strategies
We support multiple distribution strategies:
Round-robin: Simple, fair distribution across all workers
task_001 → worker-001
task_002 → worker-002
task_003 → worker-003
# ... cycles through pool
Least-loaded: Route to workers with smallest queue
# Worker pool state
worker-001: 3 tasks queued
worker-002: 0 tasks queued ← Route here
worker-003: 2 tasks queued
Type-specific: Match workers to task specialization
security-scan → scan-worker-pool
feature-implementation → implementation-worker-pool
documentation → documentation-worker-pool
Confidence-weighted: Multi-expert activation for complex tasks
# Task: "Fix CVE and document the patch"
routing_decision = {
  primary:   "security" (85% confidence),
  secondary: ["development" (70%), "inventory" (65%)]
}
# Activates 3 workers in parallel!
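Of these, least-loaded is the simplest to sketch. Assuming the pool state lives in the JSON file shown later in this post and each worker records a queue_depth field (an illustrative name, not the real schema):

# Sketch: least-loaded routing over the shared pool state.
# The queue_depth field and file path are illustrative, not the real schema.
least_loaded_worker() {
    jq -r '.active_workers
           | sort_by(.queue_depth)
           | .[0].worker_id' coordination/worker-pool.json
}

# Usage: assign_task "$(least_loaded_worker)" task-042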
Resource Allocation: Staying Within Limits
With great parallelism comes great responsibility—specifically, the responsibility not to run out of memory or API tokens. Our resource allocation system tracks budgets in real-time:
{
  "token_budget": {
    "total": 1000000,
    "allocated": 200000,
    "in_use": 85000,
    "available": 715000
  },
  "allocations": {
    "worker-scan-037": {
      "master": "security-master",
      "tokens": 8000,
      "allocated_at": "2025-11-23T12:47:50-0600"
    }
  }
}
Each worker type gets a predefined budget:
case $WORKER_TYPE in
    scan-worker)
        TOKEN_BUDGET=8000
        TIMEOUT_MINUTES=15
        ;;
    implementation-worker)
        TOKEN_BUDGET=10000
        TIMEOUT_MINUTES=45
        ;;
    documentation-worker)
        TOKEN_BUDGET=6000
        TIMEOUT_MINUTES=20
        ;;
esac
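Before dispatching, the orchestrator also has to check that the remaining budget can cover the worker's allocation. A minimal sketch with jq against a budget file shaped like the JSON above; the file path and the reserve step are assumptions, not the exact Cortex mechanism:

# Sketch: reserve tokens from the shared budget before spawning a worker.
# BUDGET_FILE and the reserve/commit split are illustrative.
BUDGET_FILE="coordination/token-budget.json"

reserve_tokens() {
    local worker_id="$1" tokens="$2"
    local available
    available=$(jq -r '.token_budget.available' "$BUDGET_FILE")

    if [ "$available" -lt "$tokens" ]; then
        echo "❌ Budget exhausted: need $tokens, only $available left" >&2
        return 1
    fi

    # Move tokens from available to allocated and record who holds them
    jq --arg id "$worker_id" --argjson t "$tokens" '
        .token_budget.available -= $t
        | .token_budget.allocated += $t
        | .allocations[$id] = {tokens: $t, allocated_at: (now | todate)}
    ' "$BUDGET_FILE" > "$BUDGET_FILE.tmp" && mv "$BUDGET_FILE.tmp" "$BUDGET_FILE"
}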
Memory-Efficient Worker Recycling
To prevent memory leaks and maintain consistent performance, we recycle workers that exceed thresholds:
recycle_worker_if_needed() {
    local worker_id="$1"
    local worker_pid="$2"

    # RSS in MB, truncated to an integer so the -gt comparison works
    local memory_mb
    memory_mb=$(ps -o rss= -p "$worker_pid" | awk '{printf "%d", $1/1024}')

    if [ "$memory_mb" -gt 500 ]; then
        echo "♻️ Recycling worker $worker_id (memory: ${memory_mb}MB)"
        graceful_shutdown "$worker_id"
        spawn_replacement_worker "$worker_id"
    fi
}
This keeps memory usage predictable and prevents the dreaded OOM crashes.
Coordination Without Conflicts
When 20 workers are running simultaneously, coordination becomes critical. How do we prevent race conditions? How do we ensure consistent state?
Git-Based Coordination
We use Git as our coordination backbone. Every worker syncs to the same shared state before it starts work and commits its results atomically when it finishes:
# Worker lifecycle: Always pull before work
pull_latest_state() {
    git pull origin main --quiet
}

# Worker completion: Atomic commits
commit_worker_results() {
    git add .
    git commit -m "feat(${MASTER}): worker-${ID} completed ${TASK}"
    git push origin main
}
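With 20 workers pushing to the same branch, a plain git push will regularly be rejected because another worker got there first. A hedged sketch of the retry loop you would wrap around that commit step; the attempt count and rebase strategy are illustrative choices, not the exact Cortex implementation:

# Sketch: push with rebase-and-retry so concurrent workers don't clobber each other.
push_with_retry() {
    local attempts=5
    for i in $(seq 1 "$attempts"); do
        if git push origin main --quiet; then
            return 0
        fi
        # Someone else pushed first: replay our commit on top of theirs
        git pull --rebase origin main --quiet
    done
    echo "❌ Could not push after $attempts attempts" >&2
    return 1
}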
File-Based Locking
For critical sections, we use advisory locks:
acquire_lock() {
    local lock_file="coordination/locks/$TASK_ID.lock"

    # Try to acquire lock with timeout (30 attempts × 0.1s ≈ 3 seconds)
    for i in {1..30}; do
        # mkdir is atomic, so the lock "file" is really a directory
        if mkdir "$lock_file" 2>/dev/null; then
            echo "$$" > "$lock_file/pid"
            return 0
        fi
        sleep 0.1
    done
    return 1  # Lock acquisition failed
}
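The release side matters just as much: the lock has to be dropped on every exit path, including crashes, and only by the process that owns it. A sketch under those assumptions:

# Sketch: release the lock, plus a trap so a dying worker doesn't leave it behind.
release_lock() {
    local lock_file="coordination/locks/$TASK_ID.lock"
    # Only remove the lock if this process actually owns it
    if [ "$(cat "$lock_file/pid" 2>/dev/null)" = "$$" ]; then
        rm -rf "$lock_file"
    fi
}

# Ensure the lock is dropped even if the worker crashes mid-task
trap release_lock EXIT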
Worker Pool State Management
The worker pool tracks state in a centralized JSON file:
{
  "active_workers": [
    {
      "worker_id": "worker-implementation-039",
      "worker_type": "implementation-worker",
      "spawned_by": "development-master",
      "status": "pending",
      "task_id": "task-elastic-apm-integration",
      "token_budget": 10000
    }
  ],
  "stats": {
    "total_spawned_today": 57,
    "total_completed_today": 3,
    "success_rate": 95
  }
}
This enables zero-conflict parallel updates through atomic Git operations.
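Updating that file from 20 workers at once is exactly the kind of critical section the advisory locks exist for. A sketch of how a status update might combine the lock with jq and an atomic rename; the field names mirror the JSON above, and the lock helpers are the ones sketched earlier, not Cortex's exact code:

# Sketch: mark a worker's status in the shared pool state without racing.
update_worker_status() {
    local worker_id="$1" new_status="$2"
    local state="coordination/worker-pool.json"

    acquire_lock || return 1   # serialize writers with the advisory lock above
    jq --arg id "$worker_id" --arg st "$new_status" '
        .active_workers |= map(
            if .worker_id == $id then .status = $st else . end
        )
        | .stats.total_completed_today += (if $st == "completed" then 1 else 0 end)
    ' "$state" > "$state.tmp" && mv "$state.tmp" "$state"   # rename is atomic
    release_lock
}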
Performance Numbers: Before and After
Let’s look at real performance benchmarks from production workloads:
Benchmark: 10-Task Security Scan
Sequential Execution (Before):
Task 1: [====================] 15s
Task 2: [====================] 15s
Task 3: [====================] 15s
...
Task 10: [====================] 15s
Total: 150 seconds (2.5 minutes)
Parallel Execution (After):
Task 1-10: [====================] 15s (all parallel)
Total: 15 seconds
Speedup: 10x 🚀
Benchmark: 20-Repository CVE Audit
$ time ./scripts/multi-repo-scan.sh --parallel --workers 20
# Results
Repositories scanned: 20
CVEs found: 47
Critical: 12, High: 18, Medium: 17
Real time: 0m18.342s
Worker spawn overhead: 0.2s avg
Scan time per repo: 13.1s avg
Parallel efficiency: 94.2%
Compare this to sequential execution:
$ time ./scripts/multi-repo-scan.sh --sequential
Real time: 4m42.118s # 20 repos × (2s spawn + 13s scan)
Speedup: 15.4x 🎯
Resource Utilization
┌─────────────────────────────────────────┐
│ Worker Pool Utilization (24 hours) │
├─────────────────────────────────────────┤
│ Light load (0-5 active): 18h (75%) │
│ Medium load (6-15 active): 4h (17%) │
│ Heavy load (16-20 active): 2h (8%) │
├─────────────────────────────────────────┤
│ Average workers active: 6.2 │
│ Peak utilization: 20 workers │
│ Efficiency: 78.3% │
└─────────────────────────────────────────┘
The sparse activation strategy keeps average resource usage at 31% while maintaining the ability to scale to 100% during peak loads.
Scaling Patterns: When to Stop Scaling
Adding more workers isn’t always the answer. We’ve identified three key inflection points:
1. The Coordination Overhead Threshold
Beyond ~25 parallel workers, coordination overhead starts dominating:
| Workers | Throughput | Coordination overhead |
|---|---|---|
| 5 | 250 t/min | 2% |
| 10 | 480 t/min | 4% |
| 20 | 920 t/min | 8% |
| 30 | 1200 t/min | 18% ← Diminishing returns |
| 40 | 1350 t/min | 32% ← Not worth it |
Sweet spot: 15-20 workers for most workloads.
2. The API Rate Limit Wall
Claude API enforces rate limits that cap practical parallelism:
# Claude API Limits (Enterprise Tier)
REQUESTS_PER_MINUTE=1000
TOKENS_PER_MINUTE=400000
# At 20 workers × 8000 tokens/task
MAX_PARALLEL_TASKS=$(( TOKENS_PER_MINUTE / 8000 )) # = 50
Practical limit: 20-30 workers before hitting rate limits.
3. The Memory Constraint
Each worker consumes ~300-500MB of memory:
# System: 32GB RAM available
# OS + overhead: 8GB
# Available for workers: 24GB
MAX_WORKERS=$(( 24000 / 400 )) # = 60 workers maximum
Our configuration: 20 max workers leaves headroom for spikes.
Auto-Scaling Decision Tree
if [ "$queue_size" -lt 5 ]; then
    target_workers=5     # Minimum pool
elif [ "$queue_size" -lt 20 ]; then
    target_workers=10    # Light load
elif [ "$queue_size" -lt 40 ]; then
    target_workers=20    # Optimal range
else
    target_workers=20    # Cap at maximum
    alert "Queue backlog - consider adding infrastructure"
fi
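The decision tree only chooses a number; a reconciliation step still has to move the pool toward it. A minimal sketch, where count_active_workers, spawn_worker, and retire_idle_worker are hypothetical helpers:

# Sketch: reconcile the pool toward the target chosen above.
reconcile_pool() {
    local target_workers="$1"
    local current
    current=$(count_active_workers)

    if [ "$current" -lt "$target_workers" ]; then
        # Scale up: spawn the difference
        for _ in $(seq 1 $(( target_workers - current ))); do
            spawn_worker
        done
    elif [ "$current" -gt "$target_workers" ]; then
        # Scale down: retire idle workers only, never kill in-flight tasks
        for _ in $(seq 1 $(( current - target_workers ))); do
            retire_idle_worker
        done
    fi
}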
Cost Implications of Parallel Execution
Parallelism isn’t free. Here’s the economic reality:
Token Budget Analysis
Sequential execution (10 tasks):
Cost = 10 tasks × 8,000 tokens × $0.015/1K
= 10 × 8 × $0.015
= $1.20 total
Time = 150 seconds
Parallel execution (10 tasks):
Cost = 10 tasks × 8,000 tokens × $0.015/1K
= $1.20 total (same cost!)
Time = 15 seconds (10x faster)
The key insight: Parallelism doesn’t increase token costs—you’re doing the same work, just faster.
Infrastructure Costs
What does change is infrastructure:
Sequential:
- 1 worker instance: $0.50/hour
- Runtime: 2.5 min = $0.02
- Total: $0.02 per batch
Parallel (20 workers):
- 20 worker instances: $10/hour
- Runtime: 15 sec = $0.04
- Total: $0.04 per batch
Premium: $0.02 (100% increase for 10x speed)
ROI Analysis
The ROI depends on your use case:
Development feedback loops (high value):
Sequential: 2.5 min feedback → 24 iterations/hour → $0.48/hour
Parallel: 15 sec feedback → 240 iterations/hour → $9.60/hour
Cost increase: 20x (still under $10/hour in absolute terms)
Productivity increase: 10x
ROI: Worth it! ✅
Batch processing (lower value):
Sequential: Process 1000 repos overnight → $12
Parallel: Process 1000 repos in 1 hour → $24
Cost increase: 2x
Time savings: 8 hours
ROI: Depends on urgency 🤔
Cost Optimization Strategies
- Adaptive pooling: Scale down during idle periods
# Night mode: 5 workers minimum
# Day mode: 20 workers maximum
- Spot instances: Use cheaper compute for non-critical tasks
WORKER_INSTANCE_TYPE="spot" # 70% cost savings
- Model tier selection: Route simple tasks to Haiku, complex to Opus
if [ "$complexity_score" -lt 4 ]; then
    model="claude-haiku"      # $0.0008/1K tokens
else
    model="claude-sonnet-4"   # $0.015/1K tokens
fi
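For the first of those strategies, the night/day switch can be as simple as recomputing the pool ceiling by hour before each scaling pass. A sketch with illustrative hour boundaries:

# Sketch: adaptive pool ceiling based on time of day (boundaries are illustrative).
adaptive_max_pool_size() {
    local hour
    hour=$(( 10#$(date +%H) ))   # force base-10 so "08" doesn't parse as octal
    if [ "$hour" -ge 8 ] && [ "$hour" -lt 20 ]; then
        echo 20   # Day mode: allow the full 20-worker pool
    else
        echo 5    # Night mode: shrink toward the 5-worker minimum
    fi
}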
Architecture Diagram
┌─────────────────────────────────────────────────────┐
│ Task Queue │
│ [task-001] [task-002] ... [task-020] │
└────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ MoE Router (Load Balancer) │
│ • Confidence-based routing │
│ • Sparse activation decisions │
│ • Resource budget tracking │
└────────────────────┬────────────────────────────────┘
│
┌──────────┴──────────┬──────────┬──────────┐
▼ ▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ...
│ Worker 1 │ │ Worker 2 │ │ Worker N │
│ (scan) │ │ (impl) │ │ (docs) │
│ 8K toks │ │ 10K toks │ │ 6K toks │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
│ Parallel Execution (20 max) │
│ │ │
└─────────────────────┴──────────────┘
│
▼
┌──────────────────────────────┐
│ Result Aggregator │
│ • Voting strategy │
│ • Weighted merge │
│ • Conflict resolution │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Unified Output │
│ coordination/results/ │
└──────────────────────────────┘
Key Takeaways
Building a parallel execution system for AI agents taught us several critical lessons:
1. Pre-warming is non-negotiable: 2-second cold starts kill responsiveness. Always maintain a warm pool.
2. Sparse activation scales: You don’t need all workers active all the time. MoE-inspired sparse activation (14% base, scale to 100%) is efficient and responsive.
3. Intelligent routing matters: Confidence-based task distribution prevents resource waste and improves success rates.
4. Coordination is hard: Git-based state + advisory locks + atomic operations are essential for conflict-free parallel execution.
5. Scale to your constraints: Don’t chase infinite parallelism. Find your sweet spot (for us: 20 workers) based on API limits, memory, and coordination overhead.
6. Cost follows value: Parallelism costs more in infrastructure but the same in tokens. ROI depends on whether speed matters.
7. Monitor and adapt: Real-time metrics on worker utilization, queue depth, and resource consumption enable dynamic optimization.
What’s Next
We’re continuing to push the boundaries of parallel AI agent execution:
- Distributed worker pools across multiple machines
- Heterogeneous workers (mixing Claude, GPT-4, local models)
- Predictive scaling using ML to forecast load patterns
- Cross-repository task batching for even better parallelism
The future of AI agent orchestration is massively parallel, intelligently coordinated, and ruthlessly optimized. We’re just getting started.
Want to dive deeper into Cortex’s architecture? Check out the MoE Architecture documentation or explore the worker pool management ADR.
Have questions about parallel AI agent execution? Found this useful? Let me know on Twitter or GitHub.