Master-Worker Architecture: Cortex’s Foundation
Cortex orchestrates complex workflows across multiple agents using a master-worker architecture. This pattern enables scalable, fault-tolerant, and intelligent task execution.
Let’s explore how it works.
The Pattern
Core Components
Masters (5 total)
- High-level coordinators
- Make strategic decisions
- Spawn and manage workers
- Track outcomes and learn
Workers (7 types)
- Execute specific tasks
- Report progress to masters
- Lightweight and disposable
- Designed for parallel execution
Coordination Layer
- JSONL event streams
- Task queues
- State management
- Health monitoring
Why Master-Worker?
I evaluated several architectural patterns:
❌ Monolithic Agent
Single agent does everything
+ Simple to implement
- No specialization
- Hard to scale
- Single point of failure
❌ Peer-to-Peer Agents
Agents communicate directly
+ Decentralized
- Complex coordination
- Race conditions
- Difficult debugging
✅ Master-Worker
Masters coordinate, workers execute
+ Clear responsibility
+ Easy to scale
+ Fault tolerant
+ Learnable patterns
The master-worker pattern won because it maps naturally to the Mixture of Experts concept: masters are experts, workers are executors.
The 5 Masters
Cortex’s master-worker architecture consists of one coordinator and four specialist masters:
graph TD
A[Coordinator Master<br/>Meta-coordinator & Router] --> B[Development Master<br/>Code & Implementation]
A --> C[Security Master<br/>Audits & Remediation]
A --> D[Inventory Master<br/>Cataloging & Docs]
A --> E[CI/CD Master<br/>Build & Deploy]
B --> B1[Implementation Worker]
B --> B2[Fix Worker]
B --> B3[Test Worker]
C --> C1[Scan Worker]
C --> C2[Security-Fix Worker]
D --> D1[Documentation Worker]
D --> D2[Analysis Worker]
E --> E1[Test Worker]
E --> E2[Implementation Worker]
style A fill:#30363d,stroke:#58a6ff,stroke-width:3px
style B fill:#30363d,stroke:#00d084,stroke-width:2px
style C fill:#30363d,stroke:#cf2e2e,stroke-width:2px
style D fill:#30363d,stroke:#9b51e0,stroke-width:2px
style E fill:#30363d,stroke:#ff6900,stroke-width:2px
1. Coordinator Master
Role: Meta-coordinator, routes tasks to specialist masters
Responsibilities:
- Receive incoming tasks
- Analyze task requirements
- Calculate routing confidence
- Select appropriate specialist master
- Track cross-master workflows
Decision Example:
Task: "Implement rate limiting with security audit"
Analysis:
  Primary: Development (implementation)
  Secondary: Security (audit)
Route:
  1. Development-Master (implement)
  2. Security-Master (audit)
  3. CI/CD-Master (deploy)
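To make the routing step concrete, here is a minimal sketch of keyword-based routing. It is not Cortex's actual router: the keyword lists, the master ids, the `routeTask` name, and the rule that the earliest-mentioned domain becomes the primary are all assumptions made for illustration.

```javascript
// Hypothetical keyword lists per specialist master (not taken from Cortex).
const DOMAIN_KEYWORDS = {
  "development-master": ["implement", "refactor", "bug", "feature"],
  "security-master": ["security", "audit", "cve", "vulnerability"],
  "inventory-master": ["catalog", "document", "dependency", "inventory"],
  "cicd-master": ["build", "deploy", "release", "pipeline"],
};

// For each master, record which of its keywords appear in the task text.
// Confidence is the fraction of keywords matched; masters are ordered by
// where their first keyword appears, on the assumption that the leading
// verb names the primary work.
function routeTask(description) {
  const text = description.toLowerCase();
  const matches = [];
  for (const [master, keywords] of Object.entries(DOMAIN_KEYWORDS)) {
    const positions = keywords
      .map((keyword) => text.indexOf(keyword))
      .filter((position) => position >= 0);
    if (positions.length === 0) continue;
    matches.push({
      master,
      confidence: positions.length / keywords.length,
      firstMention: Math.min(...positions),
    });
  }
  return matches.sort((a, b) => a.firstMention - b.firstMention);
}

const route = routeTask("Implement rate limiting with security audit");
console.log(route.map((r) => `${r.master} (${r.confidence})`));
```

For the task above, this sketch picks Development as primary and Security as secondary. A deploy step like the one in the route would have to be appended by a separate rule, since the task text never mentions deployment.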
2. Development Master
Role: Code implementation and improvements
Responsibilities:
- Feature development
- Bug fixes
- Code refactoring
- Technical debt reduction
Worker Types:
- Implementation worker
- Fix worker
- Test worker
- Analysis worker
Typical Workflow:
Task: "Add user authentication API"
↓
Development-Master receives task
↓
Spawns implementation-worker-001
↓
Worker implements feature
↓
Worker reports completion
↓
Master validates output
↓
Records pattern for learning
3. Security Master
Role: Security auditing and remediation
Responsibilities:
- Vulnerability scanning
- CVE remediation
- Security audits
- Compliance monitoring
Worker Types:
- Scan worker
- Security-fix worker
- Analysis worker
Real Example from Cortex:
Scan detected: 10 Path Traversal CVEs (CWE-23)
Security-Master:
1. Spawned scan-worker-001 (identify vulnerabilities)
2. Spawned security-fix-worker-002 (fix each CVE)
3. Spawned scan-worker-003 (verify fixes)
Result: All 10 CVEs fixed in < 2 hours
4. Inventory Master
Role: Repository cataloging and documentation
Responsibilities:
- Discover repositories
- Generate documentation
- Track dependencies
- Monitor health
Worker Types:
- Documentation worker
- Analysis worker
5. CI/CD Master
Role: Build, test, and deployment automation
Responsibilities:
- Build orchestration
- Test execution
- Deployment automation
- Release management
Worker Types:
- Test worker
- Implementation worker (for pipeline changes)
Worker Lifecycle
A worker progresses through 5 distinct states from creation to cleanup:
stateDiagram-v2
[*] --> Spawn: Master creates worker
Spawn --> Execute: Task assigned
Execute --> Report: Progress updates
Report --> Execute: Continue working
Report --> Complete: Task finished
Complete --> Cleanup: Record patterns
Cleanup --> [*]: Worker terminated
note right of Spawn
Worker ID assigned
Task context loaded
end note
note right of Execute
Autonomous execution
Real-time progress
end note
note right of Complete
Success/Failure logged
Quality score recorded
end note
1. Spawn
Master creates a worker for a specific task:
{
  "worker_id": "implementation-worker-001",
  "master": "development-master",
  "task_id": "task-feature-123",
  "priority": "high",
  "created_at": "2025-11-26T10:00:00Z"
}
2. Execute
Worker runs autonomously:
{
  "worker_id": "implementation-worker-001",
  "status": "in_progress",
  "progress": {
    "files_modified": 3,
    "tests_added": 12,
    "completion": 0.65
  }
}
3. Report
Worker sends progress updates:
{
  "worker_id": "implementation-worker-001",
  "event": "progress_update",
  "message": "Implemented authentication endpoints",
  "timestamp": "2025-11-26T10:15:00Z"
}
4. Complete
Worker finishes and reports outcome:
{
  "worker_id": "implementation-worker-001",
  "status": "completed",
  "outcome": "success",
  "quality_score": 0.92,
  "artifacts": ["auth.js", "auth.test.js", "README.md"]
}
5. Cleanup
Master terminates worker and records patterns:
{
  "pattern": "authentication implementation",
  "master": "development-master",
  "outcome": "success",
  "duration_minutes": 18,
  "confidence": 0.92
}
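The five lifecycle states can be enforced with a small transition table. The sketch below takes its transitions from the state diagram in this section. The function names (`createWorker`, `advance`) and the lowercase state names are illustrative assumptions, not Cortex's API.

```javascript
// Allowed transitions, mirroring the lifecycle state diagram.
const TRANSITIONS = {
  spawn: ["execute"],
  execute: ["report"],
  report: ["execute", "complete"],
  complete: ["cleanup"],
  cleanup: [], // terminal state: the worker is terminated
};

// Create a worker record in its initial state (hypothetical shape).
function createWorker(workerId) {
  return { workerId, state: "spawn", history: ["spawn"] };
}

// Move a worker to the next state, rejecting transitions the diagram forbids.
function advance(worker, nextState) {
  const allowed = TRANSITIONS[worker.state];
  if (!allowed.includes(nextState)) {
    throw new Error(`Invalid transition: ${worker.state} -> ${nextState}`);
  }
  worker.state = nextState;
  worker.history.push(nextState);
  return worker;
}

const worker = createWorker("implementation-worker-001");
["execute", "report", "execute", "report", "complete", "cleanup"].forEach(
  (state) => advance(worker, state)
);
console.log(worker.history.join(" -> "));
```

Keeping the table as data means a master can reject an out-of-order report, such as a completion from a worker that never executed, without any special-case code.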
Coordination Mechanisms
Event Streams (JSONL)
Every action creates an event that flows through the coordination timeline:
sequenceDiagram
participant T as Task Queue
participant C as Coordinator
participant M as Development Master
participant W as Implementation Worker
T->>C: task_received (task-001)
Note over C: Analyze & route
C->>M: master_assigned
Note over M: Select worker type
M->>W: worker_spawned (worker-001)
Note over W: Execute task
W->>M: progress_update (30%)
W->>M: progress_update (65%)
W->>M: task_completed (success)
M->>C: outcome_recorded
Note over C: Update patterns
The same flow, recorded as a JSONL event stream:
{"event":"task_received","task_id":"task-001","timestamp":"2025-11-26T10:00:00Z"}
{"event":"master_assigned","master":"development-master","task_id":"task-001"}
{"event":"worker_spawned","worker_id":"implementation-worker-001","task_id":"task-001"}
{"event":"progress_update","worker_id":"implementation-worker-001","completion":0.3}
{"event":"task_completed","task_id":"task-001","outcome":"success"}
Benefits:
- Full audit trail
- Easy debugging
- Pattern analysis
- Replay capability
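As a rough illustration of the replay benefit, the sketch below parses a JSONL stream shaped like the events in this section and folds it into the latest known state per task. The function names and the shape of the resulting state object are assumptions. Real code would read the stream from a file such as `coordinator-events.jsonl` rather than from an in-memory string.

```javascript
// Serialize one event as a single JSONL line (one JSON object per line).
function toJsonlLine(event) {
  return JSON.stringify(event) + "\n";
}

// Parse JSONL text back into event objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Replay: fold the event stream into the latest known state per task.
// progress_update events carry only a worker_id, so the worker-to-task
// mapping from worker_spawned is remembered along the way.
function replay(events) {
  const tasks = {};
  const taskOfWorker = {};
  for (const e of events) {
    if (e.event === "worker_spawned") taskOfWorker[e.worker_id] = e.task_id;
    const taskId = e.task_id !== undefined ? e.task_id : taskOfWorker[e.worker_id];
    if (taskId === undefined) continue;
    if (!tasks[taskId]) tasks[taskId] = { lastEvent: null, completion: 0 };
    const task = tasks[taskId];
    task.lastEvent = e.event;
    if (e.master) task.master = e.master;
    if (typeof e.completion === "number") task.completion = e.completion;
    if (e.outcome) task.outcome = e.outcome;
  }
  return tasks;
}

const stream = [
  { event: "task_received", task_id: "task-001", timestamp: "2025-11-26T10:00:00Z" },
  { event: "master_assigned", master: "development-master", task_id: "task-001" },
  { event: "worker_spawned", worker_id: "implementation-worker-001", task_id: "task-001" },
  { event: "progress_update", worker_id: "implementation-worker-001", completion: 0.3 },
  { event: "task_completed", task_id: "task-001", outcome: "success" },
].map(toJsonlLine).join("");

const replayed = replay(parseJsonl(stream));
console.log(replayed["task-001"]);
```

Because the stream is append-only, replaying a prefix of it reconstructs the state at any earlier point in time, which is what makes debugging after the fact practical.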
Task Queues
Priority-based task scheduling:
{
  "task_queue": [
    {"task_id": "task-001", "priority": "critical", "age_minutes": 2},
    {"task_id": "task-002", "priority": "high", "age_minutes": 15},
    {"task_id": "task-003", "priority": "medium", "age_minutes": 45}
  ]
}
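One plausible way to order such a queue is by priority first, then by age so that older tasks of equal priority go first. The sketch below assumes that rule. The `low` priority level, the `orderQueue` name, and the extra `task-004` entry are hypothetical and added only to show the tie-break.

```javascript
// Assumed ranking: lower number = served first. "low" is an assumed level.
const PRIORITY_RANK = { critical: 0, high: 1, medium: 2, low: 3 };

// Return a new array ordered by priority, breaking ties by age (oldest first).
function orderQueue(queue) {
  return [...queue].sort(
    (a, b) =>
      PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] ||
      b.age_minutes - a.age_minutes
  );
}

const ordered = orderQueue([
  { task_id: "task-003", priority: "medium", age_minutes: 45 },
  { task_id: "task-002", priority: "high", age_minutes: 15 },
  { task_id: "task-001", priority: "critical", age_minutes: 2 },
  { task_id: "task-004", priority: "high", age_minutes: 40 }, // hypothetical extra task
]);
console.log(ordered.map((t) => t.task_id));
// prints [ 'task-001', 'task-004', 'task-002', 'task-003' ]
```

A strict priority order like this can starve old medium-priority tasks. Whether Cortex boosts priority with age is not stated here, so the sketch leaves that out.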
State Management
Distributed state across coordination files:
coordination/
├── worker-pool.json # Active workers
├── task-queue.json # Pending tasks
├── master-health.json # Master status
├── events/ # Event streams
│ ├── coordinator-events.jsonl
│ ├── development-events.jsonl
│ └── security-events.jsonl
└── memory/
└── working/
└── pool-state.json # Current system state
Scaling Properties
Horizontal Scaling
Add more workers without changing masters:
Before: 3 workers per master
After: 20 workers per master
Change: Zero code changes, just configuration
Vertical Scaling
Add more masters for new domains:
Initial: 4 specialist masters (plus the coordinator)
Add: Documentation Master
Result: 5 specialist masters, 8 worker types
Load Balancing
Masters automatically balance worker distribution:
// Spawn a worker only when there is spare capacity and pending work
if (activeWorkers < maxWorkers && taskQueue.length > 0) {
  spawnNewWorker(nextTask);
}
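The capacity check above handles one task at a time. A slightly fuller sketch, assuming the master fills every free slot in a single pass, could look like this. The `dispatch` name and its return shape are illustrative only.

```javascript
// Decide which queued tasks get a worker now, given the current capacity.
// Returns the tasks to spawn workers for and the tasks that stay queued.
function dispatch(taskQueue, activeWorkers, maxWorkers) {
  const freeSlots = Math.max(0, maxWorkers - activeWorkers);
  return {
    toSpawn: taskQueue.slice(0, freeSlots),
    remaining: taskQueue.slice(freeSlots),
  };
}

// 18 of 20 worker slots are busy, so only two of the three tasks start now.
const plan = dispatch(["task-001", "task-002", "task-003"], 18, 20);
console.log(plan.toSpawn, plan.remaining);
```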
Fault Tolerance
Worker Failures
Workers are disposable by design:
Worker crashes?
→ Master detects timeout
→ Spawns replacement worker
→ Retries task
→ Records failure pattern
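The recovery loop above might be sketched as follows. This is a simplified, synchronous model: `runWorker` stands in for whatever actually launches a worker process, a thrown error stands in for a detected timeout, and the three-attempt limit is an assumption.

```javascript
// Run a task, replacing the worker after each failure up to maxAttempts.
// Every failure is recorded so the master can learn from the pattern.
function runWithRetry(task, runWorker, maxAttempts = 3) {
  const failures = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Each attempt gets a fresh, disposable worker with its own id.
    const workerId = `${task.workerType}-worker-${String(attempt).padStart(3, "0")}`;
    try {
      const result = runWorker(workerId, task);
      return { outcome: "success", workerId, result, failures };
    } catch (err) {
      failures.push({ workerId, reason: err.message });
    }
  }
  return { outcome: "failed", workerId: null, result: null, failures };
}

// Hypothetical worker that times out on its first run only.
let calls = 0;
const flakyWorker = (workerId) => {
  calls += 1;
  if (calls === 1) throw new Error("timeout");
  return `done by ${workerId}`;
};

const retryResult = runWithRetry({ task_id: "task-001", workerType: "fix" }, flakyWorker);
console.log(retryResult.outcome, retryResult.workerId, retryResult.failures);
```

Because the replacement is a brand-new worker rather than a restart of the old one, no state from the failed attempt needs to be repaired.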
Master Failures
Masters have heartbeat monitoring:
Master stops responding?
→ Coordinator detects failure
→ Fails over to backup master
→ Reassigns pending tasks
→ Alerts operators
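Heartbeat detection can be as simple as comparing each master's last heartbeat against a silence threshold. The sketch below assumes a health map of master id to ISO timestamp. The real layout of `master-health.json` is not shown in this post, and the two-minute threshold is arbitrary.

```javascript
// Return the ids of masters whose last heartbeat is older than maxSilenceMs.
function findUnresponsiveMasters(health, nowMs, maxSilenceMs) {
  return Object.entries(health)
    .filter(([, lastHeartbeat]) => nowMs - Date.parse(lastHeartbeat) > maxSilenceMs)
    .map(([master]) => master);
}

// Assumed shape: master id -> ISO 8601 timestamp of the last heartbeat.
const health = {
  "development-master": "2025-11-26T10:00:00Z",
  "security-master": "2025-11-26T09:55:00Z",
};

const now = Date.parse("2025-11-26T10:01:00Z");
const down = findUnresponsiveMasters(health, now, 2 * 60 * 1000);
console.log(down); // prints [ 'security-master' ]
```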
Zombie Cleanup
Automated cleanup of stuck processes:
// Zombie cleanup daemon runs every 5 minutes
detectZombies()
  .filter(worker => worker.idle_minutes > 30)
  .forEach(worker => {
    terminateWorker(worker.id);
    logZombieCleanup(worker);
  });
Performance Characteristics
Latency
Task receipt → Worker spawn: < 100ms
Worker spawn → First action: < 500ms
Total task latency: 1-30 minutes (task-dependent)
Throughput
Tasks per hour: 20-100 (depending on complexity)
Concurrent workers: up to 20
Master overhead: < 5% CPU per master
Resource Usage
Master process: ~50MB RAM, 1-2% CPU
Worker process: ~100MB RAM, 5-20% CPU (task-dependent)
Total system: ~1GB RAM, 15-30% CPU at peak
Real-World Example: Security Audit
Let’s trace a complex multi-master workflow that demonstrates coordination between three masters:
sequenceDiagram
participant C as Coordinator
participant SM as Security Master
participant DM as Development Master
participant W1 as Scan Workers
participant W2 as Fix Workers
participant W3 as Verify Worker
C->>C: Analyze: High complexity<br/>Domains: Security + Dev
C->>SM: Route to Security Master
Note over SM: Step 3: Security Scan
SM->>W1: Spawn 3 scan workers
W1->>W1: CVE scan<br/>Static analysis<br/>Dependency audit
W1-->>SM: 3 findings (45 min)
SM-->>C: Scan complete, 3 CVEs
C->>DM: Handoff to Development Master
Note over DM: Step 4: Remediation
DM->>W2: Spawn 3 fix workers
W2->>W2: Fix CVE-2024-001<br/>Fix CVE-2024-002<br/>Add validation
W2-->>DM: All issues resolved (90 min)
DM-->>C: Fixes complete
C->>SM: Verification handoff
Note over SM: Step 5: Verification
SM->>W3: Spawn verification worker
W3->>W3: Re-scan for CVEs
W3-->>SM: Clean scan (15 min)
SM-->>C: Audit complete ✓
Note over C: Total: 2.5 hours<br/>3 masters, 7 workers<br/>3 CVEs fixed
Task: “Comprehensive security audit of authentication feature”
Step 1: Coordinator Analysis
{
  "task": "Comprehensive security audit of authentication feature",
  "complexity": "high",
  "domains": ["security", "development"],
  "estimated_duration": "2-4 hours"
}
Step 2: Multi-Master Routing
{
  "primary": "security-master",
  "secondary": "development-master",
  "workflow": "sequential"
}
Step 3: Security-Master Execution
{
  "master": "security-master",
  "workers": [
    "scan-worker-001: CVE scanning",
    "scan-worker-002: Static analysis",
    "scan-worker-003: Dependency audit"
  ],
  "duration": "45 minutes",
  "findings": 3
}
Step 4: Development-Master Remediation
{
  "master": "development-master",
  "workers": [
    "fix-worker-001: Fix CVE-2024-001",
    "fix-worker-002: Fix CVE-2024-002",
    "implementation-worker-003: Add missing validation"
  ],
  "duration": "90 minutes",
  "outcome": "all issues resolved"
}
Step 5: Verification
{
  "master": "security-master",
  "workers": [
    "scan-worker-004: Re-scan for CVEs"
  ],
  "duration": "15 minutes",
  "result": "clean"
}
Total: 2.5 hours, 3 masters, 7 workers, 3 CVEs fixed
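Because the workflow is sequential, the total is just the sum of the step durations. A quick sketch using the numbers from steps 3 to 5 confirms the figures. The `summarize` helper and the step objects are illustrative, not part of Cortex.

```javascript
// Durations and worker counts taken from steps 3, 4, and 5 of this example.
const steps = [
  { master: "security-master", workers: 3, minutes: 45 },    // scan
  { master: "development-master", workers: 3, minutes: 90 }, // remediation
  { master: "security-master", workers: 1, minutes: 15 },    // verification
];

// In a sequential workflow, step durations simply add up.
function summarize(workflowSteps) {
  const totalMinutes = workflowSteps.reduce((sum, s) => sum + s.minutes, 0);
  const totalWorkers = workflowSteps.reduce((sum, s) => sum + s.workers, 0);
  return { hours: totalMinutes / 60, workers: totalWorkers };
}

const summary = summarize(steps);
console.log(summary); // prints { hours: 2.5, workers: 7 }
```

The count of three masters includes the Coordinator, which routes the work but spawns no workers of its own.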
Key Design Decisions
1. JSONL Over Database
Why: Simplicity, append-only writes, easy debugging
Trade-off: No complex queries, but Cortex doesn’t need them
2. File-Based State Over Redis
Why: No external dependencies, easy backup
Trade-off: Slower than in-memory storage, but fast enough
3. Process-Based Workers Over Threads
Why: Better isolation, easier cleanup
Trade-off: Higher overhead, but more reliable
Tomorrow’s Topic
Tomorrow, I’ll share the day-by-day story of Cortex’s 4-week build - the decisions, challenges, and breakthroughs from idea to production.
Key Takeaways
- Master-worker pattern enables scalable distributed orchestration
- 5 masters (one coordinator, four specialists) handle routing and distinct domains
- 7 worker types execute specific tasks
- JSONL events provide full audit trail
- Fault tolerance through disposable workers
- Performance scales horizontally and vertically
The master-worker architecture isn’t just a design pattern - it’s the foundation that makes Cortex’s self-improving MoE system possible.
Learn More About Cortex
Want to dive deeper into how Cortex works? Visit the Meet Cortex page to learn about its architecture, capabilities, and how it scales from 1 to 100+ agents on-demand.
Part 4 of the Cortex series. Next: From Idea to Production in 28 Days