Master-Worker Architecture: Cortex Foundation

Ryan Dahlberg
December 5, 2025 9 min read

Cortex orchestrates complex workflows across multiple agents using a master-worker architecture. This pattern enables scalable, fault-tolerant, and intelligent task execution.

Let’s explore how it works.

The Pattern

Core Components

Masters (5 total)

  • High-level coordinators
  • Make strategic decisions
  • Spawn and manage workers
  • Track outcomes and learn

Workers (7 types)

  • Execute specific tasks
  • Report progress to masters
  • Lightweight and disposable
  • Designed for parallel execution

Coordination Layer

  • JSONL event streams
  • Task queues
  • State management
  • Health monitoring

Why Master-Worker?

I evaluated several architectural patterns:

❌ Monolithic Agent

Single agent does everything
+ Simple to implement
- No specialization
- Hard to scale
- Single point of failure

❌ Peer-to-Peer Agents

Agents communicate directly
+ Decentralized
- Complex coordination
- Race conditions
- Difficult debugging

✅ Master-Worker

Masters coordinate, workers execute
+ Clear responsibility
+ Easy to scale
+ Fault tolerant
+ Learnable patterns

The master-worker pattern won because it maps naturally onto the Mixture of Experts (MoE) concept: the Coordinator acts as the router, the specialist masters are the experts, and workers are the executors.

The 5 Masters

Cortex’s master-worker architecture consists of one coordinator and four specialist masters:

graph TD
    A[Coordinator Master<br/>Meta-coordinator & Router] --> B[Development Master<br/>Code & Implementation]
    A --> C[Security Master<br/>Audits & Remediation]
    A --> D[Inventory Master<br/>Cataloging & Docs]
    A --> E[CI/CD Master<br/>Build & Deploy]

    B --> B1[Implementation Worker]
    B --> B2[Fix Worker]
    B --> B3[Test Worker]

    C --> C1[Scan Worker]
    C --> C2[Security-Fix Worker]

    D --> D1[Documentation Worker]
    D --> D2[Analysis Worker]

    E --> E1[Test Worker]
    E --> E2[Implementation Worker]

    style A fill:#30363d,stroke:#58a6ff,stroke-width:3px
    style B fill:#30363d,stroke:#00d084,stroke-width:2px
    style C fill:#30363d,stroke:#cf2e2e,stroke-width:2px
    style D fill:#30363d,stroke:#9b51e0,stroke-width:2px
    style E fill:#30363d,stroke:#ff6900,stroke-width:2px

1. Coordinator Master

Role: Meta-coordinator, routes tasks to specialist masters

Responsibilities:

  • Receive incoming tasks
  • Analyze task requirements
  • Calculate routing confidence
  • Select appropriate specialist master
  • Track cross-master workflows

Decision Example:

Task: "Implement rate limiting with security audit"

Analysis:
  Primary: Development (implementation)
  Secondary: Security (audit)

Route:
  1. Development-Master (implement)
  2. Security-Master (audit)
  3. CI/CD-Master (deploy)
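
A minimal sketch of how a routing decision like this could be computed, assuming plain keyword matching. The keyword table, the `routeTask` name, and the confidence formula are illustrative assumptions, not Cortex's actual router. Masters are ordered by where their first keyword appears, on the assumption that a task description names its steps in execution order.

```javascript
// Hypothetical keyword table; Cortex's real routing signals are not shown in this post.
const MASTER_KEYWORDS = {
  "development-master": ["implement", "refactor", "bug", "feature"],
  "security-master": ["security", "audit", "cve", "vulnerability"],
  "cicd-master": ["deploy", "build", "release", "pipeline"],
  "inventory-master": ["catalog", "document", "inventory", "dependencies"],
};

// For each master, find which of its keywords occur in the task text.
// confidence = share of keywords matched; firstMention = earliest match position.
function routeTask(description) {
  const text = description.toLowerCase();
  return Object.entries(MASTER_KEYWORDS)
    .map(([master, keywords]) => {
      const positions = keywords
        .map((word) => text.indexOf(word))
        .filter((index) => index >= 0);
      return {
        master,
        confidence: positions.length / keywords.length,
        firstMention: positions.length > 0 ? Math.min(...positions) : -1,
      };
    })
    .filter((candidate) => candidate.firstMention >= 0)
    .sort((a, b) => a.firstMention - b.firstMention);
}

const route = routeTask("Implement rate limiting with security audit");
console.log(route.map((step) => `${step.master} (${step.confidence})`));
```

This sketch only finds the Development and Security steps. The CI/CD deploy step in the route above is not named in the task text, so it would need an additional rule, for example "any implementation task ends with a deploy".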

2. Development Master

Role: Code implementation and improvements

Responsibilities:

  • Feature development
  • Bug fixes
  • Code refactoring
  • Technical debt reduction

Worker Types:

  • Implementation worker
  • Fix worker
  • Test worker
  • Analysis worker

Typical Workflow:

Task: "Add user authentication API"

Development-Master receives task

Spawns implementation-worker-001

Worker implements feature

Worker reports completion

Master validates output

Records pattern for learning

3. Security Master

Role: Security auditing and remediation

Responsibilities:

  • Vulnerability scanning
  • CVE remediation
  • Security audits
  • Compliance monitoring

Worker Types:

  • Scan worker
  • Security-fix worker
  • Analysis worker

Real Example from Cortex:

Scan detected: 10 Path Traversal CVEs (CWE-23)

Security-Master:
1. Spawned scan-worker-001 (identify vulnerabilities)
2. Spawned security-fix-worker-002 (fix each CVE)
3. Spawned scan-worker-003 (verify fixes)

Result: All 10 CVEs fixed in < 2 hours

4. Inventory Master

Role: Repository cataloging and documentation

Responsibilities:

  • Discover repositories
  • Generate documentation
  • Track dependencies
  • Monitor health

Worker Types:

  • Documentation worker
  • Analysis worker

5. CI/CD Master

Role: Build, test, and deployment automation

Responsibilities:

  • Build orchestration
  • Test execution
  • Deployment automation
  • Release management

Worker Types:

  • Test worker
  • Implementation worker (for pipeline changes)

Worker Lifecycle

A worker progresses through 5 distinct states from creation to cleanup:

stateDiagram-v2
    [*] --> Spawn: Master creates worker
    Spawn --> Execute: Task assigned
    Execute --> Report: Progress updates
    Report --> Execute: Continue working
    Report --> Complete: Task finished
    Complete --> Cleanup: Record patterns
    Cleanup --> [*]: Worker terminated

    note right of Spawn
        Worker ID assigned
        Task context loaded
    end note

    note right of Execute
        Autonomous execution
        Real-time progress
    end note

    note right of Complete
        Success/Failure logged
        Quality score recorded
    end note
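
The diagram can be read as a small state machine. The sketch below encodes its edges as a lookup table and rejects any transition the diagram does not allow. The lowercase state names and the `advance` helper are assumptions made for illustration.

```javascript
// Allowed transitions, copied from the state diagram above.
const TRANSITIONS = {
  spawn: ["execute"],
  execute: ["report"],
  report: ["execute", "complete"],
  complete: ["cleanup"],
  cleanup: [],
};

// Return the new state, or throw if the diagram has no such edge.
function advance(current, next) {
  const allowed = TRANSITIONS[current] || [];
  if (!allowed.includes(next)) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}

// Walk one worker through a full lifecycle with two progress reports.
const path = ["execute", "report", "execute", "report", "complete", "cleanup"];
const finalState = path.reduce(advance, "spawn");
console.log(finalState);
```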

1. Spawn

Master creates a worker for a specific task:

{
  "worker_id": "implementation-worker-001",
  "master": "development-master",
  "task_id": "task-feature-123",
  "priority": "high",
  "created_at": "2025-11-26T10:00:00Z"
}
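
A sketch of how a master might build this record, assuming worker IDs are the worker type plus a zero-padded counter (which matches the IDs used in this post). The `createSpawnRecord` function and its parameter names are hypothetical.

```javascript
// Build a spawn record in the shape shown above.
// The zero-padded counter and the function name are assumptions for illustration.
function createSpawnRecord({ workerType, sequence, master, taskId, priority, now = new Date() }) {
  const suffix = String(sequence).padStart(3, "0");
  return {
    worker_id: `${workerType}-worker-${suffix}`,
    master,
    task_id: taskId,
    priority,
    created_at: now.toISOString().replace(/\.\d{3}Z$/, "Z"), // drop milliseconds
  };
}

const record = createSpawnRecord({
  workerType: "implementation",
  sequence: 1,
  master: "development-master",
  taskId: "task-feature-123",
  priority: "high",
  now: new Date("2025-11-26T10:00:00Z"),
});
console.log(JSON.stringify(record, null, 2));
```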

2. Execute

The worker runs autonomously while its status record tracks progress:

{
  "worker_id": "implementation-worker-001",
  "status": "in_progress",
  "progress": {
    "files_modified": 3,
    "tests_added": 12,
    "completion": 0.65
  }
}

3. Report

Worker sends progress updates:

{
  "worker_id": "implementation-worker-001",
  "event": "progress_update",
  "message": "Implemented authentication endpoints",
  "timestamp": "2025-11-26T10:15:00Z"
}

4. Complete

Worker finishes and reports outcome:

{
  "worker_id": "implementation-worker-001",
  "status": "completed",
  "outcome": "success",
  "quality_score": 0.92,
  "artifacts": ["auth.js", "auth.test.js", "README.md"]
}

5. Cleanup

Master terminates worker and records patterns:

{
  "pattern": "authentication implementation",
  "master": "development-master",
  "outcome": "success",
  "duration_minutes": 18,
  "confidence": 0.92
}
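
One way such a pattern record could be derived from the spawn and completion records. Reusing `quality_score` as `confidence` is an assumption suggested by the matching 0.92 values above, and the completion timestamp is hypothetical because the completion record shown earlier does not carry one.

```javascript
// Derive a pattern record from a spawn record and a completion record.
// Assumption: confidence is taken directly from the worker's quality_score.
function toPatternRecord(pattern, spawn, completion, completedAt) {
  const elapsedMs = new Date(completedAt) - new Date(spawn.created_at);
  return {
    pattern,
    master: spawn.master,
    outcome: completion.outcome,
    duration_minutes: Math.round(elapsedMs / 60000),
    confidence: completion.quality_score,
  };
}

const patternRecord = toPatternRecord(
  "authentication implementation",
  { master: "development-master", created_at: "2025-11-26T10:00:00Z" },
  { outcome: "success", quality_score: 0.92 },
  "2025-11-26T10:18:00Z" // hypothetical completion time
);
console.log(patternRecord);
```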

Coordination Mechanisms

Event Streams (JSONL)

Every action creates an event that flows through the coordination timeline:

sequenceDiagram
    participant T as Task Queue
    participant C as Coordinator
    participant M as Development Master
    participant W as Implementation Worker

    T->>C: task_received (task-001)
    Note over C: Analyze & route
    C->>M: master_assigned
    Note over M: Select worker type
    M->>W: worker_spawned (worker-001)
    Note over W: Execute task
    W->>M: progress_update (30%)
    W->>M: progress_update (65%)
    W->>M: task_completed (success)
    M->>C: outcome_recorded
    Note over C: Update patterns

The same flow, written as JSONL records (one event per line):

{"event":"task_received","task_id":"task-001","timestamp":"2025-11-26T10:00:00Z"}
{"event":"master_assigned","master":"development-master","task_id":"task-001"}
{"event":"worker_spawned","worker_id":"implementation-worker-001","task_id":"task-001"}
{"event":"progress_update","worker_id":"implementation-worker-001","completion":0.3}
{"event":"task_completed","task_id":"task-001","outcome":"success"}

Benefits:

  • Full audit trail
  • Easy debugging
  • Pattern analysis
  • Replay capability
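
As a rough illustration of the replay and debugging benefits, the sketch below parses a JSONL stream and reconstructs the timeline for one task. The helper names are assumptions. Note that `progress_update` events in the sample carry only a `worker_id`, so the sketch uses `worker_spawned` events to map workers back to their task.

```javascript
// Sample stream, copied from the events above.
const stream = [
  '{"event":"task_received","task_id":"task-001","timestamp":"2025-11-26T10:00:00Z"}',
  '{"event":"master_assigned","master":"development-master","task_id":"task-001"}',
  '{"event":"worker_spawned","worker_id":"implementation-worker-001","task_id":"task-001"}',
  '{"event":"progress_update","worker_id":"implementation-worker-001","completion":0.3}',
  '{"event":"task_completed","task_id":"task-001","outcome":"success"}',
].join("\n");

// Parse JSONL: one JSON object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Collect every event that belongs to a task, including events that only
// name a worker spawned for that task.
function eventsForTask(events, taskId) {
  const workers = new Set(
    events
      .filter((e) => e.event === "worker_spawned" && e.task_id === taskId)
      .map((e) => e.worker_id)
  );
  return events.filter((e) => e.task_id === taskId || workers.has(e.worker_id));
}

const timeline = eventsForTask(parseJsonl(stream), "task-001");
console.log(timeline.map((e) => e.event).join(" -> "));
```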

Task Queues

Priority-based task scheduling:

{
  "task_queue": [
    {"task_id": "task-001", "priority": "critical", "age_minutes": 2},
    {"task_id": "task-002", "priority": "high", "age_minutes": 15},
    {"task_id": "task-003", "priority": "medium", "age_minutes": 45}
  ]
}
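
A sketch of one possible ordering rule for this queue: priority first, then oldest first within the same priority. The tie-break on age and the `low` level are assumptions, since the post shows only the three priorities above.

```javascript
// Lower rank runs first. The "low" level is an assumption; the post only
// shows critical, high, and medium.
const PRIORITY_RANK = { critical: 0, high: 1, medium: 2, low: 3 };

// Order by priority, then oldest first within the same priority.
function compareTasks(a, b) {
  const byPriority = PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority];
  return byPriority !== 0 ? byPriority : b.age_minutes - a.age_minutes;
}

const queue = [
  { task_id: "task-003", priority: "medium", age_minutes: 45 },
  { task_id: "task-004", priority: "high", age_minutes: 40 }, // hypothetical extra task
  { task_id: "task-001", priority: "critical", age_minutes: 2 },
  { task_id: "task-002", priority: "high", age_minutes: 15 },
];

// Copy before sorting so the original queue array is left untouched.
const ordered = [...queue].sort(compareTasks).map((task) => task.task_id);
console.log(ordered);
```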

State Management

State is split across several files in the coordination directory:

coordination/
├── worker-pool.json          # Active workers
├── task-queue.json           # Pending tasks
├── master-health.json        # Master status
├── events/                   # Event streams
│   ├── coordinator-events.jsonl
│   ├── development-events.jsonl
│   └── security-events.jsonl
└── memory/
    └── working/
        └── pool-state.json   # Current system state

Scaling Properties

Horizontal Scaling

Add more workers without changing masters:

Before: 3 workers per master
After: 20 workers per master
Change: Zero code changes, just configuration

Vertical Scaling

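Add more masters to cover new domains (this extends capability rather than adding resources to a single process, which is the more common meaning of vertical scaling):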

Initial: 4 masters
Add: Documentation Master
Result: 5 masters, 8 worker types

Load Balancing

Masters automatically balance worker distribution:

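// Spawn a worker only when there is spare capacity and pending work
if (activeWorkers < maxWorkers && taskQueue.length > 0) {
  const nextTask = taskQueue.shift(); // take the task at the head of the queue
  spawnNewWorker(nextTask);
}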

Fault Tolerance

Worker Failures

Workers are disposable by design:

Worker crashes?
→ Master detects timeout
→ Spawns replacement worker
→ Retries task
→ Records failure pattern
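
The recovery steps above can be sketched as a bounded retry loop. This is a simplified, synchronous illustration: `runWorker` is a hypothetical callback that throws when a worker crashes or times out, and the limit of three attempts is an assumption.

```javascript
// Run a task with replacement workers until one succeeds or attempts run out.
// `runWorker` is a hypothetical callback that throws on crash or timeout.
function runWithReplacement(task, runWorker, maxAttempts = 3) {
  const failures = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const workerId = `${task.workerType}-worker-${String(attempt).padStart(3, "0")}`;
    try {
      const result = runWorker(workerId, task);
      return { outcome: "success", workerId, result, failures };
    } catch (error) {
      // Record the failure pattern, then loop to spawn a replacement.
      failures.push({ workerId, reason: error.message });
    }
  }
  return { outcome: "failed", failures };
}

// Simulated worker that times out on its first attempt only.
let calls = 0;
const flakyWorker = (workerId) => {
  calls += 1;
  if (calls === 1) throw new Error("timeout");
  return `done by ${workerId}`;
};

const retryResult = runWithReplacement({ workerType: "fix" }, flakyWorker);
console.log(retryResult);
```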

Master Failures

Masters have heartbeat monitoring:

Master stops responding?
→ Coordinator detects failure
→ Fails over to backup master
→ Reassigns pending tasks
→ Alerts operators
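
Failure detection of this kind usually comes down to comparing each master's last heartbeat with a timeout. The sketch below shows that check in isolation; the 30-second timeout, the heartbeat map shape, and the function name are assumptions, and failover itself is left out.

```javascript
// Return the masters whose last heartbeat is older than the timeout.
// The 30-second default and the record shape are assumptions.
function findUnresponsiveMasters(heartbeats, nowMs, timeoutMs = 30000) {
  return Object.entries(heartbeats)
    .filter(([, lastSeenMs]) => nowMs - lastSeenMs > timeoutMs)
    .map(([master]) => master);
}

const now = Date.parse("2025-11-26T10:05:00Z");
const heartbeats = {
  "development-master": Date.parse("2025-11-26T10:04:50Z"), // 10 s ago
  "security-master": Date.parse("2025-11-26T10:03:00Z"), // 120 s ago
};

const unresponsive = findUnresponsiveMasters(heartbeats, now);
console.log(unresponsive);
```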

Zombie Cleanup

Automated cleanup of stuck processes:

// Zombie cleanup daemon runs every 5 minutes
detectZombies()
  .filter(worker => worker.idle_minutes > 30)
  .forEach(worker => {
    terminateWorker(worker.id);
    logZombieCleanup(worker);
  });
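
The snippet above relies on helpers that are not shown. A self-contained sketch of the detection step, assuming each pool entry stores a `last_activity` timestamp (a hypothetical field name), might look like this. Here `detectZombies` only annotates workers with their idle time, and the 30-minute filter from the snippet above decides which ones to terminate.

```javascript
// Hypothetical worker pool entries; the field names are assumptions.
const workerPool = [
  { id: "scan-worker-001", last_activity: "2025-11-26T10:00:00Z" },
  { id: "fix-worker-002", last_activity: "2025-11-26T10:40:00Z" },
];

// Annotate each worker with how long it has been idle, in whole minutes.
function detectZombies(pool, now) {
  return pool.map((worker) => ({
    ...worker,
    idle_minutes: Math.floor((now - new Date(worker.last_activity)) / 60000),
  }));
}

const terminated = [];
detectZombies(workerPool, new Date("2025-11-26T10:45:00Z"))
  .filter((worker) => worker.idle_minutes > 30)
  .forEach((worker) => terminated.push(worker.id)); // stand-in for terminateWorker

console.log(terminated);
```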

Performance Characteristics

Latency

Task receipt → Worker spawn: < 100ms
Worker spawn → First action: < 500ms
Total task latency: 1-30 minutes (task dependent)

Throughput

Tasks per hour: 20-100 (depending on complexity)
Concurrent workers: up to 20
Master overhead: < 5% CPU per master

Resource Usage

Master process: ~50MB RAM, 1-2% CPU
Worker process: ~100MB RAM, 5-20% CPU (task dependent)
Total system: ~1GB RAM, 15-30% CPU at peak

Real-World Example: Security Audit

Let’s trace a complex multi-master workflow that demonstrates coordination between three masters:

sequenceDiagram
    participant C as Coordinator
    participant SM as Security Master
    participant DM as Development Master
    participant W1 as Scan Workers
    participant W2 as Fix Workers
    participant W3 as Verify Worker

    C->>C: Analyze: High complexity<br/>Domains: Security + Dev
    C->>SM: Route to Security Master

    Note over SM: Step 3: Security Scan
    SM->>W1: Spawn 3 scan workers
    W1->>W1: CVE scan<br/>Static analysis<br/>Dependency audit
    W1-->>SM: 3 findings (45 min)
    SM-->>C: Scan complete, 3 findings

    C->>DM: Handoff to Development Master

    Note over DM: Step 4: Remediation
    DM->>W2: Spawn 3 fix workers
    W2->>W2: Fix CVE-2024-001<br/>Fix CVE-2024-002<br/>Add validation
    W2-->>DM: All issues resolved (90 min)
    DM-->>C: Fixes complete

    C->>SM: Verification handoff

    Note over SM: Step 5: Verification
    SM->>W3: Spawn verification worker
    W3->>W3: Re-scan for CVEs
    W3-->>SM: Clean scan (15 min)
    SM-->>C: Audit complete ✓

    Note over C: Total: 2.5 hours<br/>3 masters, 7 workers<br/>3 findings resolved

Task: “Comprehensive security audit of authentication feature”

Step 1: Coordinator Analysis

{
  "task": "Comprehensive security audit of authentication feature",
  "complexity": "high",
  "domains": ["security", "development"],
  "estimated_duration": "2-4 hours"
}

Step 2: Multi-Master Routing

{
  "primary": "security-master",
  "secondary": "development-master",
  "workflow": "sequential"
}

Step 3: Security-Master Execution

{
  "master": "security-master",
  "workers": [
    "scan-worker-001: CVE scanning",
    "scan-worker-002: Static analysis",
    "scan-worker-003: Dependency audit"
  ],
  "duration": "45 minutes",
  "findings": 3
}

Step 4: Development-Master Remediation

{
  "master": "development-master",
  "workers": [
    "fix-worker-001: Fix CVE-2024-001",
    "fix-worker-002: Fix CVE-2024-002",
    "implementation-worker-003: Add missing validation"
  ],
  "duration": "90 minutes",
  "outcome": "all issues resolved"
}

Step 5: Verification

{
  "master": "security-master",
  "workers": [
    "scan-worker-004: Re-scan for CVEs"
  ],
  "duration": "15 minutes",
  "result": "clean"
}

Total: 2.5 hours, 3 masters, 7 workers, 3 findings resolved (2 CVEs fixed, 1 missing validation added)
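
Because the workflow is sequential, the totals are simply the sums of the three execution steps. A quick check, with the step summaries transcribed from Steps 3 to 5 (the `coordinator-master` identifier is an assumed name for the Coordinator):

```javascript
// Step summaries, copied from Steps 3-5 above.
const steps = [
  { master: "security-master", workers: 3, minutes: 45 },
  { master: "development-master", workers: 3, minutes: 90 },
  { master: "security-master", workers: 1, minutes: 15 },
];

const totalWorkers = steps.reduce((sum, step) => sum + step.workers, 0);
const totalMinutes = steps.reduce((sum, step) => sum + step.minutes, 0);
// The Coordinator is counted alongside the specialist masters it routes to.
const masters = new Set(["coordinator-master", ...steps.map((step) => step.master)]);

console.log(`${totalMinutes / 60} hours, ${masters.size} masters, ${totalWorkers} workers`);
```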

Key Design Decisions

1. JSONL Over Database

Why: Simplicity, append-only writes, easy debugging
Trade-off: No complex queries, but Cortex doesn’t need them

2. File-Based State Over Redis

Why: No external dependencies, easy backup
Trade-off: Slower than in-memory, but fast enough

3. Process-Based Workers Over Threads

Why: Better isolation, easier cleanup
Trade-off: Higher overhead, but more reliable

Tomorrow’s Topic

Tomorrow, I’ll share the day-by-day story of Cortex’s 4-week build - the decisions, challenges, and breakthroughs from idea to production.

Key Takeaways

  1. Master-worker pattern enables scalable distributed orchestration
  2. 5 masters (one coordinator and four specialists) handle routing and the individual domains
  3. 7 worker types execute specific tasks
  4. JSONL events provide full audit trail
  5. Fault tolerance through disposable workers
  6. Capacity scales by adding workers, and domain coverage by adding masters

The master-worker architecture isn’t just a design pattern - it’s the foundation that makes Cortex’s self-improving MoE system possible.

Learn More About Cortex

Want to dive deeper into how Cortex works? Visit the Meet Cortex page to learn about its architecture, capabilities, and how it scales from 1 to 100+ agents on-demand.


Part 4 of the Cortex series. Next: From Idea to Production in 28 Days
