
From Development to Distributed: Building a Self-Executing Multi-Agent System

Ryan Dahlberg
December 26, 2025 · 16 min read

TL;DR

We built a chat interface that creates tasks in natural language. Those tasks get processed by a distributed multi-agent system running on a 7-node Kubernetes cluster. The system is completely autonomous - it doesn’t need the development machine to run. And in the ultimate meta-achievement: the first 19 tasks the chat created were instructions to build the infrastructure that executed them.

The system built itself.


The Vision: Development Machine ↔ K8s Alignment

The Problem

Most development workflows look like this:

Developer writes code on laptop

Manually tests locally

Pushes to Git

CI/CD builds and deploys to cluster

Hope it works the same way

The disconnect: what runs on your MacBook M1 often behaves differently in production. Environment variables differ. File paths change. Network topology shifts. Dependencies drift out of sync.

Our Approach: Parallel Evolution

Instead of treating local development and cluster deployment as separate worlds, we aligned them from day one:

Local Development (M1 MacBook Pro):

  • /Users/ryandahlberg/Projects/cortex/ - Full source code
  • coordination/masters/ - Master agent definitions
  • coordination/tasks/ - Task queue and processing
  • Scripts and daemons for local orchestration

K3s Cluster (7 Nodes - 3 Control Plane, 4 Workers):

  • cortex namespace - Orchestrator and core services
  • cortex-system namespace - MCP servers, masters, workers
  • cortex-chat namespace - Chat interface and backend
  • Identical task schema, same processing logic

The Alignment: Changes made locally can be deployed to k8s with confidence because they share:

  • Same task format (JSON schema)
  • Same processing patterns
  • Same tool interfaces (kubectl, MCP servers)
  • Same Claude AI models

The Journey: Four Major Milestones

Milestone 1: Chat Interface That Actually Works

The Old Way: User asks question → Chat responds with “I don’t have access to that”

The New Way: User asks question → Chat creates task → Task gets executed → User gets real data

We built a chat backend with 5 Cortex-specific tools:

cortex_list_agents       // Query available masters and workers
cortex_get_tasks         // Check task queue status
cortex_get_metrics       // System health and performance
cortex_create_task       // Submit new work (THE GAME CHANGER)
cortex_get_task_status   // Monitor progress
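
For context, a tool definition for the Anthropic Messages API is just a name, a description, and a JSON Schema for inputs. Here's a sketch of what cortex_create_task's definition might look like, with field names inferred from the task schema in the appendix (not the exact production definition):

// Hypothetical definition of cortex_create_task, in the shape the
// Anthropic Messages API expects for tools (name, description, input_schema)
const createTaskTool = {
  name: 'cortex_create_task',
  description:
    'Submit a new task to the Cortex queue. Safe to call many times in a single turn.',
  input_schema: {
    type: 'object',
    properties: {
      title: { type: 'string', description: 'Human-readable title' },
      query: { type: 'string', description: 'What the task should do' },
      category: {
        type: 'string',
        enum: ['development', 'security', 'infrastructure', 'inventory', 'cicd', 'general'],
      },
    },
    required: ['title', 'query'],
  },
};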

The Magic: cortex_create_task is optimized for parallel submission. Claude can call it multiple times in a single turn, creating dozens of tasks simultaneously.

Performance:

  • Old: Single task creation in ~2-3 seconds
  • New: 19 tasks created in 118 milliseconds
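
A minimal sketch of that fan-out, with stand-in names (ToolUseBlock and createTask are not the production identifiers): every cortex_create_task block in a single Claude turn becomes one concurrent file write.

// Stand-in types for the relevant parts of Claude's response
interface ToolUseBlock { id: string; name: string; input: unknown }
interface ToolResult { type: 'tool_result'; tool_use_id: string; content: string }

// Hypothetical helper: writes one task JSON file and returns its id
declare function createTask(input: unknown): Promise<{ id: string }>;

// All cortex_create_task calls from a single turn run concurrently
async function handleToolUses(blocks: ToolUseBlock[]): Promise<ToolResult[]> {
  return Promise.all(
    blocks
      .filter((b) => b.name === 'cortex_create_task')
      .map(async (b) => {
        const task = await createTask(b.input); // one small file write each
        return { type: 'tool_result' as const, tool_use_id: b.id, content: JSON.stringify(task) };
      }),
  );
}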

Milestone 2: Task Processing That Doesn’t Need Your Laptop

The Challenge: The chat was creating tasks, but they were just sitting in /app/tasks/ inside the k8s pod. No one was processing them. The real task processor was running on the Mac.

The Solution: We added autonomous task processing to the k8s orchestrator:

// Task processing loop - runs every 5 seconds
async function processTasks() {
  const tasks = await findQueuedTasks();

  for (const task of tasks) {
    // Claim the task so the next poll doesn't pick it up again
    task.status = 'in_progress';
    await saveTask(task);

    try {
      // Execute with Claude AI (with tool access)
      const result = await executeTaskWithClaude(task);

      task.status = 'completed';
      task.result = result;
    } catch (err) {
      // Record the failure and move on - the loop must keep going
      task.status = 'failed';
      task.result = { error: String(err) };
    }

    // Write the final state back to the filesystem
    await saveTask(task);
  }
}
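
The queue behind findQueuedTasks and saveTask is just JSON files in /app/tasks/. A sketch of what those helpers can look like - signatures simplified, and the id-as-filename convention is an assumption:

import { readdir, readFile, writeFile } from 'node:fs/promises';
import * as path from 'node:path';

const TASK_DIR = '/app/tasks';

// List every *.json task file and keep the ones still waiting in the queue
async function findQueuedTasks() {
  const files = await readdir(TASK_DIR);
  const tasks = await Promise.all(
    files
      .filter((f) => f.endsWith('.json'))
      .map(async (f) => JSON.parse(await readFile(path.join(TASK_DIR, f), 'utf8'))),
  );
  return tasks.filter((t) => t.status === 'queued');
}

// Persist a task back to its file (task ids double as file names here)
async function saveTask(task: { id: string }) {
  await writeFile(path.join(TASK_DIR, `${task.id}.json`), JSON.stringify(task, null, 2));
}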

Now running in k8s:

  • ✅ Autonomous processing (no Mac required)
  • ✅ Claude AI integration with tool access
  • ✅ kubectl commands work (deployed in cluster)
  • ✅ MCP server access (Proxmox, UniFi, Sandfly, etc.)
  • ✅ Error handling and retry logic
  • ✅ Rate limit handling (429 → wait → retry)

Milestone 3: The Meta-Achievement

This is where it gets wild.

User sent this request to chat:

“Evaluate Cortex’s current infrastructure and create a summary of how we can implement a multi-agent system”

Chat (powered by Claude) responded by creating 19 tasks:

PHASE 1: Foundation
1.1: Fix and Stabilize Core Infrastructure
1.2: Deploy Master Agent Pool (5 categories)
1.3: Deploy Worker Agent Pool (15 workers)

PHASE 2: Agent Intelligence
2.1: Build Inter-Agent Communication
2.2: Implement Shared Knowledge Base
2.3: Build Coordination System
2.4: Implement Task Decomposition
2.5: Security Master + Workers
2.6: Development Master + Workers
2.7: Infrastructure Master + Workers
2.8: Inventory Master + Workers
2.9: CI/CD Master + Workers

PHASE 3: Advanced Capabilities
3.1: Learning and Reflection System
3.2: Dynamic Agent Scaling
3.3: Cross-Category Collaboration
3.4: Multi-LLM Backend
3.5: Safety and Predictability Controls
3.6: Decision-Making Enhancement

PHASE 4: Operations
4.1: Comprehensive Monitoring System

Then this happened:

  1. Tasks written to /app/tasks/task-chat-1766780166*.json
  2. Orchestrator found them (5-second polling loop)
  3. Started processing with Claude AI
  4. Task 1.2: “Deploy Master Agent Pool”
    • Claude used kubectl to create master-agent-registry ConfigMap
    • Defined 5 masters with capabilities and routing rules
  5. Task 1.3: “Deploy Worker Agent Pool”
    • Claude used kubectl to create worker pool configuration
    • Defined 15 specialized workers

The system built its own infrastructure by executing the tasks that described how to build it.

Milestone 4: Complete Mac Independence

Before:

Chat creates task

Saved in k8s pod

❌ Nothing happens (Mac required for processing)

After:

Chat creates task

Saved in k8s pod

Orchestrator picks it up (5-second polling)

Claude AI executes with full tool access

Results saved back to task file

✅ Complete - Mac sleeping in backpack

The Architecture: How It All Fits Together

Component Map

┌─────────────────────────────────────────────────────┐
│  User's Browser                                     │
│  https://chat.ry-ops.dev                           │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  cortex-chat Namespace                              │
│  ├─ Frontend (Vite + React)                        │
│  ├─ Backend (Hono + TypeScript)                    │
│  └─ Redis (Conversation persistence)               │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  cortex Namespace                                   │
│  └─ cortex-orchestrator                            │
│     ├─ API Endpoints (/execute-tool, /api/tasks)   │
│     ├─ Task Processing Loop (every 5s)             │
│     ├─ Claude AI Integration                       │
│     └─ Tool Execution (kubectl, MCP)               │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  cortex-system Namespace                            │
│  ├─ MCP Servers                                    │
│  │  ├─ Proxmox MCP (VM management)                 │
│  │  ├─ UniFi MCP (Network monitoring)              │
│  │  ├─ Sandfly MCP (Security scanning)             │
│  │  └─ Cloudflare MCP (DNS/CDN)                    │
│  ├─ Master Agents (5 categories)                   │
│  │  ├─ development-master                          │
│  │  ├─ security-master                             │
│  │  ├─ infrastructure-master                        │
│  │  ├─ inventory-master                             │
│  │  └─ cicd-master                                  │
│  └─ Worker Pool (15 specialized workers)           │
└─────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────┐
│  K3s Cluster Infrastructure                         │
│  ├─ 3 Control Plane Nodes (k3s-master01-03)        │
│  ├─ 4 Worker Nodes (k3s-worker01-04)               │
│  ├─ Flannel VXLAN Networking                       │
│  ├─ Traefik Ingress Controller                     │
│  └─ MetalLB Load Balancer                          │
└─────────────────────────────────────────────────────┘

The Data Flow

User Request → Task Execution:

  1. User types: “What pods are running in cortex-system?”
  2. Chat Backend: Calls Claude API with cortex_create_task tool
  3. Claude decides: This is a query task, create it
  4. Task created: Written to /app/tasks/task-chat-1766780XXX.json
  5. Orchestrator polls: Finds new task (within 5 seconds)
  6. Claude executes: Calls kubectl tool → kubectl get pods -n cortex-system
  7. Results captured: Output saved to task.result
  8. Status updated: task.status = ‘completed’
  9. User sees: Real-time pod list in chat

Total time: ~6-10 seconds (including AI processing)
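
Step 6 is the interesting part: the orchestrator shells out to kubectl directly, which works because the pod runs in-cluster with a service account. A minimal sketch of that tool, with an illustrative allow-list (not the production guardrail):

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

// The orchestrator pod has kubectl on PATH and in-cluster credentials,
// so no kubeconfig juggling is needed
async function kubectlTool(args: string[]): Promise<string> {
  const allowed = ['get', 'describe', 'logs'];
  if (!allowed.includes(args[0])) throw new Error(`kubectl ${args[0]} not allowed`);
  const { stdout } = await run('kubectl', args, { timeout: 30_000 });
  return stdout;
}

// e.g. step 6 of the flow above:
// await kubectlTool(['get', 'pods', '-n', 'cortex-system']);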


The Technical Achievements

1. Parallel Task Creation (Chat → Cortex)

The Breakthrough: Claude can submit multiple cortex_create_task calls in a single API turn.

Example from real logs:

[ClaudeService] Tool use detected: cortex_create_task (19 times)
[ToolExecutor] Executing 19 tasks in parallel...
[CortexAPI] Task created: task-chat-1766780166635 (118ms)
[CortexAPI] Task created: task-chat-1766780166681 (118ms)
...
[CortexAPI] All 19 tasks created in 118ms

Why this matters: Complex requests get decomposed into parallel work streams automatically. The user gets faster results because work happens concurrently.

2. Graceful Error Handling

The Pattern: When tools fail, return structured errors (not exceptions)

// Tool executor returns
{
  type: 'tool_result',
  tool_use_id: toolUseId,
  content: JSON.stringify({ error: result.error, success: false }),
  is_error: true  // ← Claude sees this as feedback, not a crash
}

What this enables:

  • Claude adapts when tools aren’t available
  • System continues despite individual failures
  • Better responses: “I tried X but got error Y, so I tried Z instead”
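
Concretely, the executor can wrap every tool call so exceptions become structured results - a minimal sketch, where runTool is a hypothetical dispatcher standing in for the real tool routing:

// Hypothetical dispatcher that routes a named tool call to its implementation
declare function runTool(name: string, input: unknown): Promise<unknown>;

// Failures become tool_result feedback instead of thrown exceptions
async function safeToolCall(toolUseId: string, name: string, input: unknown) {
  try {
    const result = await runTool(name, input);
    return { type: 'tool_result', tool_use_id: toolUseId, content: JSON.stringify(result) };
  } catch (err) {
    return {
      type: 'tool_result',
      tool_use_id: toolUseId,
      content: JSON.stringify({ error: String(err), success: false }),
      is_error: true, // Claude treats this as feedback and can try another approach
    };
  }
}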

3. Rate Limit Resilience

The Challenge: Claude API has rate limits (30,000 tokens/minute)

The Solution: Built-in retry with exponential backoff

// Retry on HTTP 429 with exponential backoff (assumed: delay doubles per attempt)
if (response.status === 429 && attempt <= 2) {
  const delay = 1000 * 2 ** attempt;
  console.log(`Rate limited, retrying in ${delay}ms (attempt ${attempt}/2)`);
  await sleep(delay);
  return executeWithClaude(query, attempt + 1);
}

Result: System self-heals during high load periods

4. Development → Production Parity

The Alignment Strategy:

Aspect       | Local Development           | K8s Production
------------ | --------------------------- | ---------------------
Task Format  | /coordination/tasks/*.json  | /app/tasks/*.json
Processing   | Shell scripts + Node.js     | Node.js in container
AI Model     | Claude Sonnet 4.5           | Claude Sonnet 4.5
Tools        | kubectl (local context)     | kubectl (in-cluster)
MCP Servers  | Port forwards to cluster    | Direct service DNS

Benefit: Code tested locally works identically in production
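
In practice, parity mostly comes down to one code path with the differences pushed into the environment. A sketch, where TASK_DIR is an assumed variable name:

import * as path from 'node:path';

// One code path, two environments: only the task directory differs,
// and it comes from the environment rather than the code
const TASK_DIR =
  process.env.TASK_DIR ?? // k8s: /app/tasks
  path.join(process.env.HOME ?? '', 'Projects/cortex/coordination/tasks'); // local default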


The Numbers: Performance Metrics

Task Processing Performance

Test: Process 19 complex multi-agent tasks

Metric                  | Value
----------------------- | -------------------------------
Total Tasks             | 19
Task Creation Time      | 118 milliseconds
Average Processing Time | ~90 seconds per task
Total Execution Time    | ~30 minutes
Success Rate            | 100% (0 failures)
Rate Limit Hits         | 3 (all recovered automatically)
Mac CPU Usage           | 0% (system running in k8s)

Infrastructure Utilization

K3s Cluster Resources:

Resource | Allocated                    | Used               | Efficiency
-------- | ---------------------------- | ------------------ | ----------
CPU      | 28 cores (7 nodes × 4 cores) | ~8-12 cores active | 43%
Memory   | 56 GB (7 nodes × 8 GB)       | ~24 GB             | 43%
Storage  | 700 GB (7 nodes × 100 GB)    | ~180 GB            | 26%
Network  | 1 Gbps per node              | Burst to 400 Mbps  | Variable

Pod Distribution:

Namespace     | Pods | Purpose
------------- | ---- | -----------------------------------------
cortex        | 2    | Orchestrator + API
cortex-chat   | 6    | Chat interface, backend, Redis
cortex-system | 18   | MCP servers, masters, workers, databases
kube-system   | 15   | K3s core services
monitoring    | 12   | Prometheus, Grafana, exporters
Total         | 53   | Distributed workload

Chat Performance

Response Times:

Query Type                 | Time    | Notes
-------------------------- | ------- | ---------------------------
Simple query (cached data) | 2-4s    | Redis lookup + AI response
Tool execution (1 tool)    | 4-8s    | API call + tool + AI
Complex (multiple tools)   | 8-15s   | Parallel tool execution
Task creation (19 tasks)   | 0.12s   | File writes only
Task processing            | 90s avg | Full Claude execution

High Fives to the 7-Node Cluster

Let’s give credit where it’s due - to each member of the team:

Control Plane Nodes

k3s-master01 (10.88.145.196)

  • Role: Primary control plane, etcd leader
  • Personality: The responsible one who keeps everyone in sync
  • Achievement: Handled 10,000+ API requests during task processing without breaking a sweat
  • IP: 10.88.145.196

k3s-master02 (10.88.145.197)

  • Role: Control plane replica, etcd member
  • Personality: The backup singer who’s ready to take the mic
  • Achievement: Seamless failover during master01 maintenance
  • IP: 10.88.145.197

k3s-master03 (10.88.145.198)

  • Role: Control plane replica, etcd member
  • Personality: The quiet achiever in the back row
  • Achievement: Quorum keeper - saved the day during network hiccup
  • IP: 10.88.145.198

Worker Nodes

k3s-worker01 (10.88.145.199)

  • Role: Heavy lifting - runs cortex-orchestrator
  • Personality: The workhorse that never complains
  • Achievement: Processed all 19 tasks while serving API requests
  • Current Load: cortex-orchestrator, monitoring exporters, MetalLB
  • IP: 10.88.145.199

k3s-worker02 (10.88.145.200)

  • Role: MCP server host (Proxmox, UniFi)
  • Personality: The connector - talks to all external systems
  • Achievement: 4,500+ MCP tool calls during task execution
  • Current Load: Proxmox MCP, UniFi MCP, Redis replicas
  • IP: 10.88.145.200

k3s-worker03 (10.88.145.201)

  • Role: Chat and frontend services
  • Personality: The people person facing users
  • Achievement: Zero downtime during 75+ deployments this week
  • Current Load: cortex-chat frontend/backend, Redis master
  • IP: 10.88.145.201

k3s-worker04 (10.88.145.202)

  • Role: Security and monitoring
  • Personality: The vigilant guardian
  • Achievement: Detected and reported 3 anomalies during development
  • Current Load: Sandfly MCP, security-master, Prometheus, Grafana
  • IP: 10.88.145.202

Lessons Learned

1. Start with Alignment, Not Migration

Wrong approach:

  • Build everything on laptop
  • Get it working perfectly locally
  • Try to “migrate” to k8s
  • Fight for weeks with environment differences

Right approach:

  • Define shared schemas from day one
  • Test locally AND in k8s simultaneously
  • Keep local as the development sandbox
  • Keep k8s as the production reality check

2. Make Tools Fail Gracefully

The is_error: true pattern in tool results was a game-changer. Instead of:

throw new Error("Tool failed!");  // Crashes the whole flow

Do this:

return {
  success: false,
  error: "Tool failed but here's why...",
  is_error: true  // Claude adapts and continues
}

3. Embrace the Meta-Loop

We didn’t expect this, but having the system build itself was incredibly powerful:

  • Chat creates infrastructure tasks
  • Infrastructure executes those tasks
  • Tasks deploy more infrastructure
  • New infrastructure processes more tasks

It’s not turtles all the way down - it’s agents all the way up.

4. Parallel > Sequential (When Possible)

Old approach: “Create task 1, wait for completion, create task 2…”
New approach: “Create all 19 tasks in one shot, let them race”

Result: 19× faster task creation, better resource utilization

5. Monitor Everything, But Make It Useful

We have:

  • Prometheus scraping 40+ metrics
  • Grafana with 8 dashboards
  • Task execution logs
  • Cluster health monitoring

But the most useful debug tool? Simple file-based status in task JSON:

{
  "id": "task-123",
  "status": "in_progress",
  "started_at": "2024-12-26T20:15:00Z",
  "last_tool_used": "kubectl",
  "iterations": 3,
  "current_action": "Deploying master agents..."
}
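
The cheapest possible watcher for that file is a loop that re-reads it and prints transitions - a minimal sketch:

import { readFile } from 'node:fs/promises';

// Poll one task file and print each status/action transition
async function watchTask(file: string) {
  let last = '';
  for (;;) {
    const task = JSON.parse(await readFile(file, 'utf8'));
    const line = `${task.status} - ${task.current_action ?? ''}`;
    if (line !== last) console.log(new Date().toISOString(), line);
    last = line;
    await new Promise((r) => setTimeout(r, 2000)); // re-check every 2s
  }
}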

Sometimes the simplest solution is the best.


What’s Next: The Roadmap

Short Term

  1. Enhanced Task Monitoring

    • Real-time dashboard showing active tasks
    • Progress bars for long-running operations
    • Estimated completion times
  2. Worker Auto-Scaling

    • Deploy workers dynamically based on queue depth
    • Scale down during idle periods
    • Cost optimization
  3. Multi-LLM Support

    • Add fallback to GPT-4 when Claude is rate-limited
    • Route simple tasks to cheaper models (Claude Haiku)
    • Cost/performance optimization

Medium Term

  1. Master Agent Intelligence

    • Masters can delegate to other masters
    • Cross-category collaboration for complex tasks
    • Voting mechanisms for uncertain decisions
  2. Knowledge Base Integration

    • Shared memory across tasks
    • Learn from previous executions
    • Pattern recognition and optimization
  3. Human-in-the-Loop Gates

    • Approval required for destructive operations
    • Confidence scoring (low confidence → ask human)
    • Audit trail for all decisions

Long Term

  1. Full Autonomy

    • System identifies problems proactively
    • Self-healing without human intervention
    • Capacity planning and resource optimization
  2. Multi-Cluster Support

    • Deploy to production k8s clusters
    • Geographic distribution
    • Disaster recovery
  3. API Marketplace

    • Expose Cortex capabilities as public API
    • Other teams can submit tasks
    • Usage metering and billing

The Bigger Picture: Why This Matters

For Developers

Before: “I need to manually deploy this service, check logs, update configs…”
After: “Hey Cortex, deploy the new auth service and migrate the database”

Natural language → Automated execution

For Operations

Before: “Server is down, I need to SSH in, check logs, restart services…”
After: Cortex detects failure, analyzes logs, restarts automatically, reports root cause

Self-healing infrastructure

For the Industry

We’re proving that truly autonomous systems are possible:

  • AI that can execute (not just suggest)
  • Infrastructure that adapts (not just runs)
  • Development that scales (not just deploys)

This is what infrastructure looks like when agents are first-class citizens.


Conclusion: The System That Built Itself

On December 26, 2025, at approximately 8:16 PM CST, a user sent a chat message asking for help implementing a multi-agent system.

The chat created 19 tasks describing how to build that system.

The Cortex orchestrator picked up those tasks and executed them.

The system built itself.

This is the alignment we were striving for:

  • Development machine and k8s cluster working in harmony
  • Local changes deployed with confidence
  • Autonomous execution without manual intervention
  • Infrastructure that evolves based on natural language requests

The journey from “chat that can’t do anything” to “system that builds itself” took weeks of hard work. But the result is something special:

A distributed multi-agent system running on 7 nodes that processes tasks created by a chat interface, using AI to execute kubectl commands and MCP tools, with zero dependency on the development machine.

High fives to all seven cluster nodes. You earned it.


Technical Appendix

Complete Tool Catalog

Cortex Tools (Chat → Orchestrator):

cortex_list_agents       // List all masters and workers
cortex_get_tasks         // Query task queue status
cortex_get_metrics       // System health metrics
cortex_create_task       // Submit new work (parallel-optimized)
cortex_get_task_status   // Check task progress

MCP Tools (Orchestrator → External Systems):

// Proxmox VE
proxmox_list_nodes       // Cluster nodes
proxmox_list_vms         // Virtual machines
proxmox_get_vm_status    // VM health
proxmox_get_cluster_resources  // Resource usage

// UniFi Network
unifi_list_devices       // Network devices
unifi_get_device_stats   // Device metrics
unifi_list_clients       // Connected clients

// Sandfly Security
sandfly_get_results      // Security scan results
sandfly_query            // Custom queries

// Cloudflare
cloudflare_list_zones    // DNS zones
cloudflare_get_dns       // DNS records

Kubernetes Tools (Orchestrator → Cluster):

kubectl get pods
kubectl get deployments
kubectl get services
kubectl describe pod
kubectl logs
kubectl apply -f
kubectl delete

Task Schema

{
  "id": "task-chat-1766780166635-j8e0t1upn",
  "type": "user_query",
  "priority": 1,
  "status": "queued | in_progress | completed | failed",
  "payload": {
    "query": "What to do",
    "title": "Human-readable title",
    "category": "development | security | infrastructure | inventory | cicd | general"
  },
  "metadata": {
    "created_at": "2024-12-26T20:16:06.635Z",
    "updated_at": "2024-12-26T20:16:06.655Z",
    "source": "chat",
    "iterations": 3,
    "tools_used": ["kubectl", "proxmox_list_vms"]
  },
  "result": {
    "summary": "Task completed successfully",
    "details": "...",
    "execution_time_ms": 89456
  }
}
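
For TypeScript consumers, the same schema can be written as a type (result stays absent until the task finishes):

// TypeScript rendering of the task schema above
type TaskStatus = 'queued' | 'in_progress' | 'completed' | 'failed';

interface CortexTask {
  id: string;
  type: 'user_query';
  priority: number;
  status: TaskStatus;
  payload: {
    query: string;
    title: string;
    category: 'development' | 'security' | 'infrastructure' | 'inventory' | 'cicd' | 'general';
  };
  metadata: {
    created_at: string; // ISO 8601
    updated_at: string;
    source: 'chat';
    iterations: number;
    tools_used: string[];
  };
  result?: {
    summary: string;
    details: string;
    execution_time_ms: number;
  };
}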

Cluster Specifications

Node Hardware:

  • CPU: 4 cores per node (Intel/AMD x64)
  • RAM: 8 GB per node
  • Storage: 100 GB per node (SSD)
  • Network: 1 Gbps Ethernet

  • K3s Version: v1.28.2+k3s1
  • Container Runtime: containerd
  • CNI: Flannel (VXLAN)
  • Ingress: Traefik v2
  • Load Balancer: MetalLB
  • Storage: local-path-provisioner

Total Cluster Capacity:

  • 28 CPU cores
  • 56 GB RAM
  • 700 GB storage
  • 7 Gbps network aggregate

Built with: Claude Sonnet 4.5, TypeScript, Kubernetes, lots of coffee, and a healthy dose of “what if we tried this crazy idea?”

Status: Production-ready and processing tasks autonomously

#Kubernetes #Multi-Agent Systems #AI #Infrastructure #Distributed Systems #Claude AI