Building a Local LLM Router Mesh for Cortex Orchestration

Ryan Dahlberg · January 3, 2026 · 8 min read

In my Cortex project, I’ve built a growing ecosystem of MCP (Model Context Protocol) servers — UniFi network management, Proxmox virtualization, Grafana dashboards, Elastic observability, and more. The original architecture was simple: every incoming request hits a central orchestrator (Claude or Cortex), which decides where to route it.

Simple, but expensive. And slow.

Every routing decision — even trivial ones like “show me connected clients” — required a round-trip to a cloud LLM. That’s 200-500ms of latency and API tokens burned just to figure out that the request should go to the UniFi MCP server.

The insight that changed everything: most routing decisions are simple classification tasks. You don’t need a 200B parameter model to categorize intent. A tiny local model can handle 80-90% of these decisions in milliseconds.

Reserve the heavy reasoning for requests that actually need it.

The Architecture Shift

Before: Monolithic Cloud Orchestration

Every request, regardless of complexity, followed the same path:

graph TD
    A[User Request] --> B[Claude / Cortex<br/>Cloud API]
    B --> C{Routing Decision}
    C --> D[UniFi MCP]
    C --> E[Proxmox MCP]
    C --> F[Grafana MCP]
    C --> G[Elastic MCP]

    B -.->|"⏱️ 300ms+ latency"| H["💰 100% API cost"]

    style B fill:#30363d,stroke:#cf2e2e,stroke-width:2px
    style H fill:#30363d,stroke:#ff6900,stroke-width:2px,stroke-dasharray: 5 5

This worked fine at low volume. But as request rates climbed, the costs and latency became untenable.

After: Distributed Router Mesh

The new architecture introduces a local routing layer that handles the majority of decisions:

graph TD
    A[User Request] --> B[Local Router Mesh]
    B --> C{L1: SmolLM<br/>135M params}
    C -->|"Confidence ≥ 0.85<br/>(80% of requests)"| D[Route Directly]
    C -->|"Confidence < 0.85"| E{L2: Qwen<br/>500M params}
    E -->|"Confidence ≥ 0.75<br/>(15% of requests)"| D
    E -->|"Confidence < 0.75<br/>(5% of requests)"| F[Cortex/Claude<br/>Full Reasoning]

    D --> G[MCP Servers]
    F --> G

    B -.->|"⏱️ 20-50ms fast path"| H["💰 88% cost reduction"]

    style C fill:#30363d,stroke:#00d084,stroke-width:2px
    style E fill:#30363d,stroke:#ff6900,stroke-width:2px
    style F fill:#30363d,stroke:#cf2e2e,stroke-width:2px
    style H fill:#30363d,stroke:#00d084,stroke-width:2px,stroke-dasharray: 5 5

The request flow:

  1. User sends request
  2. Request hits local router mesh (20-50ms)
  3. L1 model (SmolLM) classifies intent
  4. If confident → route directly to MCP server
  5. If uncertain → escalate to L2 model (Qwen)
  6. If still uncertain → escalate to Cortex for full reasoning

The key insight: fail open, not closed. When local models aren’t confident, they escalate rather than guessing. This preserves accuracy while dramatically improving the common case.
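To make the cascade concrete, here's a minimal sketch of the routing loop in Python. The thresholds come straight from the diagram above; the Classifier shape and route_request function are illustrative, not the actual Cortex code.

from typing import Callable

# A classifier takes the raw message and returns (route, confidence).
Classifier = Callable[[str], tuple[str, float]]

L1_THRESHOLD = 0.85
L2_THRESHOLD = 0.75

def route_request(message: str, l1: Classifier, l2: Classifier) -> str:
    """Return where a request should go: an MCP server name or 'cortex'."""
    route, confidence = l1(message)        # SmolLM, ~20ms
    if confidence >= L1_THRESHOLD:
        return route

    route, confidence = l2(message)        # Qwen, ~50ms
    if confidence >= L2_THRESHOLD:
        return route

    # Fail open: anything the local layers are unsure about escalates to
    # the cloud orchestrator for full reasoning instead of being guessed.
    return "cortex"

In production the real loop also checks the cache first and records metrics, both of which are covered below.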

Model Selection

Choosing the right models for each layer required balancing speed, accuracy, and resource consumption:

graph LR
    subgraph L1["Layer 1: SmolLM-135M"]
        A1["⚡ ~20ms latency"]
        A2["💾 512MB RAM"]
        A3["📊 80% traffic"]
    end

    subgraph L2["Layer 2: Qwen2.5-0.5B"]
        B1["⚡ ~50ms latency"]
        B2["💾 1-2GB RAM"]
        B3["📊 15% traffic"]
    end

    subgraph L3["Layer 3: Cortex/Claude"]
        C1["⚡ ~300ms latency"]
        C2["☁️ Cloud API"]
        C3["📊 5% traffic"]
    end

    L1 -->|"Low confidence"| L2
    L2 -->|"Low confidence"| L3

    style L1 fill:#30363d,stroke:#00d084,stroke-width:2px
    style L2 fill:#30363d,stroke:#ff6900,stroke-width:2px
    style L3 fill:#30363d,stroke:#cf2e2e,stroke-width:2px

L1: SmolLM-135M

The first layer needs to be fast. SmolLM at 135M parameters fits in ~512MB of RAM and delivers 100+ tokens/second on CPU. It handles obvious cases — requests with clear keywords that map directly to a specific MCP server.

L2: Qwen2.5-0.5B

When L1 isn’t confident, we escalate to Qwen. At 500M parameters, it’s still small enough for CPU inference (~50 tokens/second) but significantly more capable at understanding nuanced requests.
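For illustration, here's roughly what the classification call behind each local layer could look like. I'm assuming the model servers expose an OpenAI-compatible /v1/completions endpoint that can return token log probabilities (llama.cpp's server and vLLM both can); the prompt wording, label set, and model name are placeholders rather than the real Cortex prompt.

import math
import requests

ROUTES = ["unifi", "proxmox", "grafana", "elastic", "cortex"]  # illustrative labels

PROMPT = (
    "Classify the request into exactly one route: "
    + ", ".join(ROUTES)
    + ".\nRequest: {message}\nRoute:"
)

def classify(base_url: str, message: str) -> tuple[str, float]:
    """Ask a local model server for a route label and a confidence score."""
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": "local",          # placeholder; whatever the server exposes
            "prompt": PROMPT.format(message=message),
            "max_tokens": 4,
            "temperature": 0,
            "logprobs": 1,             # request per-token log probabilities
        },
        timeout=2,
    )
    resp.raise_for_status()
    choice = resp.json()["choices"][0]
    label = choice["text"].strip().lower()
    token_logprobs = choice["logprobs"]["token_logprobs"]
    if not token_logprobs:
        return "cortex", 0.0           # fail open on anything unexpected
    # One simple confidence estimate: exponentiated mean token logprob.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return (label if label in ROUTES else "cortex"), confidence

Plugging this into the earlier route_request sketch is just functools.partial(classify, "http://localhost:8081") for L1 and the :8082 port for L2.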

L3: Cortex/Claude

Complex multi-step tasks, ambiguous requests, and anything requiring actual reasoning escalates to the cloud. But now this represents only 10-15% of traffic instead of 100%.

The Confidence Cascade

The magic is in the confidence thresholds. Each model returns a classification and a confidence score derived from token log probabilities:

sequenceDiagram
    participant U as User
    participant R as Router
    participant L1 as SmolLM (L1)
    participant L2 as Qwen (L2)
    participant C as Cortex (L3)
    participant M as MCP Server

    Note over U,M: Example 1: Simple Request

    U->>R: "show me connected clients"
    R->>L1: Classify intent
    L1-->>R: route: unifi<br/>confidence: 0.94
    Note over R: ✓ 0.94 ≥ 0.85
    R->>M: Forward to unifi-mcp
    M-->>U: [Client list]

    Note over U,M: Example 2: Complex Request

    U->>R: "check network and spin up test VM"
    R->>L1: Classify intent
    L1-->>R: route: unifi<br/>confidence: 0.62
    Note over R: ✗ 0.62 < 0.85<br/>Escalate to L2
    R->>L2: Classify intent
    L2-->>R: route: cortex<br/>confidence: 0.88
    Note over R: ✓ Multi-step detected
    R->>C: Full reasoning
    C->>M: Execute multi-step workflow
    M-->>U: [Results]

The thresholds aren’t arbitrary — they’re tuned based on observed accuracy:

| Threshold | Effect |
| --- | --- |
| L1 ≥ 0.85 | Route directly, ~95% accuracy |
| L1 < 0.85, L2 ≥ 0.75 | Route via L2, ~92% accuracy |
| L2 < 0.75 | Escalate to Cortex, full reasoning |
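What does "tuned based on observed accuracy" look like in practice? One simple approach is to log each decision's confidence alongside whether the chosen route turned out to be correct, then sweep candidate thresholds. The log format and numbers below are made up for illustration:

# Hypothetical log of (confidence, route_was_correct) pairs for one layer.
decisions_log = [(0.94, True), (0.91, True), (0.88, True), (0.71, False), (0.62, False)]

def coverage_and_accuracy(decisions, threshold):
    """Share of traffic routed directly at this threshold, and its accuracy."""
    accepted = [correct for conf, correct in decisions if conf >= threshold]
    if not accepted:
        return 0.0, 0.0
    return len(accepted) / len(decisions), sum(accepted) / len(accepted)

# Pick the lowest threshold that still meets the accuracy target (~95% for L1).
for t in (0.70, 0.75, 0.80, 0.85, 0.90):
    cov, acc = coverage_and_accuracy(decisions_log, t)
    print(f"threshold={t:.2f} coverage={cov:.0%} accuracy={acc:.0%}")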

Kubernetes Deployment

Running this in my k3s cluster enables horizontal scaling and high availability. Each router pod contains three containers:

graph TB
    subgraph pod["Router Pod (localhost communication)"]
        direction TB
        ORCH["Orchestrator<br/>:8080<br/>512Mi RAM / 1 CPU"]
        SMOL["SmolLM Server<br/>:8081<br/>512Mi RAM / 2 CPU"]
        QWEN["Qwen Server<br/>:8082<br/>2Gi RAM / 4 CPU"]

        ORCH -->|"localhost:8081"| SMOL
        ORCH -->|"localhost:8082"| QWEN
    end

    subgraph volumes["Persistent Volumes"]
        MODELS[("Shared Models<br/>PVC: /models")]
    end

    SVC["Service<br/>router-mesh:8080"] --> ORCH
    SMOL -.-> MODELS
    QWEN -.-> MODELS

    subgraph hpa["Horizontal Pod Autoscaler"]
        HPA["Scale on:<br/>• CPU > 70%<br/>• Latency > 100ms"]
    end

    HPA -.->|"Auto-scale pods"| pod

    style ORCH fill:#30363d,stroke:#58a6ff,stroke-width:2px
    style SMOL fill:#30363d,stroke:#00d084,stroke-width:2px
    style QWEN fill:#30363d,stroke:#ff6900,stroke-width:2px
    style MODELS fill:#30363d,stroke:#9b51e0,stroke-width:2px
    style HPA fill:#30363d,stroke:#58a6ff,stroke-width:2px,stroke-dasharray: 5 5

  1. SmolLM server (port 8081) — L1 classification
  2. Qwen server (port 8082) — L2 classification
  3. Orchestrator (port 8080) — Coordinates the cascade, handles caching

Keeping all three in the same pod eliminates network latency between layers — communication happens over localhost.

Horizontal Pod Autoscaler

The HPA scales based on both CPU utilization and routing latency:

metrics:
  # Scale out when average CPU across router pods exceeds 70%
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Scale out when the per-pod routing latency metric exceeds 100ms
  # ("100m" is Kubernetes quantity notation for 0.1, i.e. 0.1 seconds)
  - type: Pods
    pods:
      metric:
        name: router_latency_seconds
      target:
        type: AverageValue
        averageValue: "100m"

When P95 latency exceeds 100ms, the cluster spins up additional router pods automatically.

Caching Layer

Redis caches routing decisions (not responses) to further reduce latency for repeated queries. The cache key is a hash of the incoming message:

cache_key = f"route:{hashlib.sha256(message.encode()).hexdigest()[:16]}"

With a 1-hour TTL, I’m seeing 40-70% cache hit rates depending on usage patterns. Cached routes return in under 5ms.
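Here's a minimal sketch of the lookup wrapped around the cascade, assuming a stock redis-py client; the key format matches the snippet above and the TTL is the 1-hour one just mentioned.

import hashlib

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
ROUTE_TTL_SECONDS = 3600  # routing decisions expire after an hour

def cached_route(message: str, decide) -> str:
    """Return a cached routing decision, or run the cascade and cache it."""
    key = f"route:{hashlib.sha256(message.encode()).hexdigest()[:16]}"
    hit = r.get(key)
    if hit is not None:
        return hit                         # cache hits come back in a few ms

    route = decide(message)                # e.g. the route_request sketch above
    r.setex(key, ROUTE_TTL_SECONDS, route)
    return route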

Observability

You can’t improve what you can’t measure. The router mesh exports Prometheus metrics for everything:

# Routing distribution by layer
sum(rate(router_decisions_total[5m])) by (route, layer)

# Escalation rate (target: <15%)
sum(rate(router_decisions_total{layer="escalated"}[5m]))
  / sum(rate(router_decisions_total[5m]))

# Cache efficiency (target: >50%)
sum(rate(router_cache_hits_total[5m]))
  / sum(rate(router_decisions_total[5m]))

# P95 latency by layer
histogram_quantile(0.95,
  sum(rate(router_latency_seconds_bucket[5m])) by (le, layer))

A Grafana dashboard visualizes these metrics in real-time, with alerts configured for anomalies.
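The metric names in those queries map directly onto standard Prometheus primitives. Here's a rough sketch of the exporter side, assuming the Python prometheus_client library (the real instrumentation may differ):

from prometheus_client import Counter, Histogram, start_http_server

# Exposed as router_decisions_total{route=...,layer=...}
DECISIONS = Counter("router_decisions", "Routing decisions", ["route", "layer"])
# Exposed as router_cache_hits_total
CACHE_HITS = Counter("router_cache_hits", "Routing cache hits")
# Exposed as router_latency_seconds_bucket / _sum / _count
LATENCY = Histogram("router_latency_seconds", "Routing latency", ["layer"])

def record_decision(route: str, layer: str, seconds: float, cache_hit: bool) -> None:
    """Record one routing decision for the dashboards and the HPA."""
    DECISIONS.labels(route=route, layer=layer).inc()
    LATENCY.labels(layer=layer).observe(seconds)
    if cache_hit:
        CACHE_HITS.inc()

# Serve /metrics for Prometheus to scrape (port is illustrative).
start_http_server(9100)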

Results

After deploying the router mesh to my k3s cluster:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Median routing latency | 280ms | 25ms | 91% faster |
| P95 routing latency | 450ms | 75ms | 83% faster |
| Cloud API calls | 100% of requests | ~12% of requests | 88% reduction |
| Monthly API cost | High | Low | 88% savings |

The system handles significantly higher throughput without breaking a sweat, and the user experience is noticeably snappier.

Lessons Learned

1. Small models are underrated

The AI discourse focuses on ever-larger models, but for classification tasks, tiny models are remarkably capable. SmolLM at 135M parameters correctly routes 80%+ of my traffic.

2. Confidence scores are essential

Without confidence scores, you’re flying blind. The cascade architecture only works because each layer can say “I’m not sure” and defer to the next.

3. Cache routing decisions, not responses

Responses go stale; routing decisions don’t. A request for “show connected clients” should always go to UniFi, regardless of what the current client list looks like.

4. Fail open, not closed

When something goes wrong — model timeout, parsing error, unexpected input — escalate to the smarter model rather than returning an error. Users should never see routing failures.
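In code, that just means treating any local-layer failure like a low-confidence answer. A tiny sketch, using the same hypothetical classifier shape as earlier:

def safe_classify(classifier, message: str) -> tuple[str, float]:
    """Wrap a local layer so timeouts and parse errors escalate, not error out."""
    try:
        return classifier(message)
    except Exception:
        # A failure is indistinguishable from "not sure": return zero
        # confidence so the cascade escalates to the next layer instead.
        return "cortex", 0.0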

5. Observe everything from day one

Retrofitting observability is painful. Instrument your routing layer from the start, even if you’re not sure which metrics will matter.

What’s Next

The router mesh is now a core component of Cortex. Future improvements I’m exploring:

  • Adaptive thresholds: Automatically adjust confidence thresholds based on observed accuracy
  • Request embeddings: Use sentence-transformers for semantic similarity routing
  • Feedback loop: Let MCP servers report routing errors to improve the classifier
  • Fine-tuned models: Train SmolLM specifically on my routing vocabulary

Learn More About Cortex

Want to dive deeper into how Cortex works? Visit the Meet Cortex page to learn about its architecture, capabilities, and how it scales from 1 to 100+ agents on-demand.


Have questions or want to discuss the architecture? Find me on GitHub @ry-ops or check out my other projects at ry-ops.dev.

#LLM #Routing #Kubernetes #Cortex #Performance #CostOptimization