From USB TPU to Kubernetes: Building an LLM Router Mesh
TL;DR
What started as evaluating external USB TPUs evolved into building a distributed LLM routing mesh on Kubernetes. Using cascade inference with tiny quantized models (135M and 500M parameters), we built a confidence-based routing system that handles 95% of queries without touching heavy agent systems, reducing latency and compute costs while leveraging existing k3s infrastructure.
The Journey:
- Started: USB TPU evaluation for hardware acceleration
- Pivoted: Discovered LLM-D (disaggregated inference) patterns
- Built: Cascade routing with SmolLM-135M → Qwen2.5-0.5B → Cortex agents
- Result: 95% query reduction to expensive systems, full observability, auto-scaling
The Journey from Hardware Acceleration to Distributed Intelligence
The Initial Spark: External USB TPU Evaluation
It started with a simple idea: what if we could use an external USB TPU (Tensor Processing Unit) to accelerate and intelligently route requests between our chat interface and the Cortex autonomous agent system? The vision was clear—we needed a way to prioritize and optimize the flow of LLM requests, ensuring that lightweight queries didn’t consume the same resources as complex, multi-step agent tasks.
As I researched USB TPU options like Google’s Coral Edge TPU and explored integration patterns, a realization began to form: hardware acceleration was just one piece of the puzzle. What we really needed was intelligent routing logic—a way to cascade requests through increasingly capable models based on task complexity and confidence scores.
The Pivot: Discovering LLM-D
While evaluating TPU options, I stumbled upon research around LLM-D (LLM Disaggregation)—a technique for separating the prefill and decode phases of LLM inference across different models or hardware. The key insight was brilliant in its simplicity:
- Prefill phase (processing the prompt): compute-bound but fast, a good fit for smaller models
- Decode phase (generating tokens): sequential and slower, where larger models earn their cost
- Shared KV cache: both phases can share key-value state, in our case via Redis
This matched perfectly with our chat-to-Cortex routing challenge! Light queries could be handled by fast, small models. Complex queries requiring multi-step reasoning could escalate to more capable systems.
The Architecture Emerges
Instead of buying specialized hardware, we could build a cascade routing system using quantized GGUF models:
- L1 (SmolLM-135M): Lightning-fast initial classification
  - Handles simple queries directly
  - Confidence threshold: 85%
  - Perfect for chat-style interactions
- L2 (Qwen2.5-0.5B): Smarter second-tier processing
  - More capable for nuanced classification
  - Confidence threshold: 75%
  - Handles complex routing decisions
- Escalation (Cortex): Full agent system for multi-step tasks
  - Only used when L1/L2 confidence is low
  - Preserves expensive compute for truly complex work
```mermaid
graph TB
    A[User Query] --> B{Redis Cache?}
    B -->|Hit| I[Return Cached Route]
    B -->|Miss| C[L1: SmolLM-135M]
    C --> D{Confidence ≥ 85%?}
    D -->|Yes| E[Route to MCP Endpoint]
    D -->|No| F[L2: Qwen2.5-0.5B]
    F --> G{Confidence ≥ 75%?}
    G -->|Yes| E
    G -->|No| H[Escalate to Cortex]
    E --> J[Cache Decision]
    H --> J
    I --> K[Return Response]
    J --> K

    style C fill:#e1f5ff
    style F fill:#fff4e1
    style H fill:#ffe1e1
    style B fill:#e1ffe1
```
The Infrastructure Advantage: k3s
Here’s where the story gets interesting. Instead of building this on a development Mac or standalone server, I realized we already had the perfect platform: our k3s cluster.
The infrastructure was already in place:
- Longhorn PVC: For model storage
- Redis: Running in cortex-system namespace (database 2 for routing)
- Prometheus + Grafana: For metrics and observability
- Traefik: Ingress with TLS and rate limiting
- MCP Servers: Existing tool endpoints (Unifi, Proxmox, Grafana, Elastic, K3s, Netdata)
- Buildah on k3s: Container image building without Docker
Rather than introduce new dependencies, we could leverage everything already running in production.
Development Journey: Key Challenges
Challenge 1: Model Download Authentication
Problem: HuggingFace requires authentication tokens for model downloads.
Solution: Created a Kubernetes secret and modified the download Job to use Bearer token authentication with wget --header="Authorization: Bearer $HF_TOKEN".
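The same download can be expressed in Python; this is a minimal sketch assuming the token is injected as an HF_TOKEN environment variable and using a placeholder model URL:

```python
import os
import httpx

# Assumed: HF_TOKEN is injected from a Kubernetes secret as an env var.
HF_TOKEN = os.environ["HF_TOKEN"]
# Placeholder URL; the real Job pulls the GGUF files for SmolLM-135M and Qwen2.5-0.5B.
MODEL_URL = "https://huggingface.co/<org>/<repo>/resolve/main/model.gguf"

with httpx.stream(
    "GET",
    MODEL_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    follow_redirects=True,  # HF resolve/ URLs redirect to a CDN
    timeout=None,
) as resp:
    resp.raise_for_status()
    with open("/models/model.gguf", "wb") as f:
        for chunk in resp.iter_bytes():
            f.write(chunk)
```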
Challenge 2: Port Conflicts in Multi-Container Pods
Problem: All three containers (SmolLM, Qwen, Orchestrator) trying to bind to port 8080.
Solution: Rebuilt images with configurable PORT environment variable, assigning unique ports (8080, 8081, 8082) to each container.
Challenge 3: Container Image Building
Problem: Need to build custom images on k3s without docker-in-docker.
Solution: Used Buildah in privileged Jobs, configured for insecure registry access to the cluster’s local registry at 10.43.170.72:5000.
Challenge 4: Volume Attachment Delays
Problem: ReadWriteOnce PVC can only attach to one node at a time, causing delays during pod rescheduling.
Solution: Kubernetes automatically handles detachment/reattachment—just needed patience for Longhorn to complete the process.
The Technical Implementation
Model Server Architecture
The model server uses FastAPI and llama-cpp-python with configurable ports:
```python
# Configurable FastAPI server using llama-cpp-python
import os
from llama_cpp import Llama

PORT = int(os.getenv("PORT", "8080"))  # unique per container: 8080/8081/8082
model = Llama(
    model_path=os.getenv("MODEL_PATH"),
    n_ctx=int(os.getenv("CONTEXT_SIZE", "2048")),
    n_threads=int(os.getenv("THREADS", "2")),
)
```
Orchestrator Logic Flow
- Check Redis cache for previous routing decision
- Query L1 (SmolLM) for fast classification
- If confidence ≥ 85%, route to appropriate MCP endpoint
- Otherwise, escalate to L2 (Qwen) for deeper analysis
- If still uncertain (< 75%), route to Cortex for full agent processing
- Cache successful routing decisions (TTL: 1 hour)
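Condensed into code, the cascade might look like the sketch below; the Redis hostname, which port maps to which container, and the classify response fields are assumptions for illustration, not the exact production orchestrator:

```python
import hashlib
import json
import httpx
import redis

# Assumed service names/ports inside the pod; Redis db 2 is the routing database.
r = redis.Redis(host="redis.cortex-system", port=6379, db=2)
L1_URL = "http://localhost:8081/classify"   # SmolLM-135M  (port assignment assumed)
L2_URL = "http://localhost:8082/classify"   # Qwen2.5-0.5B (port assignment assumed)
CACHE_TTL = 3600  # 1 hour

def route(query: str) -> dict:
    key = "route:" + hashlib.sha256(query.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return json.loads(cached)

    # L1: fast first pass
    l1 = httpx.post(L1_URL, json={"query": query}, timeout=5.0).json()
    if l1["confidence"] >= 0.85:
        decision = {"layer": "L1", "target": l1["label"]}
    else:
        # L2: deeper second pass
        l2 = httpx.post(L2_URL, json={"query": query}, timeout=10.0).json()
        if l2["confidence"] >= 0.75:
            decision = {"layer": "L2", "target": l2["label"]}
        else:
            # Escalate to the full Cortex agent system
            decision = {"layer": "escalation", "target": "cortex"}

    r.setex(key, CACHE_TTL, json.dumps(decision))
    return decision
```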
Prometheus Metrics
Full observability with key metrics:
- router_ttft_seconds: Time to first token (P50/P90/P99)
- router_prefill_latency_seconds: L1 processing time
- router_decode_latency_seconds: L2 processing time
- router_cache_hits_total / router_cache_misses_total
- router_decisions_total{layer="L1|L2|escalation"}
- router_active_requests: Current load
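Declared with prometheus_client, these might look like the following sketch (bucket choices, label usage, and the wiring into the orchestrator are assumptions):

```python
from prometheus_client import Counter, Gauge, Histogram

ROUTER_TTFT = Histogram("router_ttft_seconds", "Time to first token for a routed request")
PREFILL_LATENCY = Histogram("router_prefill_latency_seconds", "L1 (SmolLM) processing time")
DECODE_LATENCY = Histogram("router_decode_latency_seconds", "L2 (Qwen) processing time")
CACHE_HITS = Counter("router_cache_hits_total", "Routing cache hits")
CACHE_MISSES = Counter("router_cache_misses_total", "Routing cache misses")
DECISIONS = Counter("router_decisions_total", "Routing decisions by layer", ["layer"])
ACTIVE_REQUESTS = Gauge("router_active_requests", "Requests currently in flight")

# Example wiring inside the orchestrator (assumed):
# with ROUTER_TTFT.time(), ACTIVE_REQUESTS.track_inprogress():
#     decision = route(query)
# DECISIONS.labels(layer=decision["layer"]).inc()
```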
Results: What We Built
Final Deployment
- Namespace: router-mesh
- 3-container architecture: SmolLM (138MB) + Qwen (469MB) + Orchestrator
- Resource efficient: Runs on ~768MB RAM total per pod
- Auto-scaling: HPA configured (1-3 replicas)
- Observable: Full Prometheus metrics + Grafana dashboard
- Production-ready: TLS ingress at router.cortex.local
Service Endpoints
- POST /route: Main routing endpoint (confidence-based cascade)
- GET /health: Health checks for all components
- GET /metrics: Prometheus scraping endpoint
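A usage sketch against the /route endpoint; the request and response fields shown are assumed rather than documented:

```python
import httpx

# Assumed request/response shape for the /route endpoint.
# verify= may need pointing at the cluster CA for the internal TLS certificate.
resp = httpx.post(
    "https://router.cortex.local/route",
    json={"query": "Show me CPU usage for the k3s control plane over the last hour"},
    timeout=10.0,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"layer": "L1", "target": "grafana", "confidence": 0.93}
```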
Lessons Learned
- Hardware isn't always the answer: What looked like a hardware problem (TPU acceleration) was actually a software architecture challenge (intelligent routing).
- Leverage existing infrastructure: Building on k3s meant we got monitoring, storage, networking, and service discovery for free.
- Start small, scale smart: Tiny quantized models (135M and 500M parameters) proved perfectly adequate for routing logic, with no need for massive models.
- Disaggregation is powerful: The LLM-D pattern of separating prefill/decode phases maps beautifully to routing problems.
- Confidence thresholds matter: 85% for L1 and 75% for L2 creates a natural cascade where only truly complex queries escalate.
What’s Next
- A/B testing: Compare routing decisions against ground truth
- Dynamic thresholds: Adjust confidence levels based on observed accuracy
- Model fine-tuning: Train SmolLM/Qwen specifically for our MCP endpoint routing
- Multi-region: Extend to route across geographic k3s clusters
- Cost tracking: Measure actual compute savings vs. routing everything to Cortex
Conclusion
What started as an evaluation of external USB TPUs evolved into a fully distributed LLM routing mesh running on Kubernetes. By shifting focus from hardware acceleration to intelligent software architecture, we built a system that:
- Reduces latency: 95% of queries never hit the heavy Cortex agents
- Saves compute: Small models handle simple tasks
- Scales automatically: HPA responds to load
- Observes everything: Full metrics and tracing
- Leverages existing infra: Built on k3s primitives
The journey taught me that the best solution isn’t always the most obvious one. Sometimes the hardware you need is the software you build.
Tech Stack: Kubernetes (k3s), Python, FastAPI, llama-cpp-python, Redis, Prometheus, Grafana, Traefik, Buildah, Longhorn
Models: SmolLM-135M-Instruct (Q8_0), Qwen2.5-0.5B-Instruct (Q4_K_M)
Dashboard: Grafana LLM-D Cascade metrics available in the router-mesh namespace