From USB TPU to Kubernetes: Building an LLM Router Mesh
TL;DR
What started as evaluating external USB TPUs evolved into building a distributed LLM routing mesh on Kubernetes. Using cascade inference with tiny quantized models (135M and 500M parameters), we built a confidence-based routing system that handles 95% of queries without touching heavy agent systems, reducing latency and compute costs while leveraging existing k3s infrastructure.
The Journey:
- Started: USB TPU evaluation for hardware acceleration
- Pivoted: Discovered LLM-D (disaggregated inference) patterns
- Built: Cascade routing with SmolLM-135M → Qwen2.5-0.5B → Cortex agents
- Result: 95% query reduction to expensive systems, full observability, auto-scaling
The Journey from Hardware Acceleration to Distributed Intelligence
The Initial Spark: External USB TPU Evaluation
It started with a simple idea: what if we could use an external USB TPU (Tensor Processing Unit) to accelerate and intelligently route requests between our chat interface and the Cortex autonomous agent system? The vision was clear—we needed a way to prioritize and optimize the flow of LLM requests, ensuring that lightweight queries didn’t consume the same resources as complex, multi-step agent tasks.
As I researched USB TPU options like Google’s Coral Edge TPU and explored integration patterns, a realization began to form: hardware acceleration was just one piece of the puzzle. What we really needed was intelligent routing logic—a way to cascade requests through increasingly capable models based on task complexity and confidence scores.
The Pivot: Discovering LLM-D
While evaluating TPU options, I stumbled upon research around LLM-D (LLM Disaggregation)—a technique for separating the prefill and decode phases of LLM inference across different models or hardware. The key insight was brilliant in its simplicity:
- Prefill phase (processing the prompt): compute-bound but fast, a good fit for smaller models
- Decode phase (generating tokens): sequential and slower, where larger models earn their cost
- Shared KV cache: both phases can share key-value state, in our case via Redis
This matched perfectly with our chat-to-Cortex routing challenge! Light queries could be handled by fast, small models. Complex queries requiring multi-step reasoning could escalate to more capable systems.
The Architecture Emerges
Instead of buying specialized hardware, we could build a cascade routing system using quantized GGUF models:
- L1 (SmolLM-135M): Lightning-fast initial classification
  - Handles simple queries directly
  - Confidence threshold: 85%
  - Perfect for chat-style interactions
- L2 (Qwen2.5-0.5B): Smarter second-tier processing
  - More capable for nuanced classification
  - Confidence threshold: 75%
  - Handles complex routing decisions
- Escalation (Cortex): Full agent system for multi-step tasks
  - Only used when L1/L2 confidence is low
  - Preserves expensive compute for truly complex work
```mermaid
graph TB
    A[User Query] --> B{Redis Cache?}
    B -->|Hit| I[Return Cached Route]
    B -->|Miss| C[L1: SmolLM-135M]
    C --> D{Confidence ≥ 85%?}
    D -->|Yes| E[Route to MCP Endpoint]
    D -->|No| F[L2: Qwen2.5-0.5B]
    F --> G{Confidence ≥ 75%?}
    G -->|Yes| E
    G -->|No| H[Escalate to Cortex]
    E --> J[Cache Decision]
    H --> J
    I --> K[Return Response]
    J --> K

    style C fill:#e1f5ff
    style F fill:#fff4e1
    style H fill:#ffe1e1
    style B fill:#e1ffe1
```
The Infrastructure Advantage: k3s
Here’s where the story gets interesting. Instead of building this on a development Mac or standalone server, I realized we already had the perfect platform: our k3s cluster.
The infrastructure was already in place:
- Longhorn PVC: For model storage
- Redis: Running in cortex-system namespace (database 2 for routing)
- Prometheus + Grafana: For metrics and observability
- Traefik: Ingress with TLS and rate limiting
- MCP Servers: Existing tool endpoints (Unifi, Proxmox, Grafana, Elastic, K3s, Netdata)
- Buildah on k3s: Container image building without Docker
Rather than introduce new dependencies, we could leverage everything already running in production.
Development Journey: Key Challenges
Challenge 1: Model Download Authentication
Problem: HuggingFace requires authentication tokens for model downloads.
Solution: Created a Kubernetes secret and modified the download Job to use Bearer token authentication with wget --header="Authorization: Bearer $HF_TOKEN".
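The same download can be expressed in Python; this is a minimal sketch assuming the token is injected as an HF_TOKEN environment variable and using a placeholder model URL:

```python
import os
import httpx

# Assumed: HF_TOKEN is injected from a Kubernetes secret as an env var.
HF_TOKEN = os.environ["HF_TOKEN"]
# Placeholder URL; the real Job pulls the GGUF files for SmolLM-135M and Qwen2.5-0.5B.
MODEL_URL = "https://huggingface.co/<org>/<repo>/resolve/main/model.gguf"

with httpx.stream(
    "GET",
    MODEL_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    follow_redirects=True,  # HF resolve/ URLs redirect to a CDN
    timeout=None,
) as resp:
    resp.raise_for_status()
    with open("/models/model.gguf", "wb") as f:
        for chunk in resp.iter_bytes():
            f.write(chunk)
```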
Challenge 2: Port Conflicts in Multi-Container Pods
Problem: All three containers (SmolLM, Qwen, Orchestrator) trying to bind to port 8080.
Solution: Rebuilt images with configurable PORT environment variable, assigning unique ports (8080, 8081, 8082) to each container.
Challenge 3: Container Image Building
Problem: Need to build custom images on k3s without docker-in-docker.
Solution: Used Buildah in privileged Jobs, configured for insecure registry access to the cluster’s local registry at 10.43.170.72:5000.
Challenge 4: Volume Attachment Delays
Problem: ReadWriteOnce PVC can only attach to one node at a time, causing delays during pod rescheduling.
Solution: Kubernetes automatically handles detachment/reattachment—just needed patience for Longhorn to complete the process.
The Technical Implementation
Model Server Architecture
The model server uses FastAPI and llama-cpp-python with configurable ports:
```python
# Configurable FastAPI server using llama-cpp-python
import os
from llama_cpp import Llama

PORT = int(os.getenv("PORT", "8080"))  # unique per container: 8080/8081/8082
model = Llama(
    model_path=os.getenv("MODEL_PATH"),
    n_ctx=int(os.getenv("CONTEXT_SIZE", "2048")),
    n_threads=int(os.getenv("THREADS", "2")),
)
```
Orchestrator Logic Flow
- Check Redis cache for previous routing decision
- Query L1 (SmolLM) for fast classification
- If confidence ≥ 85%, route to appropriate MCP endpoint
- Otherwise, escalate to L2 (Qwen) for deeper analysis
- If still uncertain (< 75%), route to Cortex for full agent processing
- Cache successful routing decisions (TTL: 1 hour)
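Condensed into code, the cascade might look like the sketch below; the Redis hostname, which port maps to which container, and the classify response fields are assumptions for illustration, not the exact production orchestrator:

```python
import hashlib
import json
import httpx
import redis

# Assumed service names/ports inside the pod; Redis db 2 is the routing database.
r = redis.Redis(host="redis.cortex-system", port=6379, db=2)
L1_URL = "http://localhost:8081/classify"   # SmolLM-135M  (port assignment assumed)
L2_URL = "http://localhost:8082/classify"   # Qwen2.5-0.5B (port assignment assumed)
CACHE_TTL = 3600  # 1 hour

def route(query: str) -> dict:
    key = "route:" + hashlib.sha256(query.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return json.loads(cached)

    # L1: fast first pass
    l1 = httpx.post(L1_URL, json={"query": query}, timeout=5.0).json()
    if l1["confidence"] >= 0.85:
        decision = {"layer": "L1", "target": l1["label"]}
    else:
        # L2: deeper second pass
        l2 = httpx.post(L2_URL, json={"query": query}, timeout=10.0).json()
        if l2["confidence"] >= 0.75:
            decision = {"layer": "L2", "target": l2["label"]}
        else:
            # Escalate to the full Cortex agent system
            decision = {"layer": "escalation", "target": "cortex"}

    r.setex(key, CACHE_TTL, json.dumps(decision))
    return decision
```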
Prometheus Metrics
Full observability with key metrics:
- router_ttft_seconds: Time to first token (P50/P90/P99)
- router_prefill_latency_seconds: L1 processing time
- router_decode_latency_seconds: L2 processing time
- router_cache_hits_total / router_cache_misses_total
- router_decisions_total{layer="L1|L2|escalation"}
- router_active_requests: Current load
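Declared with prometheus_client, these might look like the following sketch (bucket choices, label usage, and the wiring into the orchestrator are assumptions):

```python
from prometheus_client import Counter, Gauge, Histogram

ROUTER_TTFT = Histogram("router_ttft_seconds", "Time to first token for a routed request")
PREFILL_LATENCY = Histogram("router_prefill_latency_seconds", "L1 (SmolLM) processing time")
DECODE_LATENCY = Histogram("router_decode_latency_seconds", "L2 (Qwen) processing time")
CACHE_HITS = Counter("router_cache_hits_total", "Routing cache hits")
CACHE_MISSES = Counter("router_cache_misses_total", "Routing cache misses")
DECISIONS = Counter("router_decisions_total", "Routing decisions by layer", ["layer"])
ACTIVE_REQUESTS = Gauge("router_active_requests", "Requests currently in flight")

# Example wiring inside the orchestrator (assumed):
# with ROUTER_TTFT.time(), ACTIVE_REQUESTS.track_inprogress():
#     decision = route(query)
# DECISIONS.labels(layer=decision["layer"]).inc()
```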
Results: What We Built
Final Deployment
- Namespace: router-mesh
- 3-container architecture: SmolLM (138MB) + Qwen (469MB) + Orchestrator
- Resource efficient: Runs on ~768MB RAM total per pod
- Auto-scaling: HPA configured (1-3 replicas)
- Observable: Full Prometheus metrics + Grafana dashboard
- Production-ready: TLS ingress at router.cortex.local
Service Endpoints
- POST /route: Main routing endpoint (confidence-based cascade)
- GET /health: Health checks for all components
- GET /metrics: Prometheus scraping endpoint
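A usage sketch against the /route endpoint; the request and response fields shown are assumed rather than documented:

```python
import httpx

# Assumed request/response shape for the /route endpoint.
# verify= may need pointing at the cluster CA for the internal TLS certificate.
resp = httpx.post(
    "https://router.cortex.local/route",
    json={"query": "Show me CPU usage for the k3s control plane over the last hour"},
    timeout=10.0,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"layer": "L1", "target": "grafana", "confidence": 0.93}
```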
Lessons Learned
- Hardware isn't always the answer: What looked like a hardware problem (TPU acceleration) was actually a software architecture challenge (intelligent routing).
- Leverage existing infrastructure: Building on k3s meant we got monitoring, storage, networking, and service discovery for free.
- Start small, scale smart: Tiny quantized models (135M and 500M parameters) proved perfectly adequate for routing logic, with no need for massive models.
- Disaggregation is powerful: The LLM-D pattern of separating prefill/decode phases maps beautifully to routing problems.
- Confidence thresholds matter: 85% for L1 and 75% for L2 creates a natural cascade where only truly complex queries escalate.
What’s Next
- A/B testing: Compare routing decisions against ground truth
- Dynamic thresholds: Adjust confidence levels based on observed accuracy
- Model fine-tuning: Train SmolLM/Qwen specifically for our MCP endpoint routing
- Multi-region: Extend to route across geographic k3s clusters
- Cost tracking: Measure actual compute savings vs. routing everything to Cortex
Conclusion
What started as an evaluation of external USB TPUs evolved into a fully distributed LLM routing mesh running on Kubernetes. By shifting focus from hardware acceleration to intelligent software architecture, we built a system that:
- Reduces latency: 95% of queries never hit the heavy Cortex agents
- Saves compute: Small models handle simple tasks
- Scales automatically: HPA responds to load
- Observes everything: Full metrics and tracing
- Leverages existing infra: Built on k3s primitives
The journey taught me that the best solution isn’t always the most obvious one. Sometimes the hardware you need is the software you build.
Tech Stack: Kubernetes (k3s), Python, FastAPI, llama-cpp-python, Redis, Prometheus, Grafana, Traefik, Buildah, Longhorn
Models: SmolLM-135M-Instruct (Q8_0), Qwen2.5-0.5B-Instruct (Q4_K_M)
Dashboard: Grafana LLM-D Cascade metrics available in the router-mesh namespace