There is a moment in every homelab’s lifecycle where the cluster stops being a tool and starts being a museum. You deploy something experimental. It works, so you leave it running. Then you deploy something else next to it. Repeat for two years. Eventually you have 90 deployments spread across 80 namespaces, half of which you cannot explain without checking git blame.

My k3s cluster hit that moment. And the decision was not whether to clean it up — it was how to do it without breaking the things that actually matter.

The Backstory

The cluster started under the cortex-io organization. Cortex was the original moniker for the homelab automation stack — a collection of services for managing infrastructure, running AI pipelines, and orchestrating Kubernetes workloads. It worked. But as the architecture evolved into what became fabric-forge, the old cortex namespaces and deployments stayed behind like furniture from a previous tenant.

The fabric-forge stack is lean by design. fabric-gateway routes all MCP tool calls. fabric-chat handles conversation sessions. fabric-k8s provides cluster introspection. fabric-pipelines runs five CronJob automations. Behind them: Redis, Qdrant, and Postgres for state. That is the entire application layer: four services and three data stores.

The rest of the 90 deployments? Legacy. An internal Docker registry that fabric does not use (it pulls from GHCR). A KEDA HTTP add-on that has been crash-looping for weeks with 1,776 restarts. Linkerd, installed but never wired into fabric. Tekton, replaced by GitHub Actions. n8n, netbox, nginx-proxy-manager, portainer, velero, vpa — all artifacts from an era when the answer to every problem was “deploy another thing.”

What Stays

The infrastructure layer is not the problem. Traefik and MetalLB handle ingress. cert-manager handles TLS. Longhorn handles persistent storage. Prometheus and Grafana handle observability. Sandfly handles security monitoring. Tailscale handles network access. ArgoCD manages the gitops loop. All of these stay.

The fabric stack stays, obviously. Gateway, chat, k8s, pipelines, plus the backing data stores. Everything lives in cortex-system — a namespace name that predates the rebrand but works fine as a home for the active workloads.

There are eight Longhorn PVCs attached to the fabric stack: one for the Redis master and one for each of its three replicas, Qdrant storage, Postgres data, and a pair of volumes for an in-cluster terminal. None of these can be touched during the migration. Losing a PVC means losing data. The cluster migration is about removing what surrounds the data, not what holds it.
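Before anything gets deleted, it helps to snapshot the PVC inventory so every later phase can be checked against it. A minimal sketch, assuming the fabric PVCs all live in cortex-system (the output filename is just a convention):

```shell
# List every PVC the fabric stack depends on, with its backing Longhorn
# volume, requested size, and bind status. Assumes cortex-system holds
# all active workloads, as described above.
kubectl get pvc -n cortex-system \
  -o custom-columns='NAME:.metadata.name,VOLUME:.spec.volumeName,SIZE:.spec.resources.requests.storage,STATUS:.status.phase'

# Save the list so post-phase health checks can diff against it.
kubectl get pvc -n cortex-system -o name > fabric-pvcs.before.txt
```

Any diff against that file after a teardown phase is an immediate stop signal.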

What Goes

The decommission list breaks down by risk.

The lowest-risk removals are the crash-looping services and the obvious orphans. The KEDA HTTP add-on, the internal Docker registry, cortex-orchestrator (the last thing still referencing that internal registry), and the unifi-reasoning-slm deployment that cannot even schedule because it requests 3GB of memory on nodes already running at 89-98% utilization. None of these are connected to anything fabric uses. Deleting them is pure cleanup.

The medium-risk removals are the old cortex namespaces themselves. There are roughly 80 of them. Every workload is backed up in the cortex-io GitHub org, so nothing is truly lost. But the risk is not data loss — it is dependency surprise. A cortex namespace might contain a service that something in the fabric stack quietly depends on. The only way to know is to check, namespace by namespace, before deleting.

The highest-risk removal is Rancher. Not because Rancher itself is important — only the fleet-agent is installed, and fabric does not use it. The risk is that Rancher currently manages sealed-secrets, Traefik, and kube-prometheus-stack through its own Helm lifecycle. Removing Rancher without first re-adopting those charts into ArgoCD means losing management of critical infrastructure. This has to happen last, and only after the cluster has been stable for at least a week with everything else cleaned up.

The Order of Operations

The migration happens in five phases, ordered from safest to riskiest.

Phase zero is the immediate safe cleanup: delete the crash-looping KEDA add-on, create a missing ArgoCD GPG configmap that has been generating 24,000 FailedMount events, suspend the unschedulable workloads, and remove cortex-orchestrator. Everything in phase zero is reversible with a single kubectl apply.
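Phase zero is small enough to express as a handful of commands. A hedged sketch — the resource and namespace names here (the KEDA add-on deployment, the CronJob, the unschedulable deployment's namespace) are placeholders for whatever the cluster actually uses; argocd-gpg-keys-cm is the conventional name ArgoCD expects for its GPG key configmap:

```shell
# Remove the crash-looping KEDA HTTP add-on (release/namespace are assumptions).
helm uninstall keda-add-ons-http -n keda

# Create the missing GPG configmap that has been generating FailedMount events.
kubectl create configmap argocd-gpg-keys-cm -n argocd

# Suspend the unschedulable workload rather than deleting it outright —
# scaling to zero is reversible with a single command.
kubectl scale deployment unifi-reasoning-slm --replicas=0 -n cortex-system

# Remove cortex-orchestrator, the last consumer of the internal registry.
kubectl delete deployment cortex-orchestrator -n cortex-system
```

Every one of these is undone by a scale-up or a re-apply from git, which is what makes it phase zero.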

Phase one is additive — building the execution dispatch layer that fabric pipelines will use to propose and apply changes through git rather than ad-hoc kubectl. A kubectl-apply.yml workflow and a helm-upgrade.yml workflow in the gitops repo, wired into the pipeline system. This is the mechanism that turns fabric from a monitoring system into an operations system. It has to exist before we start removing things, because it is how we recover if something goes wrong.
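The kubectl-apply.yml workflow can be sketched roughly like this. The trigger shape, secret name, and input parameter are all assumptions, not the repo's actual contents:

```yaml
# .github/workflows/kubectl-apply.yml (sketch, not the real workflow)
name: kubectl-apply
on:
  workflow_dispatch:
    inputs:
      manifest:
        description: Path to the manifest to apply, relative to the repo root
        required: true
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Apply manifest against the cluster
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}  # assumed secret name
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > kubeconfig
          KUBECONFIG=kubeconfig kubectl apply -f "${{ github.event.inputs.manifest }}"
```

The point is that the manifest path comes from a git commit and the run leaves an audit trail, which is exactly what ad-hoc kubectl does not.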

Phase two is the cortex namespace teardown. Safest namespaces first — n8n, netbox, portainer, the utility tools. Then the remaining cortex-specific namespaces. After each removal: check ArgoCD, check warning events, verify the fabric stack is healthy.
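The per-namespace loop for phase two can be sketched as follows; the namespace list and the cortex-system check are taken from the plan above, everything else is illustrative:

```shell
# Tear down one legacy namespace at a time, verifying health after each.
for ns in n8n netbox portainer; do
  kubectl delete namespace "$ns" --wait=true

  # Check ArgoCD application health after each removal.
  kubectl get applications -n argocd \
    -o custom-columns='NAME:.metadata.name,HEALTH:.status.health.status'

  # Check for fresh warning events anywhere in the cluster.
  kubectl get events -A --field-selector type=Warning \
    --sort-by=.lastTimestamp | tail -20

  # Verify the fabric stack itself: anything not Running needs a look
  # before the next deletion (Completed job pods will also show here).
  kubectl get pods -n cortex-system --no-headers | grep -v Running || true
done
```

Slow and boring by design — each iteration is a checkpoint, not a batch.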

Phase three is Rancher removal. First, verify ArgoCD has its own copies of every Helm chart Rancher currently manages. Run kubectl get managedchart -A to see the full list. Re-adopt each resource with ArgoCD annotations. Only then uninstall Rancher and clean up its namespaces.
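The re-adoption step can be sketched like this. It assumes ArgoCD's default label-based resource tracking, where adoption means labelling live resources with the owning Application's name; the deployment name and Application name below are illustrative:

```shell
# See everything Rancher's Fleet still manages.
kubectl get managedchart -A

# For each chart, confirm an equivalent ArgoCD Application exists in git,
# then let ArgoCD adopt the live resources. With default label-based
# tracking, that means stamping the instance label on each resource:
kubectl label deployment sealed-secrets-controller -n kube-system \
  app.kubernetes.io/instance=sealed-secrets --overwrite

# Only once every chart is adopted and its Application reports Synced/Healthy:
kubectl delete managedchart -n fleet-local --all
```

If an Application ever shows OutOfSync here, stop — that is the dependency surprise this phase exists to catch.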

Phase four is the final cleanup: remove Linkerd and Tekton. Both are installed but completely unused. Low risk, low reward, but worth doing for a clean kubectl get ns output.

The Risk Calculus

There are four categories of data at risk during this migration.

Chat sessions and embeddings live in git-steer-state and Qdrant. Git is backed up by being git. The Qdrant vectors have no backup — if the PVC dies, we rebuild from the blog content. Inconvenient but not catastrophic.

Redis is a cache. It rebuilds itself on use. Losing it means a cold start, not data loss.

Postgres holds session metadata. No backup exists today. Before any PVC-adjacent operation, we need a backup: either a Longhorn volume snapshot or a pg_dump taken through kubectl exec.
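The logical-dump half of that is a one-liner. The deployment name, database, and user are assumptions:

```shell
# Take a logical backup of the fabric Postgres before any PVC-adjacent work.
# deploy/postgres, user, and database name are placeholders.
kubectl exec -n cortex-system deploy/postgres -- \
  pg_dump -U postgres -d fabric > "fabric-postgres-$(date +%Y%m%d).sql"
```

The Longhorn alternative is a volume snapshot taken through the Longhorn UI or a Snapshot CR, which captures disk state without a logical dump; the pg_dump has the advantage of being restorable anywhere, not just onto a Longhorn volume.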

Sandfly’s Postgres holds scan history. Also no backup, but Sandfly rescans automatically. The history is nice to have, not essential.

The guiding principle is git-steer’s zero-footprint model: state lives in git, GitHub Actions execute changes, the cluster converges through ArgoCD. Every phase of this migration follows that principle. No cowboy kubectl. No manual drift. If it is not in a git commit, it does not happen.

What the Cluster Looks Like After

After the migration, the cluster topology is clean enough to draw on a napkin.

Networking: Traefik, MetalLB, cert-manager, Tailscale. Storage: Longhorn with only the fabric PVCs. Observability: Prometheus, Grafana, KEDA core. Security: Sandfly. GitOps: ArgoCD. Application layer: fabric-gateway, fabric-chat, fabric-k8s, fabric-pipelines, Redis, Qdrant, Postgres.

That is it. No orphaned namespaces. No crash-looping sidecars. No mystery deployments from 2024. A cluster that you can reason about without a spreadsheet.

The migration is not about reducing resource usage, although that will happen. It is about making the cluster honest — making kubectl get all -A reflect what is actually running, what actually matters, and what the gitops repo says should exist. A cluster that matches its own description is a cluster you can trust.

And trust is the prerequisite for everything that comes next.