From Chaos to GitOps: How We Tamed 6,247 Files and Built a Self-Healing Infrastructure
TL;DR
Transformed 6,247 scattered files and 73 manually deployed Kubernetes resources into a fully automated GitOps workflow in 90 minutes. Migrated everything to two repositories (cortex-gitops for infrastructure, cortex-platform for code), configured ArgoCD with 100% auto-sync and self-healing, and achieved zero-drift infrastructure where the cluster automatically reverts manual changes. The result: a complete audit trail, instant rollbacks via git revert, disaster recovery capability, and a cluster that “thunders” in response to commits.
The transformation:
- Before: 6,247 files scattered locally, 73 resources deployed manually, 100% cluster drift, no audit trail
- After: 120 resources in GitOps, 7 ArgoCD applications, 100% auto-sync, self-healing enabled
- Philosophy: “The control plane whispers; the cluster thunders”
The Problem: Development Sprawl
Like many infrastructure projects, Cortex started small. A few scripts here, some Kubernetes manifests there, maybe a quick kubectl apply to test something. Fast forward a few months, and we had accumulated:
- 6,247 files scattered across the local machine
- 73+ deployed resources with no source of truth
- Manifests mixed with application code
- Git repositories nested inside other git repositories
- Documentation, configs, and code all jumbled together
- Zero audit trail for infrastructure changes
- No one knew what was actually running where
Every deployment was manual. Every change was kubectl apply. Every rollback was panic and prayer.
We needed to fix this. Badly.
The Vision: Control Plane vs Data Plane
We adopted a simple mantra:
“The control plane whispers; the cluster thunders.”
This philosophy would guide everything:
- Control Plane (local machine): Plans, writes manifests, commits to Git
- Data Plane (k3s cluster): Pulls from Git, executes workloads, enforces state
No more running code locally. No more manual kubectl apply. The control plane whispers instructions into Git, and the cluster thunders into action.
Phase 1: The Audit
First, we needed to understand what we actually had. We cataloged everything:
Local Machine:
- 688 YAML/YML files
- 563 Python files
- 4,964 JavaScript/TypeScript files
- 32 Dockerfiles
- 3 nested git repositories (!)
K3s Cluster:
- 20 namespaces
- 73 deployed resources
- 0 ArgoCD applications (ArgoCD was installed but completely unused)
- Everything deployed via manual kubectl apply
The verdict: 100% cluster drift. Not a single resource was managed by GitOps.
Phase 2: The Plan
We needed two things:
1. cortex-gitops - Infrastructure as Code
A single source of truth for all Kubernetes manifests. ArgoCD would watch this repository and automatically sync changes to the cluster.
2. cortex-platform - Application Monorepo
All application code, libraries, and services in one place. Build containers from this, push to registry, reference in GitOps manifests.
The separation is critical:
- cortex-gitops: WHAT to deploy (manifests)
- cortex-platform: WHAT to build (code)
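In practice, the two repositories meet at the image reference: a manifest in cortex-gitops points at a container tag built from code in cortex-platform. A minimal sketch of that link (the service name, registry, and tag are illustrative, not our actual values):

# apps/cortex/api-deployment.yaml (illustrative) -- lives in cortex-gitops
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: cortex
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          # Built from cortex-platform/services/api and pushed to a registry;
          # deploying a new version is a one-line change to this tag.
          image: registry.example.internal/cortex/api:v2.1.0

That one-line image-tag diff is exactly the kind of change ArgoCD watches for.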
Phase 3: The Migration
Creating the Repositories
gh repo create ry-ops/cortex-gitops --private
gh repo create ry-ops/cortex-platform --private
Exporting Resources from the Cluster
We built a Python script to cleanly export resources:
import json
import subprocess

import yaml

def clean_manifest(manifest):
    """Remove cluster-specific fields so the manifest can live in Git."""
    if 'metadata' in manifest:
        for field in ['creationTimestamp', 'resourceVersion',
                      'uid', 'generation', 'managedFields']:
            manifest['metadata'].pop(field, None)
    manifest.pop('status', None)
    return manifest

def export_resource(kind, name, namespace, output_file):
    """Fetch a live resource via kubectl and write a cleaned YAML manifest."""
    result = subprocess.run(
        ['kubectl', 'get', kind, name, '-n', namespace, '-o', 'json'],
        capture_output=True, text=True, check=True
    )
    manifest = json.loads(result.stdout)
    manifest = clean_manifest(manifest)
    with open(output_file, 'w') as f:
        yaml.dump(manifest, f, default_flow_style=False)
We started with critical infrastructure:
- Redis (master, replicas, services)
- PostgreSQL (statefulsets)
- MCP servers (all 5 integrations)
- Master agents (coordinator, development, security)
- Queue system
- Chat services
Then expanded to everything else. 120 resources exported, cleaned, and committed to cortex-gitops.
Directory Structure
cortex-gitops/
├── apps/
│ ├── cortex-system/ # 49 resources
│ ├── cortex/ # 16 resources
│ ├── cortex-chat/ # 17 resources
│ ├── cortex-dev/ # 8 resources
│ ├── cortex-cicd/ # 3 resources
│ ├── cortex-security/ # 12 resources
│ └── cortex-knowledge/ # 15 resources
├── argocd-apps/ # 7 Application definitions
└── README.md
Phase 4: ArgoCD Configuration
We created ArgoCD Application manifests for each namespace:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cortex-system
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/ry-ops/cortex-gitops.git
    targetRevision: main
    path: apps/cortex-system
  destination:
    server: https://kubernetes.default.svc
    namespace: cortex-system
  syncPolicy:
    automated:
      prune: true      # Remove deleted resources
      selfHeal: true   # Revert manual changes
Key configuration choices:
- Auto-sync: ArgoCD polls GitHub every 3 minutes (its default reconciliation interval; see the config sketch after this list) and syncs automatically
- Self-heal: Manual kubectl changes are automatically reverted
- Prune: Resources deleted from Git are deleted from the cluster
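The 3-minute cadence isn't anything we configured; it's ArgoCD's default reconciliation timeout, and it can be tuned in the argocd-cm ConfigMap if a tighter loop is ever needed. A sketch (the 60s value is an example, not what we run):

# argocd-cm (optional tweak) -- controls how often ArgoCD polls Git
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Default is 180s; shorter means commits are picked up faster,
  # at the cost of more frequent polling of the Git provider.
  timeout.reconciliation: 60s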
Applied all 7 applications:
kubectl apply -f argocd-apps/
Result:
NAME SYNC STATUS HEALTH STATUS
cortex-system Synced Healthy
cortex-core Synced Healthy
cortex-chat Synced Healthy
cortex-dev Synced Healthy
cortex-cicd Synced Healthy
cortex-security Synced Healthy
cortex-knowledge Synced Healthy
100% synced. GitOps operational.
Phase 5: Code Migration
With infrastructure in cortex-gitops, we migrated all application code to cortex-platform:
cortex-platform/
├── services/
│ ├── mcp-servers/ # Proxmox, UniFi, Cloudflare integrations
│ ├── api/ # API services
│ └── workers/ # Background workers
├── lib/
│ ├── cortex-core/ # Core platform libraries
│ ├── orchestration/ # Orchestration logic
│ └── coordination/ # Agent coordination
├── coordination/
│ ├── masters/ # Master agent configs
│ ├── workers/ # Worker specs
│ └── policies/ # Coordination policies
├── docs/ # Documentation
├── testing/ # Test suites
└── scripts/ # Build scripts
5,476 files migrated. 807,013 lines of code. One commit:
git add .
git commit -m "Initial import: Migrate all Cortex code to platform monorepo"
git push origin main
The New Workflow
Before (Chaos):
# Developer machine
vim some-service.py
docker build -t my-service .
kubectl apply -f manifest.yaml
# Result:
# - No audit trail
# - No version control for deployments
# - Drift everywhere
# - Can't rollback easily
# - No idea what's actually running
After (GitOps):
# 1. Code change
vim ~/cortex-platform/services/api/cache.ts
cd ~/cortex-platform
git add .
git commit -m "Add Redis caching to API"
git push origin main
# 2. (Future: CI/CD builds container automatically)
# 3. Infrastructure change
vim ~/cortex-gitops/apps/cortex/api-deployment.yaml
# Update image tag to new version
cd ~/cortex-gitops
git add .
git commit -m "Deploy API v2.1 with caching"
git push origin main
# 4. ArgoCD syncs automatically (within 3 minutes)
# Cluster pulls changes and deploys
# Result:
# ✅ Full audit trail (Git history)
# ✅ Version controlled infrastructure
# ✅ Easy rollback (git revert)
# ✅ Self-healing (manual changes reverted)
# ✅ Single source of truth
The Architecture
┌─────────────────────────────────────────────────┐
│ CONTROL PLANE (Local Machine) │
│ │
│ • Plans and designs │
│ • Writes manifests │
│ • Commits to Git │
│ • NEVER executes workloads │
│ │
│ "The control plane whispers..." │
└─────────────────────────────────────────────────┘
│
│ Git Push
▼
┌─────────────────────────────────────────────────┐
│ GITHUB REPOSITORIES │
│ │
│ cortex-gitops: 121 YAML manifests │
│ cortex-platform: 5,476 source files │
│ │
│ Single source of truth │
│ Version controlled • Auditable │
└─────────────────────────────────────────────────┘
│
│ ArgoCD Watches
▼
┌─────────────────────────────────────────────────┐
│ ARGOCD (in k3s) │
│ │
│ • Polls GitHub every 3 minutes │
│ • Detects changes automatically │
│ • Syncs to cluster │
│ • Enforces desired state │
│ • Reverts manual changes │
└─────────────────────────────────────────────────┘
│
│ Deploys
▼
┌─────────────────────────────────────────────────┐
│ K3S CLUSTER (7 nodes) │
│ │
│ • 120 resources managed by GitOps │
│ • Auto-sync enabled │
│ • Self-healing active │
│ • All workloads execute here │
│ │
│ "...the cluster thunders." │
└─────────────────────────────────────────────────┘
Benefits Realized
1. Audit Trail
Every infrastructure change is a Git commit with:
- Who made the change
- When it was made
- Why it was made (commit message)
- Exactly what changed (diff)
2. Easy Rollback
# See history
git log --oneline
# Rollback to previous version
git revert <commit-hash>
git push origin main
# ArgoCD syncs the rollback automatically
3. Self-Healing
Someone does kubectl scale deployment my-service --replicas=10? ArgoCD notices within 3 minutes and reverts it to the Git-defined state.
4. No More Drift
The cluster’s state ALWAYS matches Git. If it doesn’t, ArgoCD fixes it automatically.
5. Disaster Recovery
Cluster destroyed?
# Install ArgoCD on the fresh cluster, then point it at cortex-gitops:
kubectl apply -f argocd-apps/
# Everything restores automatically
6. Review Gates (Optional)
# Don't push directly to main
# Use PRs for changes
git checkout -b add-caching
# Make changes
git push origin add-caching
# Create PR, get approval
# Merge triggers ArgoCD sync
Lessons Learned
1. Start with Critical Infrastructure
We didn’t migrate all 120 resources at once. We started with:
- Redis
- PostgreSQL
- Core services
- MCP servers
Got those working, validated the workflow, then expanded.
2. Disable Auto-Sync Initially
We started with manual sync to validate that the manifests were correct. Once confident, we enabled auto-sync across all applications.
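Concretely, manual sync just means leaving the automated block out of syncPolicy and triggering syncs yourself (from the UI or with argocd app sync); switching to the auto-sync policy shown in Phase 4 is then a one-commit change. A sketch, not our exact manifests:

# Validation phase: no automated sync policy
syncPolicy: {}            # sync by hand: argocd app sync cortex-system
# Once validated, switch to:
# syncPolicy:
#   automated:
#     prune: true
#     selfHeal: true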
3. Clean Your Exports
Remove cluster-specific fields from manifests:
- metadata.creationTimestamp
- metadata.resourceVersion
- metadata.uid
- status (entire section)
4. Watch for Secrets
GitHub secret scanning caught an API key in our manifests. Good reminder:
- Don’t commit secrets to Git
- Use Kubernetes Secrets
- Reference secrets in deployments
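For example, the manifest in Git only names the Secret; the Secret itself is created out-of-band (kubectl create secret ...) or managed by a sealed-secrets-style tool. The names below are hypothetical:

# Container spec fragment (hypothetical names) -- no secret value lives in Git
containers:
  - name: cloudflare-mcp
    image: registry.example.internal/cortex/cloudflare-mcp:v1.0.0
    env:
      - name: CLOUDFLARE_API_TOKEN
        valueFrom:
          secretKeyRef:
            name: cloudflare-credentials   # created out-of-band, never committed
            key: api-token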
5. Nested Git Repos Are Pain
Found 3 nested git repositories during migration. Caused issues. Flatten them:
rm -rf nested-repo/.git
git add nested-repo/
Metrics
| Metric | Before | After |
|---|---|---|
| Files tracked | 0 | 6,247 |
| Resources in GitOps | 0 | 120 |
| Namespaces managed | 0 | 7 |
| Manual deploys | 100% | 0% |
| Audit trail | None | Full Git history |
| Rollback capability | Manual panic | git revert |
| Drift detection | None | Automatic |
| Time to production | Hours | Minutes |
The Results
- Time invested: ~90 minutes
- Resources migrated: 120
- Code migrated: 5,476 files
- ArgoCD apps created: 7
- Sync status: 100%
- Auto-sync enabled: 100%
- Self-heal enabled: 100%
From chaos to GitOps in an afternoon.
What’s Next
Immediate:
- Set up CI/CD for cortex-platform (Tekton/GitHub Actions); see the sketch after this list
- Build container images automatically on code push
- Push images to internal registry
- Update GitOps manifests with new image tags
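A rough idea of what that build step could look like as a GitHub Actions workflow; the registry, paths, and tagging scheme are placeholders, not the eventual Cortex pipeline:

# .github/workflows/build-api.yml (sketch) -- build and push on changes to the API service
name: build-api-image
on:
  push:
    branches: [main]
    paths: ["services/api/**"]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        # Assumes registry credentials are already configured on the runner
        run: |
          docker build -t registry.example.internal/cortex/api:${GITHUB_SHA::7} services/api
          docker push registry.example.internal/cortex/api:${GITHUB_SHA::7}
      # Not shown: open a PR against cortex-gitops that bumps the image tag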
Short Term:
- Multi-environment setup (dev/staging/prod branches)
- PR-based workflow with approval gates
- Automated testing in CI pipeline
- Observability for GitOps (dashboards, alerts)
Long Term:
- Progressive delivery (canary, blue-green)
- Multi-cluster GitOps
- Policy enforcement (OPA, Kyverno)
- Automated security scanning
Conclusion
Going from scattered files and manual deploys to a fully automated GitOps workflow wasn’t just about tools. It was about adopting a philosophy:
The control plane whispers; the cluster thunders.
We separated concerns:
- Planning happens locally (whisper)
- Execution happens on the cluster (thunder)
- Git is the bridge between them
ArgoCD enforces this separation. We can’t cheat. We can’t kubectl apply our way out of a problem. Every change goes through Git, gets audited, and can be rolled back.
The cluster is now self-healing, drift-free, and fully automated. Infrastructure changes are as simple as committing to Git. The cluster pulls and deploys automatically.
We took 6,247 files of chaos and turned them into a thundering, self-managing infrastructure platform.
And you know what? It feels good to let the cluster thunder.
Resources
- cortex-gitops: Infrastructure manifests
- cortex-platform: Application code
- ArgoCD: GitOps operator
- K3s: Lightweight Kubernetes (7-node cluster)
- Project directive: CLAUDE.md v2.1.0
About
This transformation was completed as “Project Thunder” - a comprehensive migration from development sprawl to GitOps-controlled infrastructure. The entire process, from audit to full automation, took approximately 90 minutes of focused work.
The result: 120 resources under GitOps control, 7 ArgoCD applications, 100% auto-sync and self-heal enabled, and a cluster that enforces its own desired state.
The control plane whispers; the cluster thunders.