From Chaos to GitOps: How We Tamed 6,247 Files and Built a Self-Healing Infrastructure
TL;DR
Transformed 6,247 scattered files and 73 manually deployed Kubernetes resources into a fully automated GitOps workflow in 90 minutes. Migrated everything to two repositories (cortex-gitops for infrastructure, cortex-platform for code), configured ArgoCD with 100% auto-sync and self-healing, and achieved zero-drift infrastructure where the cluster automatically reverts manual changes. The result: a complete audit trail, instant rollbacks via git revert, disaster recovery capability, and a cluster that “thunders” in response to commits.
The transformation:
- Before: 6,247 files scattered locally, 73 resources deployed manually, 100% cluster drift, no audit trail
- After: 120 resources in GitOps, 7 ArgoCD applications, 100% auto-sync, self-healing enabled
- Philosophy: “The control plane whispers; the cluster thunders”
The Problem: Development Sprawl
Like many infrastructure projects, Cortex started small. A few scripts here, some Kubernetes manifests there, maybe a quick kubectl apply to test something. Fast forward a few months, and we had accumulated:
- 6,247 files scattered across the local machine
- 73+ deployed resources with no source of truth
- Manifests mixed with application code
- Git repositories nested inside other git repositories
- Documentation, configs, and code all jumbled together
- Zero audit trail for infrastructure changes
- No one knew what was actually running where
Every deployment was manual. Every change was kubectl apply. Every rollback was panic and prayer.
We needed to fix this. Badly.
The Vision: Control Plane vs Data Plane
We adopted a simple mantra:
“The control plane whispers; the cluster thunders.”
This philosophy would guide everything:
- Control Plane (local machine): Plans, writes manifests, commits to Git
- Data Plane (k3s cluster): Pulls from Git, executes workloads, enforces state
No more running code locally. No more manual kubectl apply. The control plane whispers instructions into Git, and the cluster thunders into action.
Phase 1: The Audit
First, we needed to understand what we actually had. We cataloged everything:
Local Machine:
- 688 YAML/YML files
- 563 Python files
- 4,964 JavaScript/TypeScript files
- 32 Dockerfiles
- 3 nested git repositories (!)
K3s Cluster:
- 20 namespaces
- 73 deployed resources
- 0 ArgoCD applications (ArgoCD was installed but completely unused)
- Everything deployed via manual kubectl apply
The verdict: 100% cluster drift. Not a single resource was managed by GitOps.
Phase 2: The Plan
We needed two things:
1. cortex-gitops - Infrastructure as Code
A single source of truth for all Kubernetes manifests. ArgoCD would watch this repository and automatically sync changes to the cluster.
2. cortex-platform - Application Monorepo
All application code, libraries, and services in one place. Build containers from this, push to registry, reference in GitOps manifests.
The separation is critical:
- cortex-gitops: WHAT to deploy (manifests)
- cortex-platform: WHAT to build (code)
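In practice, the two repositories meet at the image reference: a manifest in cortex-gitops points at a container tag built from code in cortex-platform. A minimal sketch of that link (the service name, registry, and tag are illustrative, not our actual values):

# apps/cortex/api-deployment.yaml (illustrative) -- lives in cortex-gitops
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: cortex
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          # Built from cortex-platform/services/api and pushed to a registry;
          # deploying a new version is a one-line change to this tag.
          image: registry.example.internal/cortex/api:v2.1.0

That one-line image-tag diff is exactly the kind of change ArgoCD watches for.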
Phase 3: The Migration
Creating the Repositories
gh repo create ry-ops/cortex-gitops --private
gh repo create ry-ops/cortex-platform --private
Exporting Resources from the Cluster
We built a Python script to cleanly export resources:
import json
import subprocess

import yaml

def clean_manifest(manifest):
    """Remove cluster-specific fields so the manifest can live in Git."""
    if 'metadata' in manifest:
        for field in ['creationTimestamp', 'resourceVersion',
                      'uid', 'generation', 'managedFields']:
            manifest['metadata'].pop(field, None)
    manifest.pop('status', None)
    return manifest

def export_resource(kind, name, namespace, output_file):
    """Fetch a live resource via kubectl and write a cleaned YAML manifest."""
    result = subprocess.run(
        ['kubectl', 'get', kind, name, '-n', namespace, '-o', 'json'],
        capture_output=True, text=True, check=True
    )
    manifest = json.loads(result.stdout)
    manifest = clean_manifest(manifest)
    with open(output_file, 'w') as f:
        yaml.dump(manifest, f, default_flow_style=False)
We started with critical infrastructure:
- Redis (master, replicas, services)
- PostgreSQL (statefulsets)
- MCP servers (all 5 integrations)
- Master agents (coordinator, development, security)
- Queue system
- Chat services
Then expanded to everything else. 120 resources exported, cleaned, and committed to cortex-gitops.
Directory Structure
cortex-gitops/
├── apps/
│ ├── cortex-system/ # 49 resources
│ ├── cortex/ # 16 resources
│ ├── cortex-chat/ # 17 resources
│ ├── cortex-dev/ # 8 resources
│ ├── cortex-cicd/ # 3 resources
│ ├── cortex-security/ # 12 resources
│ └── cortex-knowledge/ # 15 resources
├── argocd-apps/ # 7 Application definitions
└── README.md
Phase 4: ArgoCD Configuration
We created ArgoCD Application manifests for each namespace:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cortex-system
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/ry-ops/cortex-gitops.git
    targetRevision: main
    path: apps/cortex-system
  destination:
    server: https://kubernetes.default.svc
    namespace: cortex-system
  syncPolicy:
    automated:
      prune: true      # Remove deleted resources
      selfHeal: true   # Revert manual changes
Key configuration choices:
- Auto-sync: ArgoCD polls GitHub every 3 minutes (its default reconciliation interval; see the config sketch after this list) and syncs automatically
- Self-heal: Manual kubectl changes are automatically reverted
- Prune: Resources deleted from Git are deleted from the cluster
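The 3-minute cadence isn't anything we configured; it's ArgoCD's default reconciliation timeout, and it can be tuned in the argocd-cm ConfigMap if a tighter loop is ever needed. A sketch (the 60s value is an example, not what we run):

# argocd-cm (optional tweak) -- controls how often ArgoCD polls Git
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Default is 180s; shorter means commits are picked up faster,
  # at the cost of more frequent polling of the Git provider.
  timeout.reconciliation: 60s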
Applied all 7 applications:
kubectl apply -f argocd-apps/
Result:
NAME SYNC STATUS HEALTH STATUS
cortex-system Synced Healthy
cortex-core Synced Healthy
cortex-chat Synced Healthy
cortex-dev Synced Healthy
cortex-cicd Synced Healthy
cortex-security Synced Healthy
cortex-knowledge Synced Healthy
100% synced. GitOps operational.
Phase 5: Code Migration
With infrastructure in cortex-gitops, we migrated all application code to cortex-platform:
cortex-platform/
├── services/
│ ├── mcp-servers/ # Proxmox, UniFi, Cloudflare integrations
│ ├── api/ # API services
│ └── workers/ # Background workers
├── lib/
│ ├── cortex-core/ # Core platform libraries
│ ├── orchestration/ # Orchestration logic
│ └── coordination/ # Agent coordination
├── coordination/
│ ├── masters/ # Master agent configs
│ ├── workers/ # Worker specs
│ └── policies/ # Coordination policies
├── docs/ # Documentation
├── testing/ # Test suites
└── scripts/ # Build scripts
5,476 files migrated. 807,013 lines of code. One commit:
git add .
git commit -m "Initial import: Migrate all Cortex code to platform monorepo"
git push origin main
The New Workflow
Before (Chaos):
# Developer machine
vim some-service.py
docker build -t my-service .
kubectl apply -f manifest.yaml
# Result:
# - No audit trail
# - No version control for deployments
# - Drift everywhere
# - Can't rollback easily
# - No idea what's actually running
After (GitOps):
# 1. Code change
vim ~/cortex-platform/services/api/cache.ts
cd ~/cortex-platform
git add .
git commit -m "Add Redis caching to API"
git push origin main
# 2. (Future: CI/CD builds container automatically)
# 3. Infrastructure change
vim ~/cortex-gitops/apps/cortex/api-deployment.yaml
# Update image tag to new version
cd ~/cortex-gitops
git add .
git commit -m "Deploy API v2.1 with caching"
git push origin main
# 4. ArgoCD syncs automatically (within 3 minutes)
# Cluster pulls changes and deploys
# Result:
# ✅ Full audit trail (Git history)
# ✅ Version controlled infrastructure
# ✅ Easy rollback (git revert)
# ✅ Self-healing (manual changes reverted)
# ✅ Single source of truth
The Architecture
┌─────────────────────────────────────────────────┐
│ CONTROL PLANE (Local Machine) │
│ │
│ • Plans and designs │
│ • Writes manifests │
│ • Commits to Git │
│ • NEVER executes workloads │
│ │
│ "The control plane whispers..." │
└─────────────────────────────────────────────────┘
│
│ Git Push
▼
┌─────────────────────────────────────────────────┐
│ GITHUB REPOSITORIES │
│ │
│ cortex-gitops: 121 YAML manifests │
│ cortex-platform: 5,476 source files │
│ │
│ Single source of truth │
│ Version controlled • Auditable │
└─────────────────────────────────────────────────┘
│
│ ArgoCD Watches
▼
┌─────────────────────────────────────────────────┐
│ ARGOCD (in k3s) │
│ │
│ • Polls GitHub every 3 minutes │
│ • Detects changes automatically │
│ • Syncs to cluster │
│ • Enforces desired state │
│ • Reverts manual changes │
└─────────────────────────────────────────────────┘
│
│ Deploys
▼
┌─────────────────────────────────────────────────┐
│ K3S CLUSTER (7 nodes) │
│ │
│ • 120 resources managed by GitOps │
│ • Auto-sync enabled │
│ • Self-healing active │
│ • All workloads execute here │
│ │
│ "...the cluster thunders." │
└─────────────────────────────────────────────────┘
Benefits Realized
1. Audit Trail
Every infrastructure change is a Git commit with:
- Who made the change
- When it was made
- Why it was made (commit message)
- Exactly what changed (diff)
2. Easy Rollback
# See history
git log --oneline
# Rollback to previous version
git revert <commit-hash>
git push origin main
# ArgoCD syncs the rollback automatically
3. Self-Healing
Someone does kubectl scale deployment my-service --replicas=10? ArgoCD notices within 3 minutes and reverts it to the Git-defined state.
4. No More Drift
The cluster’s state ALWAYS matches Git. If it doesn’t, ArgoCD fixes it automatically.
5. Disaster Recovery
Cluster destroyed?
# Install ArgoCD on the fresh cluster, then point it at cortex-gitops:
kubectl apply -f argocd-apps/
# Everything restores automatically
6. Review Gates (Optional)
# Don't push directly to main
# Use PRs for changes
git checkout -b add-caching
# Make changes
git push origin add-caching
# Create PR, get approval
# Merge triggers ArgoCD sync
Lessons Learned
1. Start with Critical Infrastructure
We didn’t migrate all 120 resources at once. We started with:
- Redis
- PostgreSQL
- Core services
- MCP servers
Got those working, validated the workflow, then expanded.
2. Disable Auto-Sync Initially
We started with manual sync to validate that the manifests were correct. Once confident, we enabled auto-sync across all applications.
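Concretely, manual sync just means leaving the automated block out of syncPolicy and triggering syncs yourself (from the UI or with argocd app sync); switching to the auto-sync policy shown in Phase 4 is then a one-commit change. A sketch, not our exact manifests:

# Validation phase: no automated sync policy
syncPolicy: {}            # sync by hand: argocd app sync cortex-system
# Once validated, switch to:
# syncPolicy:
#   automated:
#     prune: true
#     selfHeal: true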
3. Clean Your Exports
Remove cluster-specific fields from manifests:
- metadata.creationTimestamp
- metadata.resourceVersion
- metadata.uid
- status (entire section)
4. Watch for Secrets
GitHub secret scanning caught an API key in our manifests. Good reminder:
- Don’t commit secrets to Git
- Use Kubernetes Secrets
- Reference secrets in deployments
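For example, the manifest in Git only names the Secret; the Secret itself is created out-of-band (kubectl create secret ...) or managed by a sealed-secrets-style tool. The names below are hypothetical:

# Container spec fragment (hypothetical names) -- no secret value lives in Git
containers:
  - name: cloudflare-mcp
    image: registry.example.internal/cortex/cloudflare-mcp:v1.0.0
    env:
      - name: CLOUDFLARE_API_TOKEN
        valueFrom:
          secretKeyRef:
            name: cloudflare-credentials   # created out-of-band, never committed
            key: api-token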
5. Nested Git Repos Are Pain
Found 3 nested git repositories during migration. Caused issues. Flatten them:
rm -rf nested-repo/.git
git add nested-repo/
Metrics
| Metric | Before | After |
|---|---|---|
| Files tracked | 0 | 6,247 |
| Resources in GitOps | 0 | 120 |
| Namespaces managed | 0 | 7 |
| Manual deploys | 100% | 0% |
| Audit trail | None | Full Git history |
| Rollback capability | Manual panic | git revert |
| Drift detection | None | Automatic |
| Time to production | Hours | Minutes |
The Results
- Time invested: ~90 minutes
- Resources migrated: 120
- Code migrated: 5,476 files
- ArgoCD apps created: 7
- Sync status: 100%
- Auto-sync enabled: 100%
- Self-heal enabled: 100%
From chaos to GitOps in an afternoon.
What’s Next
Immediate:
- Set up CI/CD for cortex-platform (Tekton/GitHub Actions); see the sketch after this list
- Build container images automatically on code push
- Push images to internal registry
- Update GitOps manifests with new image tags
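A rough idea of what that build step could look like as a GitHub Actions workflow; the registry, paths, and tagging scheme are placeholders, not the eventual Cortex pipeline:

# .github/workflows/build-api.yml (sketch) -- build and push on changes to the API service
name: build-api-image
on:
  push:
    branches: [main]
    paths: ["services/api/**"]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        # Assumes registry credentials are already configured on the runner
        run: |
          docker build -t registry.example.internal/cortex/api:${GITHUB_SHA::7} services/api
          docker push registry.example.internal/cortex/api:${GITHUB_SHA::7}
      # Not shown: open a PR against cortex-gitops that bumps the image tag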
Short Term:
- Multi-environment setup (dev/staging/prod branches)
- PR-based workflow with approval gates
- Automated testing in CI pipeline
- Observability for GitOps (dashboards, alerts)
Long Term:
- Progressive delivery (canary, blue-green)
- Multi-cluster GitOps
- Policy enforcement (OPA, Kyverno)
- Automated security scanning
Conclusion
Going from scattered files and manual deploys to a fully automated GitOps workflow wasn’t just about tools. It was about adopting a philosophy:
The control plane whispers; the cluster thunders.
We separated concerns:
- Planning happens locally (whisper)
- Execution happens on the cluster (thunder)
- Git is the bridge between them
ArgoCD enforces this separation. We can’t cheat. We can’t kubectl apply our way out of a problem. Every change goes through Git, gets audited, and can be rolled back.
The cluster is now self-healing, drift-free, and fully automated. Infrastructure changes are as simple as committing to Git. The cluster pulls and deploys automatically.
We took 6,247 files of chaos and turned them into a thundering, self-managing infrastructure platform.
And you know what? It feels good to let the cluster thunder.
Resources
- cortex-gitops: Infrastructure manifests
- cortex-platform: Application code
- ArgoCD: GitOps operator
- K3s: Lightweight Kubernetes (7-node cluster)
- Project directive: CLAUDE.md v2.1.0
About
This transformation was completed as “Project Thunder” - a comprehensive migration from development sprawl to GitOps-controlled infrastructure. The entire process, from audit to full automation, took approximately 90 minutes of focused work.
The result: 120 resources under GitOps control, 7 ArgoCD applications, 100% auto-sync and self-heal enabled, and a cluster that enforces its own desired state.
The control plane whispers; the cluster thunders.