How I turned Kubernetes alerts into self-healing GitHub PRs — and survived 215 duplicate pull requests along the way.
## The Problem
I run a homelab Kubernetes cluster. Like most homelabbers, I have a love-hate relationship with Docker Hub — it works great until it doesn’t. One day, my nodes lost TLS connectivity to registry-1.docker.io. The cascade was immediate:
- 30 pods stuck in `ErrImagePull` and `ImagePullBackOff`
- Longhorn PVC cleanup helpers couldn’t pull their busybox image, blocking volume deletion
- ArgoCD applications went degraded across six namespaces
- Redis, KEDA, and Longhorn itself started crash-looping
The real problem wasn’t the outage. Outages happen. The problem was that I’d already built a pipeline to route alerts to GitHub as pull requests — and it was working too well. Every 15 minutes, my gitops-observer pipeline dutifully created a new PR for the same ongoing incident. Three days later, I had 215 open pull requests, all saying roughly the same thing.
## The Architecture
The idea behind alert-as-PR is solid: treat infrastructure incidents like code changes. An alert fires, a PR opens with the diagnostic data, GitHub Actions runs remediation, and if everything resolves, the PR auto-merges. You get an audit trail, you get CI/CD integration, and you get the entire GitHub review workflow for free.
Here’s the flow:

```
Cluster alert fires
        |
        v
repository_dispatch to GitHub
        |
        v
create-alert-pr.yml --> opens PR with alert YAML
        |
        v
auto-remediate.yml triggers on PR
        |
        +-- ErrImagePull? --> force-delete stuck pods, patch Longhorn
        +-- ArgoCD degraded? --> sync all degraded apps
        +-- Post resolution summary as PR comment
        |
        v
Auto-merge if fully resolved
```
The alert source can be anything — Alertmanager webhooks, ArgoCD notifications, or a simple cron script that checks `kubectl get pods --field-selector=status.phase!=Running`.
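For the cron-script route, a minimal sketch might look like the following. The repo name, `event_type`, and `client_payload` shape are assumptions, not the pipeline's actual contract; match them to whatever your intake workflow parses.

```shell
#!/bin/sh
# Hypothetical cron alert source. OWNER_REPO and the payload fields below
# are illustrative; adapt them to your intake workflow.
OWNER_REPO="you/git-steer-state"

# Pure helper: wrap a cluster name and pod list into a dispatch payload.
build_payload() {
  printf '{"event_type":"cluster-alert","client_payload":{"cluster":"%s","pods":"%s"}}' "$1" "$2"
}

# Fire a repository_dispatch only if something is actually unhealthy.
fire_alert() {
  bad=$(kubectl get pods -A --field-selector=status.phase!=Running -o name | tr '\n' ' ')
  [ -z "$bad" ] && return 0
  build_payload homelab "$bad" | gh api "repos/$OWNER_REPO/dispatches" --input -
}
```

Run `fire_alert` from cron every few minutes; an authenticated `gh` and kubeconfig access are assumed.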
## What I Built
### The Alert Intake (`create-alert-pr.yml`)
A GitHub Actions workflow triggered by `repository_dispatch`. It parses the incoming alert payload, generates a YAML file in `alerts/incoming/`, creates a branch, and opens a PR with severity labels.
### The Remediator (`auto-remediate.yml`)
Triggers on any PR that touches `alerts/incoming/**`. Three stages:
Stage 1 — Triage: Parses the alert YAML and detects issue types (`ErrImagePull`, ArgoCD degraded, Longhorn PVC, `OOMKilled`) using pattern matching on the alert content.
Stage 2 — Remediate: Runs in parallel based on triage results:
- ImagePull failures: Diagnoses node connectivity, force-deletes stuck Longhorn helper pods, patches the helper image source, restarts Longhorn manager.
- ArgoCD degraded: Logs into ArgoCD, identifies degraded/OutOfSync apps, force-syncs with prune, waits for health.
Stage 3 — Report and close: Posts a resolution summary as a PR comment with a status table. If all remediations succeeded, auto-merges the PR via squash.
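Stage 1's pattern matching can be sketched as a plain shell function. The type labels and patterns here are illustrative, not the workflow's actual code:

```shell
#!/bin/sh
# Hypothetical triage: classify an alert body into issue types by substring
# matching, mirroring the four detectors described above.
triage() {
  alert="$1"
  types=""
  case "$alert" in *ErrImagePull*|*ImagePullBackOff*) types="$types image-pull" ;; esac
  case "$alert" in *[Dd]egraded*)                     types="$types argocd-degraded" ;; esac
  case "$alert" in *[Ll]onghorn*)                     types="$types longhorn-pvc" ;; esac
  case "$alert" in *OOMKilled*)                       types="$types oomkilled" ;; esac
  echo "${types# }"   # trim the leading space
}
```

Each detected type then gates the corresponding remediation job.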
### The Root Cause Fix (Zot Pull-Through Cache)
The real fix for Docker Hub dependency is to never depend on it directly. I deployed Zot — a lightweight, OCI-native registry — as a pull-through cache. It syncs images on-demand from Docker Hub, GHCR, registry.k8s.io, and Quay, then serves them locally. Combined with k3s registry mirror configuration, every image pull routes through Zot first. Docker Hub could go down for a week and my cluster wouldn’t notice.
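The k3s side is a `registries.yaml` mirror entry on each node. A sketch, assuming Zot is reachable at `zot.internal:5000` (the endpoint, and plain HTTP, are assumptions — adjust host, port, and TLS to your deployment):

```shell
# Write the k3s registry mirror config on each node, then restart k3s.
# The Zot endpoint is an assumption; the mirror keys are the upstream
# registries you want routed through the cache.
cat <<'EOF' | sudo tee /etc/rancher/k3s/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "http://zot.internal:5000"
  ghcr.io:
    endpoint:
      - "http://zot.internal:5000"
  registry.k8s.io:
    endpoint:
      - "http://zot.internal:5000"
  quay.io:
    endpoint:
      - "http://zot.internal:5000"
EOF
sudo systemctl restart k3s
```

If Zot is unreachable, containerd falls back to the upstream registry, so the mirror is additive rather than a new single point of failure.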
### Supporting Infrastructure
- ArgoCD notifications config — triggers `repository_dispatch` on app degradation, sync failures, and image pull errors directly from ArgoCD, no external webhook relay needed.
- Custom Longhorn health checks — Lua-based health assessments for Longhorn Volumes, Engines, and Nodes so ArgoCD accurately reports Longhorn state instead of showing false positives.
- TLS diagnostic script — drops a netshoot pod into the cluster to test DNS, TCP, TLS handshakes, certificate chains, and routing to Docker Hub. Outputs an interpretation guide.
- Manual remediation runbook — for when automation fails. Covers clock skew fixes, Longhorn PVC finalizer removal, ArgoCD hard refresh, and Docker Hub credential setup.
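The TLS diagnostic from the list above can be sketched like this. It is a minimal stand-in, not the repo's actual script; the `nicolaka/netshoot` image is real, but the exact checks and the `tls_diag` name are illustrative:

```shell
#!/bin/sh
# Minimal stand-in for the TLS diagnostic: run DNS, TCP, and TLS checks
# against Docker Hub from inside the cluster.
tls_diag() {
  kubectl run tls-diag --rm -i --restart=Never --image=nicolaka/netshoot -- sh -c '
    nslookup registry-1.docker.io
    nc -zvw5 registry-1.docker.io 443
    echo | openssl s_client -connect registry-1.docker.io:443 \
        -servername registry-1.docker.io 2>/dev/null |
      openssl x509 -noout -issuer -dates
  '
}
```

Running the probes from a pod rather than the node matters: it exercises the cluster's own DNS and egress path, which is where the failures actually occur.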
## The 215-PR Problem
Here’s what happens when you build an alert pipeline without deduplication: you get a new PR every 15 minutes for the same incident. For three days straight.
The PRs themselves were fine — great diagnostics, proper formatting, correct severity labels. But 215 of them? That’s noise, not signal. The first PR tells you there’s a problem. PRs 2 through 215 tell you your alert system needs work.
## The Fix: Deduplication
The enhanced `create-alert-pr.yml` now does a simple check before creating anything:
- On alert dispatch, query the GitHub API for open PRs matching the cluster name and alert title pattern.
- If an existing open alert PR is found, update it — change the title timestamp, add a comment with the latest payload.
- Only create a new PR if no active incident PR exists for that cluster.
This means a single ongoing incident produces one PR, updated in-place, no matter how many times the alert fires. When that PR gets merged (manually or via auto-merge after remediation), the next alert will create a fresh PR for the new incident.
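With the gh CLI, that guard can be sketched as follows. The `alert` label, the `[cluster] title` convention, and the `ensure_alert_pr` helper are all assumptions, not the workflow's actual code; the function is defined but not executed here:

```shell
#!/bin/sh
# Hypothetical dedup guard: update an existing open alert PR instead of
# opening a duplicate. Label and title conventions are illustrative.
ensure_alert_pr() {
  cluster="$1"; title="$2"; branch="$3"
  existing=$(gh pr list --search "in:title [$cluster] $title label:alert" \
    --json number --jq '.[0].number // empty')
  if [ -n "$existing" ]; then
    # Active incident: append the latest payload as a comment.
    gh pr comment "$existing" --body "Alert re-fired at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  else
    # No active incident: open a fresh PR from the prepared branch.
    gh pr create --head "$branch" --label alert \
      --title "[$cluster] $title" --body "Automated alert PR"
  fi
}
```

Searching by title pattern rather than exact match lets repeated firings with slightly different payloads still land on the same incident PR.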
## The Cleanup
Closing 215 PRs by hand wasn’t an option. A quick `gh pr list | xargs gh pr close` with some parallelism took care of it. Each one got a comment noting it was superseded, and the associated branches were cleaned up.
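A hedged reconstruction of that cleanup, with the label filter, limit, and comment text as assumptions (it is wrapped in a function so nothing runs on sourcing):

```shell
#!/bin/sh
# Bulk-close open alert PRs with a superseded note, 8 at a time,
# deleting the associated branches as we go.
close_alert_prs() {
  gh pr list --state open --label alert --limit 300 --json number --jq '.[].number' |
    xargs -P 8 -I{} sh -c '
      gh pr comment "$1" --body "Superseded by the deduplicated alert flow."
      gh pr close "$1" --delete-branch
    ' _ {}
}
```

The `-P 8` keeps eight `gh` processes in flight; at 215 PRs the whole sweep takes a couple of minutes instead of an afternoon.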
## Lessons Learned
Deduplication first. Any alert-to-ticket/PR system needs deduplication from day one. It’s easy to think “I’ll add that later” but the accumulation happens fast — 215 PRs in 3 days at 15-minute intervals.
Pull-through caches are non-negotiable. If you run Kubernetes anywhere — homelab, production, edge — deploy a registry cache. Docker Hub rate limits (100 pulls/6hr anonymous) and connectivity issues will bite you eventually. Zot deploys in minutes and eliminates the dependency entirely.
GitHub Actions is a surprisingly good remediation engine. It has secrets management, parallel job execution, conditional logic, and a built-in audit trail. The PR-as-incident pattern gives you review workflows, status checks, and auto-merge for free.
Clock skew causes more TLS failures than expired certificates. The number one cause of `ErrImagePull` with TLS errors in homelabs is node clock drift. A simple NTP check in your diagnostic script saves hours of debugging.
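One quick hedged check, assuming outbound HTTPS is available: compare the node clock against the `Date` header a remote server returns. A rough sanity check, not a substitute for proper NTP:

```shell
#!/bin/sh
# Rough clock-skew check: diff the local clock against Docker Hub's HTTP
# Date header. GNU date assumed (standard on most Linux homelab nodes).
abs_diff() {
  d=$(( $1 - $2 ))
  [ "$d" -lt 0 ] && d=$(( -d ))
  echo "$d"
}

clock_skew_seconds() {
  remote=$(curl -sI https://registry-1.docker.io/v2/ |
    awk -F': ' 'tolower($1)=="date" {print $2}' | tr -d '\r')
  abs_diff "$(date +%s)" "$(date -d "$remote" +%s)"
}
# A skew of minutes is enough for certificate validity checks to fail
# ("certificate is not yet valid"); fix NTP before chasing certs.
```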
## The Stack
| Component | Purpose |
|---|---|
| GitHub Actions | Alert intake, remediation, auto-merge |
| ArgoCD Notifications | Alert source (fires on app degradation) |
| Zot Registry | Pull-through cache for Docker Hub/GHCR/Quay |
| Longhorn | Storage (with custom ArgoCD health checks) |
| k3s | Kubernetes distribution |
## What’s Next
The current system handles the two most common failure modes in my cluster. Next up:
- OOMKill remediation — auto-bump resource limits when pods get OOMKilled repeatedly
- Certificate renewal — detect expiring certs and trigger renewal workflows
- Slack integration — post to a channel when a PR is opened or auto-merged, so I don’t have to watch GitHub
- Metrics dashboard — track alert frequency, mean time to remediation, and auto-merge success rate
The code is all in `git-steer-state` if you want to fork it and adapt it for your own cluster. The only secrets you need are a kubeconfig, an ArgoCD token, and a GitHub PAT.
Built with GitHub Actions, ArgoCD, Zot, and one very patient homelab cluster.