How I turned Kubernetes alerts into self-healing GitHub PRs — and survived 215 duplicate pull requests along the way.
## The Problem
I run a homelab Kubernetes cluster. Like most homelabbers, I have a love-hate relationship with Docker Hub — it works great until it doesn’t. One day, my nodes lost TLS connectivity to registry-1.docker.io. The cascade was immediate:
- 30 pods stuck in `ErrImagePull` and `ImagePullBackOff`
- Longhorn PVC cleanup helpers couldn’t pull their busybox image, blocking volume deletion
- ArgoCD applications went degraded across six namespaces
- Redis, KEDA, and Longhorn itself started crash-looping
The real problem wasn’t the outage. Outages happen. The problem was that I’d already built a pipeline to route alerts to GitHub as pull requests — and it was working too well. Every 15 minutes, my gitops-observer pipeline dutifully created a new PR for the same ongoing incident. Three days later, I had 215 open pull requests, all saying roughly the same thing.
## The Architecture
The idea behind alert-as-PR is solid: treat infrastructure incidents like code changes. An alert fires, a PR opens with the diagnostic data, GitHub Actions runs remediation, and if everything resolves, the PR auto-merges. You get an audit trail, you get CI/CD integration, and you get the entire GitHub review workflow for free.
Here’s the flow:

```
Cluster alert fires
        |
        v
repository_dispatch to GitHub
        |
        v
create-alert-pr.yml --> opens PR with alert YAML
        |
        v
auto-remediate.yml triggers on PR
        |
        +-- ErrImagePull? --> force-delete stuck pods, patch Longhorn
        +-- ArgoCD degraded? --> sync all degraded apps
        +-- Post resolution summary as PR comment
        |
        v
Auto-merge if fully resolved
```
The alert source can be anything — Alertmanager webhooks, ArgoCD notifications, or a simple cron script that checks `kubectl get pods --field-selector=status.phase!=Running`.
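For the cron-script route, a minimal sketch might look like the following. The repo name, `event_type`, and `client_payload` shape are assumptions, not the pipeline's actual contract; match them to whatever your intake workflow parses.

```shell
#!/bin/sh
# Hypothetical cron alert source. OWNER_REPO and the payload fields below
# are illustrative; adapt them to your intake workflow.
OWNER_REPO="you/git-steer-state"

# Pure helper: wrap a cluster name and pod list into a dispatch payload.
build_payload() {
  printf '{"event_type":"cluster-alert","client_payload":{"cluster":"%s","pods":"%s"}}' "$1" "$2"
}

# Fire a repository_dispatch only if something is actually unhealthy.
fire_alert() {
  bad=$(kubectl get pods -A --field-selector=status.phase!=Running -o name | tr '\n' ' ')
  [ -z "$bad" ] && return 0
  build_payload homelab "$bad" | gh api "repos/$OWNER_REPO/dispatches" --input -
}
```

Run `fire_alert` from cron every few minutes; an authenticated `gh` and kubeconfig access are assumed.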
## What I Built
### The Alert Intake (`create-alert-pr.yml`)
A GitHub Actions workflow triggered by `repository_dispatch`. It parses the incoming alert payload, generates a YAML file in `alerts/incoming/`, creates a branch, and opens a PR with severity labels.
### The Remediator (`auto-remediate.yml`)
Triggers on any PR that touches `alerts/incoming/**`. Three stages:
Stage 1 — Triage: Parses the alert YAML and detects issue types (`ErrImagePull`, ArgoCD degraded, Longhorn PVC, `OOMKilled`) using pattern matching on the alert content.
Stage 2 — Remediate: Runs in parallel based on triage results:
- ImagePull failures: Diagnoses node connectivity, force-deletes stuck Longhorn helper pods, patches the helper image source, restarts Longhorn manager.
- ArgoCD degraded: Logs into ArgoCD, identifies degraded/OutOfSync apps, force-syncs with prune, waits for health.
Stage 3 — Report and close: Posts a resolution summary as a PR comment with a status table. If all remediations succeeded, auto-merges the PR via squash.
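Stage 1's pattern matching can be sketched as a plain shell function. The type labels and patterns here are illustrative, not the workflow's actual code:

```shell
#!/bin/sh
# Hypothetical triage: classify an alert body into issue types by substring
# matching, mirroring the four detectors described above.
triage() {
  alert="$1"
  types=""
  case "$alert" in *ErrImagePull*|*ImagePullBackOff*) types="$types image-pull" ;; esac
  case "$alert" in *[Dd]egraded*)                     types="$types argocd-degraded" ;; esac
  case "$alert" in *[Ll]onghorn*)                     types="$types longhorn-pvc" ;; esac
  case "$alert" in *OOMKilled*)                       types="$types oomkilled" ;; esac
  echo "${types# }"   # trim the leading space
}
```

Each detected type then gates the corresponding remediation job.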
### The Root Cause Fix (Zot Pull-Through Cache)
The real fix for Docker Hub dependency is to never depend on it directly. I deployed Zot — a lightweight, OCI-native registry — as a pull-through cache. It syncs images on-demand from Docker Hub, GHCR, registry.k8s.io, and Quay, then serves them locally. Combined with k3s registry mirror configuration, every image pull routes through Zot first. Docker Hub could go down for a week and my cluster wouldn’t notice.
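The k3s side is a `registries.yaml` mirror entry on each node. A sketch, assuming Zot is reachable at `zot.internal:5000` (the endpoint, and plain HTTP, are assumptions — adjust host, port, and TLS to your deployment):

```shell
# Write the k3s registry mirror config on each node, then restart k3s.
# The Zot endpoint is an assumption; the mirror keys are the upstream
# registries you want routed through the cache.
cat <<'EOF' | sudo tee /etc/rancher/k3s/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "http://zot.internal:5000"
  ghcr.io:
    endpoint:
      - "http://zot.internal:5000"
  registry.k8s.io:
    endpoint:
      - "http://zot.internal:5000"
  quay.io:
    endpoint:
      - "http://zot.internal:5000"
EOF
sudo systemctl restart k3s
```

If Zot is unreachable, containerd falls back to the upstream registry, so the mirror is additive rather than a new single point of failure.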
### Supporting Infrastructure
- ArgoCD notifications config — triggers `repository_dispatch` on app degradation, sync failures, and image pull errors directly from ArgoCD, no external webhook relay needed.
- Custom Longhorn health checks — Lua-based health assessments for Longhorn Volumes, Engines, and Nodes so ArgoCD accurately reports Longhorn state instead of showing false positives.
- TLS diagnostic script — drops a netshoot pod into the cluster to test DNS, TCP, TLS handshakes, certificate chains, and routing to Docker Hub. Outputs an interpretation guide.
- Manual remediation runbook — for when automation fails. Covers clock skew fixes, Longhorn PVC finalizer removal, ArgoCD hard refresh, and Docker Hub credential setup.
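The TLS diagnostic from the list above can be sketched like this. It is a minimal stand-in, not the repo's actual script; the `nicolaka/netshoot` image is real, but the exact checks and the `tls_diag` name are illustrative:

```shell
#!/bin/sh
# Minimal stand-in for the TLS diagnostic: run DNS, TCP, and TLS checks
# against Docker Hub from inside the cluster.
tls_diag() {
  kubectl run tls-diag --rm -i --restart=Never --image=nicolaka/netshoot -- sh -c '
    nslookup registry-1.docker.io
    nc -zvw5 registry-1.docker.io 443
    echo | openssl s_client -connect registry-1.docker.io:443 \
        -servername registry-1.docker.io 2>/dev/null |
      openssl x509 -noout -issuer -dates
  '
}
```

Running the probes from a pod rather than the node matters: it exercises the cluster's own DNS and egress path, which is where the failures actually occur.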
## The 215-PR Problem
Here’s what happens when you build an alert pipeline without deduplication: you get a new PR every 15 minutes for the same incident. For three days straight.
The PRs themselves were fine — great diagnostics, proper formatting, correct severity labels. But 215 of them? That’s noise, not signal. The first PR tells you there’s a problem. PRs 2 through 215 tell you your alert system needs work.
## The Fix: Deduplication
The enhanced `create-alert-pr.yml` now does a simple check before creating anything:
- On alert dispatch, query the GitHub API for open PRs matching the cluster name and alert title pattern.
- If an existing open alert PR is found, update it — change the title timestamp, add a comment with the latest payload.
- Only create a new PR if no active incident PR exists for that cluster.
This means a single ongoing incident produces one PR, updated in-place, no matter how many times the alert fires. When that PR gets merged (manually or via auto-merge after remediation), the next alert will create a fresh PR for the new incident.
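With the gh CLI, that guard can be sketched as follows. The `alert` label, the `[cluster] title` convention, and the `ensure_alert_pr` helper are all assumptions, not the workflow's actual code; the function is defined but not executed here:

```shell
#!/bin/sh
# Hypothetical dedup guard: update an existing open alert PR instead of
# opening a duplicate. Label and title conventions are illustrative.
ensure_alert_pr() {
  cluster="$1"; title="$2"; branch="$3"
  existing=$(gh pr list --search "in:title [$cluster] $title label:alert" \
    --json number --jq '.[0].number // empty')
  if [ -n "$existing" ]; then
    # Active incident: append the latest payload as a comment.
    gh pr comment "$existing" --body "Alert re-fired at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  else
    # No active incident: open a fresh PR from the prepared branch.
    gh pr create --head "$branch" --label alert \
      --title "[$cluster] $title" --body "Automated alert PR"
  fi
}
```

Searching by title pattern rather than exact match lets repeated firings with slightly different payloads still land on the same incident PR.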
## The Cleanup
Closing 215 PRs by hand wasn’t an option. A quick `gh pr list | xargs gh pr close` with some parallelism took care of it. Each one got a comment noting it was superseded, and the associated branches were cleaned up.
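A hedged reconstruction of that cleanup, with the label filter, limit, and comment text as assumptions (it is wrapped in a function so nothing runs on sourcing):

```shell
#!/bin/sh
# Bulk-close open alert PRs with a superseded note, 8 at a time,
# deleting the associated branches as we go.
close_alert_prs() {
  gh pr list --state open --label alert --limit 300 --json number --jq '.[].number' |
    xargs -P 8 -I{} sh -c '
      gh pr comment "$1" --body "Superseded by the deduplicated alert flow."
      gh pr close "$1" --delete-branch
    ' _ {}
}
```

The `-P 8` keeps eight `gh` processes in flight; at 215 PRs the whole sweep takes a couple of minutes instead of an afternoon.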
## Lessons Learned
Deduplication first. Any alert-to-ticket/PR system needs deduplication from day one. It’s easy to think “I’ll add that later” but the accumulation happens fast — 215 PRs in 3 days at 15-minute intervals.
Pull-through caches are non-negotiable. If you run Kubernetes anywhere — homelab, production, edge — deploy a registry cache. Docker Hub rate limits (100 pulls/6hr anonymous) and connectivity issues will bite you eventually. Zot deploys in minutes and eliminates the dependency entirely.
GitHub Actions is a surprisingly good remediation engine. It has secrets management, parallel job execution, conditional logic, and a built-in audit trail. The PR-as-incident pattern gives you review workflows, status checks, and auto-merge for free.
Clock skew causes more TLS failures than expired certificates. The number one cause of `ErrImagePull` with TLS errors in homelabs is node clock drift. A simple NTP check in your diagnostic script saves hours of debugging.
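One quick hedged check, assuming outbound HTTPS is available: compare the node clock against the `Date` header a remote server returns. A rough sanity check, not a substitute for proper NTP:

```shell
#!/bin/sh
# Rough clock-skew check: diff the local clock against Docker Hub's HTTP
# Date header. GNU date assumed (standard on most Linux homelab nodes).
abs_diff() {
  d=$(( $1 - $2 ))
  [ "$d" -lt 0 ] && d=$(( -d ))
  echo "$d"
}

clock_skew_seconds() {
  remote=$(curl -sI https://registry-1.docker.io/v2/ |
    awk -F': ' 'tolower($1)=="date" {print $2}' | tr -d '\r')
  abs_diff "$(date +%s)" "$(date -d "$remote" +%s)"
}
# A skew of minutes is enough for certificate validity checks to fail
# ("certificate is not yet valid"); fix NTP before chasing certs.
```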
## The Stack
| Component | Purpose |
|---|---|
| GitHub Actions | Alert intake, remediation, auto-merge |
| ArgoCD Notifications | Alert source (fires on app degradation) |
| Zot Registry | Pull-through cache for Docker Hub/GHCR/Quay |
| Longhorn | Storage (with custom ArgoCD health checks) |
| k3s | Kubernetes distribution |
## What’s Next
The current system handles the two most common failure modes in my cluster. Next up:
- OOMKill remediation — auto-bump resource limits when pods get OOMKilled repeatedly
- Certificate renewal — detect expiring certs and trigger renewal workflows
- Slack integration — post to a channel when a PR is opened or auto-merged, so I don’t have to watch GitHub
- Metrics dashboard — track alert frequency, mean time to remediation, and auto-merge success rate
The code is all in `git-steer-state` if you want to fork it and adapt it for your own cluster. The only secrets you need are a kubeconfig, an ArgoCD token, and a GitHub PAT.
Built with GitHub Actions, ArgoCD, Zot, and one very patient homelab cluster.