From 80% Memory Panic to Optimized Excellence: Our K3s Cluster Transformation
The Crisis That Wasn’t
It started with a simple observation: all seven of our K3s cluster VMs were showing 80%+ RAM utilization in Proxmox. Red indicators everywhere. Time to add more RAM, right? Or maybe shut down some services? Perhaps even remove nodes?
Wrong.
What looked like a memory crisis turned out to be a masterclass in understanding Linux memory management, Kubernetes resource allocation, and the importance of measuring what actually matters.
This is the story of how we transformed our 7-node K3s cluster from perceived chaos to optimized excellence—and what we learned along the way.
The Investigation
The Alarming Numbers
Our Proxmox dashboard showed:
- All 7 VMs: 80-85% memory “used”
- Red warning indicators across the board
- Growing concern about cluster stability
- Questions about whether we needed more hardware
The natural instinct was to panic. More RAM? Fewer services? Remove nodes?
SSH to the Rescue
Instead of rushing to solutions, we SSH’d into each node and ran a simple command:
free -h
The revelation:
k3s-master01:
               total        used        free      shared  buff/cache   available
Mem:           5.8Gi       2.7Gi       304Mi       928Ki       3.1Gi       3.1Gi
Wait. 3.1 GB available? But Proxmox said we were at 80%!
The Truth About Linux Memory
Here’s what we discovered: Linux uses “free” RAM for disk caching. This is not waste—it’s brilliance.
The breakdown:
- Used by applications: 2.7 GB (47%)
- Used for cache/buffers: 3.1 GB (53%)
- Actually available: 3.1 GB (53%)
That cache is instantly reclaimable. When applications need memory, Linux frees the cache immediately. The system appears to use 80% RAM, but really has 50%+ available.
Monitoring tools lie when they show:
Used = Total - Free
The correct metric is:
Available = Free + Reclaimable Cache
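A quick way to pull the honest number from every node is a small SSH loop (a sketch; the hostnames are ours and it assumes key-based SSH access):
# Print the percentage of truly available memory on each node
for node in k3s-master01 k3s-master02 k3s-master03 k3s-worker01 k3s-worker02 k3s-worker03 k3s-worker04; do
  printf '%s: ' "$node"
  ssh "$node" free -m | awk '/^Mem:/ {printf "%d%% available\n", $7/$2*100}'
done
Column 7 of free -m is the "available" figure, exactly the number the Proxmox gauge ignores.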
Cluster-Wide Analysis
We SSH’d into all 7 nodes and found:
| Node | “Used” (Misleading) | Available (Truth) | Status |
|---|---|---|---|
| k3s-master01 | 82% | 53% | ✅ Healthy |
| k3s-master02 | 84% | 55% | ✅ Healthy |
| k3s-master03 | 73% | 76% | ✅ Outstanding |
| k3s-worker01 | 84% | 57% | ✅ Healthy |
| k3s-worker02 | 86% | 72% | ✅ Outstanding |
| k3s-worker03 | 88% | 45% | ✅ Working hard |
| k3s-worker04 | 85% | 53% | ✅ Healthy |
Cluster Average: 59% available memory
Conclusion: No crisis. No additional RAM needed. Just a misunderstanding of metrics.
The Real Issues We Found
While our cluster was healthier than we thought, the investigation revealed actual optimization opportunities:
Issue 1: Workload Imbalance
k3s-worker03 was carrying the heaviest load:
- Memory usage: 3.2 GB (55% of node)
- Top consumer: Prometheus (1444 Mi)
- Other heavy services: Multiple Longhorn components
Meanwhile, k3s-master03 was nearly idle:
- Memory usage: 1.4 GB (24% of node)
- Pods running: 1 (just fleet-controller)
- Available capacity: 4.4 GB (76%) wasted
- Why? Tainted with CriticalAddonsOnly=true:NoExecute
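Both the pod count and the taint are easy to confirm from kubectl:
# Show the taint that keeps regular workloads off master03
kubectl describe node k3s-master03 | grep -A 2 Taints
# Count the pods actually scheduled there
kubectl get pods -A --field-selector spec.nodeName=k3s-master03 -o name | wc -l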
Issue 2: Memory Limits Overcommitment
The Kubernetes scheduler saw this on k3s-worker03:
Memory Capacity: 5.9 GB
Memory Limits Allocated: 9.4 GB
Overcommitment: 159%
Translation: If every pod tried to use its full memory limit simultaneously, demand would reach 159% of the node's actual capacity. OOM killer chaos would ensue.
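These numbers come straight out of kubectl: the "Allocated resources" section of a node description reports requests and limits as a percentage of allocatable capacity.
# Look for the "Allocated resources" table and its Limits column
kubectl describe node k3s-worker03 | grep -A 9 "Allocated resources"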
Issue 3: Unbounded Memory Growth
Major services had no memory limits:
- Prometheus: 1444 Mi (unlimited)
- Wazuh Indexer: 894 Mi (unlimited)
- Rancher: 751 Mi (unlimited)
- Grafana: 671 Mi (unlimited)
- Longhorn Managers: ~175 Mi each × 4 (unlimited)
Without limits, these could consume all available memory during traffic spikes.
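Finding every offender is a one-liner if you have jq on hand (a sketch; it prints namespace/pod/container for each container with no memory limit set):
kubectl get pods -A -o json | jq -r '.items[] | .metadata.namespace as $ns | .metadata.name as $pod | .spec.containers[] | select(.resources.limits.memory == null) | "\($ns)/\($pod)/\(.name)"'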
The Optimization Strategy
We developed a comprehensive, multi-phase approach:
Phase 1: Workload Redistribution
Objective: Move heavy workloads from overloaded worker03 to underutilized nodes.
Key Action: Prometheus Migration
Prometheus was consuming 1444 Mi on the busiest node. We needed to:
- Add memory limits (1Gi request / 2Gi limit)
- Reduce retention (10 days → 7 days)
- Move to a less-loaded node
Solution:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-kube-prometheus-stack-prometheus
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values:
                - k3s-worker03  # Avoid overloaded node
      containers:
      - name: prometheus
        resources:
          requests:
            memory: 1Gi
            cpu: 100m
          limits:
            memory: 2Gi
            cpu: 500m
Result:
- Prometheus memory: 1444 Mi → 466 Mi (68% reduction)
- Location: Moved to k3s-master03
- Worker03 freed: 1.4 GB of memory
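The result is easy to verify after the rollout. K3s ships metrics-server by default, so kubectl top works out of the box (assuming the stack runs in the monitoring namespace with a single Prometheus replica):
# Which node did the Prometheus pod land on, and what does it consume now?
kubectl -n monitoring get pod prometheus-kube-prometheus-stack-prometheus-0 -o wide
kubectl -n monitoring top pod prometheus-kube-prometheus-stack-prometheus-0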
Phase 2: The k3s-master03 Decision
The Question: Should we remove k3s-master03 to free up resources?
The Temptation:
- 6 GB RAM back to Proxmox host
- One less VM to maintain
- It’s barely being used anyway (24% memory)
The Problem: Kubernetes HA architecture requires an odd number of master nodes.
Understanding Etcd Quorum
Kubernetes uses etcd for distributed consensus. Etcd requires a majority (quorum) to function:
With 3 masters (current):
Quorum needed: 2 out of 3
Fault tolerance: Can lose 1 master
Failure scenario: master01 dies → Cluster stays UP ✅ (master02 + master03)
With 2 masters (if we removed master03):
Quorum needed: 2 out of 2
Fault tolerance: Cannot lose ANY master
Failure scenario: master01 dies → Cluster goes DOWN ❌ (only master02)
With 1 master:
No HA - Single point of failure
Cluster dies if master fails
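With K3s and embedded etcd, quorum health is easy to spot-check from kubectl (a quick sketch; the /readyz/etcd probe is available on recent Kubernetes releases):
# All three masters should carry the etcd role label
kubectl get nodes -l node-role.kubernetes.io/etcd=true
# The API server's readiness endpoint includes an etcd health check
kubectl get --raw='/readyz/etcd'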
Why 2 Masters is Worse Than 1
Two masters creates a split-brain risk:
Network partition splits cluster:
Side A: master01 (1/2 quorum) ❌
Side B: master02 (1/2 quorum) ❌
Result: Neither side can reach quorum, so etcd stops accepting writes and the entire control plane freezes
With 1 master, you at least know you have no HA. With 2, you think you have HA but actually have a ticking time bomb.
Our Decision: Keep All 3 Masters ✅
But remove the taint to utilize master03’s capacity.
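Removing the taint is a one-liner; the trailing dash means "remove":
# Allow regular workloads to schedule onto master03 again
kubectl taint nodes k3s-master03 CriticalAddonsOnly=true:NoExecute-
# Confirm it is gone
kubectl describe node k3s-master03 | grep Taints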
Phase 3: Implement Resource Governance
Created LimitRanges for 6 key namespaces:
apiVersion: v1
kind: LimitRange
metadata:
  name: monitoring-limits
  namespace: monitoring
spec:
  limits:
  - max:
      memory: 4Gi
      cpu: 2000m
    min:
      memory: 64Mi
      cpu: 100m
    default:
      memory: 512Mi
      cpu: 500m
    defaultRequest:
      memory: 256Mi
      cpu: 250m
    type: Container
Coverage:
- monitoring
- longhorn-system
- wazuh-security
- cortex-system
- cattle-system
- monitoring-exporters
Impact:
- All new pods get sensible defaults
- Prevents unbounded memory growth
- Overcommitment reduced from 159% → ~100%
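A quick check that the defaults landed in all six namespaces:
# One LimitRange per governed namespace
kubectl get limitrange -A
# Inspect the effective defaults for one of them
kubectl describe limitrange monitoring-limits -n monitoring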
Phase 4: Deploy Vertical Pod Autoscaler (VPA)
Future-proofing with automation, using the Fairwinds VPA chart (their Helm repo needs to be added first):
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install vpa fairwinds-stable/vpa --namespace vpa-system --create-namespace
Created VPA objects for 6 critical workloads:
- Prometheus
- Grafana
- Rancher
- Wazuh Indexer
- Wazuh Manager
- Wazuh Dashboard
Mode: Recommendation-only for first week
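The VPA objects themselves are tiny. Here is a sketch of the Grafana one in recommendation-only mode (the Deployment name assumes the standard kube-prometheus-stack release name; adjust to yours):
# updateMode "Off" = observe and recommend, never evict or resize
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: grafana-vpa
  namespace: monitoring
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kube-prometheus-stack-grafana   # assumed release/Deployment name
  updatePolicy:
    updateMode: "Off"
EOF
# Read the recommendations once some usage history exists
kubectl -n monitoring describe vpa grafana-vpa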
Benefits:
- Automatic right-sizing based on actual usage
- Prevents over-allocation
- Adapts to changing workload patterns
- Reduces manual tuning effort
The Results
Cluster-Wide Improvements
Memory Utilization (Actual):
| Node | Before | After | Change |
|---|---|---|---|
| k3s-worker03 | 52% (overloaded) | 29% (balanced) | -44% 🎯 |
| k3s-master03 | 24% (wasted) | 51% (utilized) | +113% 🎯 |
| k3s-worker01 | 42% | 54% | +29% |
| k3s-worker02 | 31% | 54% | +74% |
| k3s-worker04 | 48% | 31% | -35% |
Cluster Balance:
- Before: One node at 52%, one at 24% (28-point spread)
- After: All nodes 29-54% (25-point spread, better distributed)
Overcommitment:
- Before: k3s-worker03 at 159%
- After: All nodes <100%
Service-Level Improvements
Prometheus:
- Memory: 1444 Mi → 466 Mi (-68%)
- Retention: 10 days → 7 days
- Limits: None → 1Gi/2Gi
- Location: worker03 → master03
- Status: Stable and healthy
Rancher:
- Memory: 751 Mi → 512 Mi (-32%)
- Limits: None → 512Mi/1Gi
- Status: No performance degradation
Longhorn:
- Per-manager: 175 Mi → 128 Mi (-27%)
- Total (5 instances): 875 Mi → 640 Mi
- Limits: None → 128Mi/256Mi
- Status: Storage performance unaffected
Lessons Learned
1. Measure What Matters
Wrong Metric:
Memory Used = Total - Free
Shows: 80% (panic!)
Right Metric:
Memory Available = Free + Reclaimable Cache
Shows: 59% (healthy!)
Lesson: Understand what your monitoring tools are actually measuring. “Used” memory includes beneficial caching.
2. Linux Memory is Smart
Linux doesn’t waste RAM. It uses “free” memory for caching to improve performance. This cache is:
- Instantly reclaimable
- Improves disk I/O performance
- Transparent to applications
- A feature, not a bug
Lesson: High “used” memory is often a sign of a well-tuned system, not a problem.
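If you want to prove this to yourself, you can ask the kernel to drop its caches (purely a demonstration; never needed in normal operation, and disk I/O will be slower until the cache warms back up):
# Before: a large buff/cache column, a small "free"
free -h
# Drop clean page cache, dentries and inodes
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# After: buff/cache shrinks, "free" grows, "available" barely moves
free -h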
3. High Availability Isn’t Optional
The temptation to remove master03 and reclaim 6 GB RAM was strong. But:
- Saving 6 GB isn’t worth cluster fragility
- 2 masters = split-brain risk
- Can’t do zero-downtime upgrades
- One failure = total cluster outage
Lesson: HA costs resources but saves the business when a node inevitably fails. Choose wisely.
4. Overcommitment is Dangerous
Kubernetes lets you allocate more limits than capacity exists. This works until it doesn’t:
- 159% limits allocated
- One spike = OOM killer rampage
- Unpredictable pod evictions
- Service disruptions
Lesson: Memory limits should sum to ≤100% of node capacity.
5. Automation > Manual Tuning
Manual right-sizing is:
- Time-consuming
- Error-prone
- Becomes stale as workloads change
- Requires constant attention
VPA automation:
- Learns from actual usage
- Adjusts continuously
- Scales with cluster growth
- Frees up engineering time
Lesson: Invest in automation for long-term efficiency.
Best Practices Established
Resource Management
✅ Always set memory requests and limits
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
✅ Use LimitRanges for defaults
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - default:
      memory: 512Mi
    defaultRequest:
      memory: 256Mi
    type: Container
High Availability
✅ Always use odd number of masters (3, 5, 7)
✅ Never use 2 masters (worse than 1)
✅ Maintain quorum requirements
- 3 masters = tolerate 1 failure
- 5 masters = tolerate 2 failures
Monitoring & Alerting
✅ Monitor available memory, not used
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
✅ Alert on actual pressure, not cache
- alert: RealMemoryPressure
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.20
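With kube-prometheus-stack this drops straight into a PrometheusRule object; a minimal sketch (the release label and the 10-minute hold time are our choices, not requirements):
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: real-memory-pressure
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed Helm release; must match the chart's ruleSelector
spec:
  groups:
  - name: node-memory
    rules:
    - alert: RealMemoryPressure
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.20
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Less than 20% of memory is actually available on {{ $labels.instance }}"
EOF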
The Final Architecture
Cluster Configuration
Nodes: 7 total
- 3 masters (control plane + workloads)
- 4 workers (dedicated workloads)
Memory Distribution:
Total Capacity: 42 GB (7 × 6 GB)
Used by Apps: 17 GB (40%)
Used for Cache: 14 GB (33%)
Available: 25 GB (59%)
Overcommitment: <100% on all nodes
High Availability:
- 3-master control plane
- Etcd quorum: 2/3
- Fault tolerance: 1 master failure
- Zero-downtime upgrades: ✅
Resource Governance:
- LimitRanges: 6 namespaces
- Memory limits: All major workloads
- VPA monitoring: 6 critical services
- Overcommitment eliminated: ✅
Conclusion: From Panic to Excellence
What started as an apparent crisis—80% memory usage across all nodes—turned into a comprehensive optimization journey that taught us:
- Monitoring matters: Measure what matters (available, not used)
- Linux is smart: Cache is a feature, not a problem
- HA requires investment: 3 masters cost resources but save the business when one fails
- Limits prevent chaos: Unbounded growth = eventual disaster
- Automation scales: VPA > manual tuning
- Investigation > assumption: SSH revealed the truth
Our 7-node K3s cluster went from:
- ❌ Perceived crisis → ✅ Actual health
- ❌ Workload imbalance → ✅ Even distribution
- ❌ Dangerous overcommitment → ✅ Safe limits
- ❌ Wasted capacity → ✅ Efficient utilization
- ❌ Manual management → ✅ Automated optimization
The Numbers:
- Worker03: 52% → 29% memory (freed 1.4 GB)
- Master03: 24% → 51% memory (utilized 1.6 GB)
- Prometheus: 1444 Mi → 466 Mi (68% reduction)
- Overcommitment: 159% → <100% (eliminated risk)
- HA architecture: Preserved (3 masters)
- Total cost: $0 (pure optimization)
The Outcome: A production-ready, highly available, efficiently optimized Kubernetes cluster that’s ready to scale with our needs—without adding a single GB of RAM.
Sometimes the best optimization is understanding what you already have.
Cluster: 7-node K3s cluster (3 masters, 4 workers)
Duration: 1-week optimization project
Status: Production, Optimized, Highly Available
Cost Savings: $0 spent on hardware upgrades