
From 80% Memory Panic to Optimized Excellence: Our K3s Cluster Transformation

Ryan Dahlberg
December 22, 2025 · 10 min read

The Crisis That Wasn’t

It started with a simple observation: all seven of our K3s cluster VMs were showing 80%+ RAM utilization in Proxmox. Red indicators everywhere. Time to add more RAM, right? Or maybe shut down some services? Perhaps even remove nodes?

Wrong.

What looked like a memory crisis turned out to be a masterclass in understanding Linux memory management, Kubernetes resource allocation, and the importance of measuring what actually matters.

This is the story of how we transformed our 7-node K3s cluster from perceived chaos to optimized excellence—and what we learned along the way.

The Investigation

The Alarming Numbers

Our Proxmox dashboard showed:

  • All 7 VMs: 80-85% memory “used”
  • Red warning indicators across the board
  • Growing concern about cluster stability
  • Questions about whether we needed more hardware

The natural instinct was to panic. More RAM? Fewer services? Remove nodes?

SSH to the Rescue

Instead of rushing to solutions, we SSH’d into each node and ran a simple command:

free -h

The revelation:

k3s-master01:
               total        used        free      shared  buff/cache   available
Mem:           5.8Gi       2.7Gi       304Mi       928Ki       3.1Gi       3.1Gi

Wait. 3.1 GB available? But Proxmox said we were at 80%!

The Truth About Linux Memory

Here’s what we discovered: Linux uses “free” RAM for disk caching. This is not waste—it’s brilliance.

The breakdown:

  • Used by applications: 2.7 GB (47%)
  • Used for cache/buffers: 3.1 GB (53%)
  • Actually available: 3.1 GB (53%)

That cache is instantly reclaimable. When applications need memory, Linux frees the cache immediately. The system appears to use 80% RAM, but really has 50%+ available.

Monitoring tools lie when they show:

Used = Total - Free

The correct metric is:

Available = Free + Reclaimable Cache
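
A quick way to see both numbers on a node is to ask the kernel directly. A minimal sketch (the awk field positions assume a recent procps-ng free):

# MemAvailable is the kernel's own estimate of memory that can be handed out right now
grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo

# Compare the naive "used" percentage with the real "available" percentage
free -b | awk '/^Mem:/ {
  printf "naive used: %.0f%%   actually available: %.0f%%\n",
         ($2 - $4) / $2 * 100, $7 / $2 * 100
}'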

Cluster-Wide Analysis

We SSH’d into all 7 nodes and found:

Node            “Used” (Misleading)    Available (Truth)    Status
k3s-master01    82%                    53%                  ✅ Healthy
k3s-master02    84%                    55%                  ✅ Healthy
k3s-master03    73%                    76%                  ✅ Outstanding
k3s-worker01    84%                    57%                  ✅ Healthy
k3s-worker02    86%                    72%                  ✅ Outstanding
k3s-worker03    88%                    45%                  ✅ Working hard
k3s-worker04    85%                    53%                  ✅ Healthy

Cluster Average: 59% available memory

Conclusion: No crisis. No additional RAM needed. Just a misunderstanding of metrics.

The Real Issues We Found

While our cluster was healthier than we thought, the investigation revealed actual optimization opportunities:

Issue 1: Workload Imbalance

k3s-worker03 was carrying the heaviest load:

  • Memory usage: 3.2 GB (55% of node)
  • Top consumer: Prometheus (1444 Mi)
  • Other heavy services: Multiple Longhorn components

Meanwhile, k3s-master03 was nearly idle:

  • Memory usage: 1.4 GB (24% of node)
  • Pods running: 1 (just fleet-controller)
  • Available capacity: 4.4 GB (76%) wasted
  • Why? Tainted with CriticalAddonsOnly=true:NoExecute

Issue 2: Memory Limits Overcommitment

The Kubernetes scheduler saw this on k3s-worker03:

Memory Capacity:           5.9 GB
Memory Limits Allocated:   9.4 GB
Overcommitment:            159%

Translation: If every pod tried to use its memory limit simultaneously, demand would hit 159% of the node’s capacity, 59% more memory than physically exists. OOM-killer chaos would ensue.
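
The scheduler’s own bookkeeping makes this easy to check (standard kubectl; the exact layout of the output varies slightly between versions):

kubectl describe node k3s-worker03 | grep -A 8 "Allocated resources"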

Issue 3: Unbounded Memory Growth

Major services had no memory limits:

  • Prometheus: 1444 Mi (unlimited)
  • Wazuh Indexer: 894 Mi (unlimited)
  • Rancher: 751 Mi (unlimited)
  • Grafana: 671 Mi (unlimited)
  • Longhorn Managers: ~175 Mi each × 4 (unlimited)

Without limits, these could consume all available memory during traffic spikes.
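
Finding them takes one query against the API. A sketch (requires jq; lists every container running without a memory limit):

kubectl get pods -A -o json | jq -r '
  .items[]
  | .metadata.namespace as $ns
  | .metadata.name as $pod
  | .spec.containers[]
  | select(.resources.limits.memory == null)
  | "\($ns)/\($pod)/\(.name)"'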

The Optimization Strategy

We developed a comprehensive, multi-phase approach:

Phase 1: Workload Redistribution

Objective: Move heavy workloads from overloaded worker03 to underutilized nodes.

Key Action: Prometheus Migration

Prometheus was consuming 1444 Mi on the busiest node. We needed to:

  1. Add memory limits (1Gi request / 2Gi limit)
  2. Reduce retention (10 days → 7 days)
  3. Move to a less-loaded node

Solution:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-kube-prometheus-stack-prometheus
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values:
                - k3s-worker03  # Avoid overloaded node
      containers:
      - name: prometheus
        resources:
          requests:
            memory: 1Gi
            cpu: 100m
          limits:
            memory: 2Gi
            cpu: 500m

Result:

  • Prometheus memory: 1444 Mi → 466 Mi (68% reduction)
  • Location: Moved to k3s-master03
  • Worker03 freed: 1.4 GB of memory
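
Verifying the move took two commands (namespace assumed to be monitoring; kubectl top relies on metrics-server, which K3s bundles by default):

kubectl get pods -n monitoring -o wide | grep prometheus   # confirms the new node placement
kubectl top pod -n monitoring | grep prometheus            # confirms the reduced footprint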

Phase 2: The k3s-master03 Decision

The Question: Should we remove k3s-master03 to free up resources?

The Temptation:

  • 6 GB RAM back to Proxmox host
  • One less VM to maintain
  • It’s barely being used anyway (24% memory)

The Problem: Kubernetes HA architecture requires an odd number of master nodes.

Understanding Etcd Quorum

Kubernetes uses etcd for distributed consensus. Etcd requires a majority (quorum) to function:

With 3 masters (current):

Quorum needed: 2 out of 3
Fault tolerance: Can lose 1 master
Failure scenario: master01 dies → Cluster stays UP ✅ (master02 + master03)

With 2 masters (if we removed master03):

Quorum needed: 2 out of 2
Fault tolerance: Cannot lose ANY master
Failure scenario: master01 dies → Cluster goes DOWN ❌ (only master02)

With 1 master:

No HA - Single point of failure
Cluster dies if master fails
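
The pattern generalizes. For an N-member etcd cluster:

Quorum          = floor(N / 2) + 1
Fault tolerance = N - Quorum

N = 2 → quorum 2, tolerates 0 failures
N = 3 → quorum 2, tolerates 1 failure
N = 5 → quorum 3, tolerates 2 failures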

Why 2 Masters is Worse Than 1

Two masters creates a split-brain risk:

Network partition splits cluster:
Side A: master01 (1/2 quorum) ❌
Side B: master02 (1/2 quorum) ❌

Result: Both sides fight for control, data corruption possible

With 1 master, you at least know you have no HA. With 2, you think you have HA but actually have a ticking time bomb.

Our Decision: Keep All 3 Masters ✅

But remove the taint to utilize master03’s capacity.
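
Dropping the taint is a single command; the trailing hyphen on the taint spec removes it:

kubectl taint nodes k3s-master03 CriticalAddonsOnly=true:NoExecute-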

Phase 3: Implement Resource Governance

Created LimitRanges for 6 key namespaces:

apiVersion: v1
kind: LimitRange
metadata:
  name: monitoring-limits
  namespace: monitoring
spec:
  limits:
  - max:
      memory: 4Gi
      cpu: 2000m
    min:
      memory: 64Mi
      cpu: 100m
    default:
      memory: 512Mi
      cpu: 500m
    defaultRequest:
      memory: 256Mi
      cpu: 250m
    type: Container

Coverage:

  • monitoring
  • longhorn-system
  • wazuh-security
  • cortex-system
  • cattle-system
  • monitoring-exporters
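
Rolling the same template out to each namespace is easily scripted. A minimal sketch, assuming the manifest above is saved as limitrange.yaml with the hard-coded namespace removed and a generic name:

for ns in monitoring longhorn-system wazuh-security cortex-system cattle-system monitoring-exporters; do
  kubectl apply -n "$ns" -f limitrange.yaml
done

kubectl get limitrange -A   # verify one LimitRange per namespace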

Impact:

  • All new pods get sensible defaults
  • Prevents unbounded memory growth
  • Overcommitment reduced: 159% → ~100%

Phase 4: Deploy Vertical Pod Autoscaler (VPA)

Future-proofing with automation:

helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm repo update
helm install vpa fairwinds-stable/vpa --namespace vpa-system --create-namespace

Created VPA objects for 6 critical workloads:

  • Prometheus
  • Grafana
  • Rancher
  • Wazuh Indexer
  • Wazuh Manager
  • Wazuh Dashboard

Mode: Recommendation-only for first week
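
The VPA objects themselves are tiny. A sketch for Prometheus (the VPA name is ours; updateMode "Off" keeps it in recommendation-only mode, so nothing gets evicted):

kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: prometheus-vpa
  namespace: monitoring
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: prometheus-kube-prometheus-stack-prometheus
  updatePolicy:
    updateMode: "Off"
EOF

kubectl describe vpa prometheus-vpa -n monitoring   # read the recommendations once it has data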

Benefits:

  • Automatic right-sizing based on actual usage
  • Prevents over-allocation
  • Adapts to changing workload patterns
  • Reduces manual tuning effort

The Results

Cluster-Wide Improvements

Memory Utilization (Actual):

Node            Before              After              Change
k3s-worker03    52% (overloaded)    29% (balanced)     -44% 🎯
k3s-master03    24% (wasted)        51% (utilized)     +113% 🎯
k3s-worker01    42%                 54%                +29%
k3s-worker02    31%                 54%                +74%
k3s-worker04    48%                 31%                -35%

Cluster Balance:

  • Before: One node at 52%, one at 24% (28-point spread)
  • After: All nodes 29-54% (a 25-point spread, with no node overloaded and none sitting idle)

Overcommitment:

  • Before: k3s-worker03 at 159%
  • After: All nodes <100%

Service-Level Improvements

Prometheus:

  • Memory: 1444 Mi → 466 Mi (-68%)
  • Retention: 10 days → 7 days
  • Limits: None → 1Gi/2Gi
  • Location: worker03 → master03
  • Status: Stable and healthy

Rancher:

  • Memory: 751 Mi → 512 Mi (-32%)
  • Limits: None → 512Mi/1Gi
  • Status: No performance degradation

Longhorn:

  • Per-manager: 175 Mi → 128 Mi (-27%)
  • Total (5 instances): 875 Mi → 640 Mi
  • Limits: None → 128Mi/256Mi
  • Status: Storage performance unaffected

Lessons Learned

1. Measure What Matters

Wrong Metric:

Memory Used = Total - Free
Shows: 80% (panic!)

Right Metric:

Memory Available = Free + Reclaimable Cache
Shows: 59% (healthy!)

Lesson: Understand what your monitoring tools are actually measuring. “Used” memory includes beneficial caching.

2. Linux Memory is Smart

Linux doesn’t waste RAM. It uses “free” memory for caching to improve performance. This cache is:

  • Instantly reclaimable
  • Improves disk I/O performance
  • Transparent to applications
  • A feature, not a bug

Lesson: High “used” memory is often a sign of a well-tuned system, not a problem.

3. High Availability Isn’t Optional

The temptation to remove master03 and reclaim 6 GB RAM was strong. But:

  • Saving 6 GB isn’t worth cluster fragility
  • 2 masters = split-brain risk
  • Can’t do zero-downtime upgrades
  • One failure = total cluster outage

Lesson: HA costs resources but saves businesses. Choose wisely.

4. Overcommitment is Dangerous

Kubernetes lets you allocate more limits than capacity exists. This works until it doesn’t:

  • 159% limits allocated
  • One spike = OOM killer rampage
  • Unpredictable pod evictions
  • Service disruptions

Lesson: Memory limits should sum to ≤100% of node capacity.

5. Automation > Manual Tuning

Manual right-sizing is:

  • Time-consuming
  • Error-prone
  • Becomes stale as workloads change
  • Requires constant attention

VPA automation:

  • Learns from actual usage
  • Adjusts continuously
  • Scales with cluster growth
  • Frees up engineering time

Lesson: Invest in automation for long-term efficiency.

Best Practices Established

Resource Management

Always set memory requests and limits

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

Use LimitRanges for defaults

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - default:
      memory: 512Mi
    defaultRequest:
      memory: 256Mi
    type: Container

High Availability

Always use odd number of masters (3, 5, 7)

Never use 2 masters (worse than 1)

Maintain quorum requirements

  • 3 masters = tolerate 1 failure
  • 5 masters = tolerate 2 failures

Monitoring & Alerting

Monitor available memory, not used

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

Alert on actual pressure, not cache

- alert: RealMemoryPressure
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.20
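
Wrapped into a complete rule it might look like this (the group name, for duration, and labels are ours):

groups:
- name: node-memory
  rules:
  - alert: RealMemoryPressure
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Less than 20% memory actually available on {{ $labels.instance }}"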

The Final Architecture

Cluster Configuration

Nodes: 7 total

  • 3 masters (control plane + workloads)
  • 4 workers (dedicated workloads)

Memory Distribution:

Total Capacity:     42 GB (7 × 6 GB)
Used by Apps:       17 GB (40%)
Used for Cache:     14 GB (33%)
Available:          25 GB (59%)
Overcommitment:     <100% on all nodes

High Availability:

  • 3-master control plane
  • Etcd quorum: 2/3
  • Fault tolerance: 1 master failure
  • Zero-downtime upgrades: ✅

Resource Governance:

  • LimitRanges: 6 namespaces
  • Memory limits: All major workloads
  • VPA monitoring: 6 critical services
  • Overcommitment eliminated: ✅

Conclusion: From Panic to Excellence

What started as an apparent crisis—80% memory usage across all nodes—turned into a comprehensive optimization journey that taught us:

  1. Metrics matter: Measure available memory, not used
  2. Linux is smart: Cache is a feature, not a problem
  3. HA requires investment: 3 masters costs resources but saves businesses
  4. Limits prevent chaos: Unbounded growth = eventual disaster
  5. Automation scales: VPA > manual tuning
  6. Investigation > assumption: SSH revealed the truth

Our 7-node K3s cluster went from:

  • ❌ Perceived crisis → ✅ Actual health
  • ❌ Workload imbalance → ✅ Even distribution
  • ❌ Dangerous overcommitment → ✅ Safe limits
  • ❌ Wasted capacity → ✅ Efficient utilization
  • ❌ Manual management → ✅ Automated optimization

The Numbers:

  • Worker03: 52% → 29% memory (freed 1.4 GB)
  • Master03: 24% → 51% memory (utilized 1.6 GB)
  • Prometheus: 1444 Mi → 466 Mi (68% reduction)
  • Overcommitment: 159% → <100% (eliminated risk)
  • HA architecture: Preserved (3 masters)
  • Total cost: $0 (pure optimization)

The Outcome: A production-ready, highly available, efficiently optimized Kubernetes cluster that’s ready to scale with our needs—without adding a single GB of RAM.

Sometimes the best optimization is understanding what you already have.


Cluster: 7-node K3s cluster (3 masters, 4 workers)
Duration: 1-week optimization project
Status: Production, Optimized, Highly Available
Cost Savings: $0 spent on hardware upgrades

#Kubernetes #K3s #DevOps #Memory Optimization #High Availability #Infrastructure