Monitoring Your Infrastructure with Prometheus & Grafana

Picture this: it’s 3 AM, and your infrastructure is silently degrading. CPU usage creeps upward, memory leaks slowly consume resources, and disk space dwindles. Without monitoring, you’ll discover these problems when users report errors or systems crash. With proper observability, you see these patterns emerging hours or days in advance, turning potential disasters into routine maintenance windows.

Modern infrastructure monitoring isn’t just about collecting data; it’s about turning raw metrics into actionable insights. Prometheus and Grafana form the backbone of this transformation: Prometheus collects and stores time-series data, while Grafana turns those numbers into dashboards that show the health and behavior of your systems at a glance.

Why Monitoring Matters

The difference between reactive and proactive operations is visibility. Reactive teams respond to outages, explaining what went wrong after the fact. Proactive teams see problems forming, address issues before they impact users, and make data-driven decisions about capacity and performance.

Effective monitoring delivers three critical capabilities:

Real-time awareness: Know what’s happening right now across every component of your infrastructure. CPU, memory, disk, network, application metrics—all flowing into a unified view.

Historical context: Understand trends over time. Is this spike normal for Monday morning? How does today’s traffic compare to last week? Context transforms data points into meaningful patterns.

Intelligent alerting: Get notified about problems that matter, not every minor fluctuation. Well-configured alerts escalate issues before they become emergencies, and stay silent when everything runs smoothly.

Prerequisites

Before diving into Prometheus and Grafana, ensure you have:

A Linux server or Kubernetes cluster where you’ll install the monitoring stack
Root or sudo access for installation and configuration
Basic familiarity with YAML configuration files
Understanding of your application’s key performance indicators
Network access between your monitoring server and the systems you’ll monitor

Setting Up Prometheus

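Prometheus follows a pull model: it scrapes metrics from HTTP endpoints at regular intervals. Targets only need to expose an endpoint, the list of what gets monitored lives in one configuration file, and a failed scrape is itself a signal, recorded as up == 0 for that target.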
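To make the pull model concrete, here is a minimal sketch in Python using only the standard library. One half plays a scrape target that serves metrics as plain text, and the other half plays Prometheus making a single HTTP GET. The metric name demo_requests_total is made up for illustration, and the parser is deliberately simplified: it ignores labels that contain spaces and optional timestamps, both of which the real exposition format allows.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric, used only for this illustration.
METRIC_BODY = (
    "# HELP demo_requests_total Requests handled by the demo target.\n"
    "# TYPE demo_requests_total counter\n"
    "demo_requests_total 42\n"
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Plays the role of a scrape target: serves plain-text metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRIC_BODY.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def scrape(url):
    """Plays the role of Prometheus: one HTTP GET, returns {name: value}."""
    # Ignore proxy environment variables; the target in this demo is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def run_demo():
    """Start the target on a free local port, scrape it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Prometheus repeats the scrape step for every configured target on every scrape interval and stores each value with a timestamp, which is what turns these snapshots into time series.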

Installing Prometheus

Download and install Prometheus on your monitoring server:

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz

# Extract the archive
tar xvfz prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

# Move binaries to system path
sudo mv prometheus promtool /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo mv consoles console_libraries /etc/prometheus/

Configuring Prometheus

Create your main configuration file at /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

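# Rules for aggregation and alerting
# Relative paths are resolved against the directory of this file.
# This guide only creates alerts.yml; recording_rules.yml is a placeholder.
rule_files:
  - "alerts.yml"
  - "recording_rules.yml"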

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1.example.com:9100'
          - 'server2.example.com:9100'
          - 'server3.example.com:9100'

  # Application metrics
  - job_name: 'api'
    static_configs:
      - targets:
          - 'api1.example.com:8080'
          - 'api2.example.com:8080'
    metrics_path: '/metrics'

Installing Node Exporter

Node Exporter collects system-level metrics. Install it on every server you want to monitor:

# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz

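# Extract and install
tar xvfz node_exporter-1.6.0.linux-amd64.tar.gz
sudo mv node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/

# Create the unprivileged system user that the service below runs as
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter

# Create a systemd service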
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Starting Prometheus

Run Prometheus with your configuration:

prometheus --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries

Visit http://your-server:9090 to access the Prometheus web interface, then open Status > Targets and verify that every target is listed with the state UP.
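The same check can be scripted against the Prometheus HTTP API, which exposes target status at /api/v1/targets. The sketch below is one way to list targets that are not healthy. It assumes Prometheus is reachable at the base URL you pass in, and it reads only the health, labels, and lastError fields of each active target.

```python
import json
import urllib.request


def fetch_targets(base_url):
    """GET /api/v1/targets from a running Prometheus and decode the JSON."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, instance, last_error) for every active target not 'up'."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            labels = target["labels"]
            problems.append(
                (labels["job"], labels["instance"], target.get("lastError", ""))
            )
    return problems
```

Calling unhealthy_targets(fetch_targets("http://localhost:9090")) on the monitoring server should return an empty list once every exporter is reachable.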

Creating Grafana Dashboards

Grafana transforms Prometheus metrics into stunning visualizations that make understanding your infrastructure intuitive and immediate.

Installing Grafana

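# Add the Grafana APT repository (Debian/Ubuntu) with a dedicated signing keyring
sudo apt-get install -y apt-transport-https wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list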

# Install Grafana
sudo apt-get update
sudo apt-get install grafana

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

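Access Grafana at http://your-server:3000 (default credentials: admin/admin). Grafana prompts you to set a new password on first login; do so before exposing the server to anyone else.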

Adding Prometheus as a Data Source

Navigate to Configuration > Data Sources (in newer Grafana releases the same page is under Connections > Data sources)
Click “Add data source”
Select Prometheus
Set the URL to http://localhost:9090
Click “Save & Test”
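If you would rather script the steps above than click through them, Grafana also accepts data sources through its HTTP API at /api/datasources. The following is a sketch rather than a complete provisioning tool: it only builds the request, it assumes basic authentication is enabled on your Grafana instance, and the data source name "Prometheus" is an arbitrary choice.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) a POST to Grafana's /api/datasources."""
    payload = {
        "name": "Prometheus",   # arbitrary display name
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",      # Grafana's backend performs the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it. Building the request separately makes it easy to inspect the payload before anything touches a live Grafana.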

Building Your First Dashboard

Create a comprehensive system overview dashboard:

CPU Usage Panel: Track processor utilization across servers

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory Usage Panel: Monitor the percentage of memory in use

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

Disk Usage Panel: Watch storage consumption

100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

Network Traffic Panel: Visualize bandwidth utilization

rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

Configure each panel with appropriate thresholds: green for healthy, yellow for warning, red for critical. Use time series graphs for trends, gauges for current values, and stat panels for key metrics.
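The CPU query is the least obvious of the four, so it helps to work through the arithmetic by hand. The sketch below mirrors what the expression computes, using made-up counter values for a two-core host. It is a simplification: the real rate() function also handles counter resets and extrapolates to the edges of the time window.

```python
def per_second_rate(first_value, last_value, elapsed_seconds):
    """Average per-second increase of a counter between two samples."""
    return (last_value - first_value) / elapsed_seconds


def cpu_usage_percent(idle_rates):
    """100 minus the average idle fraction across cores, as a percentage."""
    average_idle = sum(idle_rates) / len(idle_rates)
    return 100 - average_idle * 100


# Made-up idle-second counters for a 2-core host, sampled 300 s (5m) apart.
core0_idle = per_second_rate(10_000.0, 10_240.0, 300)  # 0.8 idle seconds per second
core1_idle = per_second_rate(20_000.0, 20_180.0, 300)  # 0.6 idle seconds per second

# Average idle fraction is 0.7, so the host was about 30 percent busy.
usage = cpu_usage_percent([core0_idle, core1_idle])
```

Because node_cpu_seconds_total counts seconds spent in each mode, its per-second rate for mode="idle" is the fraction of time a core sat idle, and avg by(instance) folds the cores of one host into a single number.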

Setting Up Alerts

Alerts transform monitoring from passive observation to active protection. Configure alerting rules in /etc/prometheus/alerts.yml:

groups:
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      # Low disk space
      - alert: LowDiskSpace
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100) > 85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is above 85% (current value: {{ $value }}%)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: 100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% (current value: {{ $value }}%)"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
          description: "The service has been down for more than 1 minute"
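The for field in each rule is what keeps brief spikes from paging anyone: an alert sits in a pending state until its expression has held continuously for that long, and only then fires. The sketch below simulates that behavior for a single alert. It is a simplified model that ignores options such as keep_firing_for and tracks one series instead of one alert per label set.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, True when the rule expression
    returned a result (the threshold was exceeded) at that evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for index, exceeded in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not exceeded:
            active_since = None  # any gap resets the clock
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# ServiceDown: evaluated every 30 s (the group interval) with for: 1m.
timeline = alert_states([True, True, True, False, True], 30, 60)
```

Here timeline is pending, pending, firing, inactive, pending: the alert fires at the third evaluation, and the single healthy evaluation resets the clock so the next failure starts over as pending.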

Configure Alertmanager to route notifications to your preferred channels: email, Slack, PagerDuty, or custom webhooks. Routing on the severity label set in the rules above lets critical alerts reach the right people immediately, while warning-level alerts can be grouped and repeated less often through Alertmanager’s group_interval and repeat_interval settings.
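For the custom webhook option, Alertmanager sends a JSON document containing an alerts list, where each entry carries its status, labels, and annotations. A minimal receiver could look like the sketch below. The port 5001 and the one-line output format are assumptions made for illustration, and a real receiver would forward the text to a chat or paging tool instead of printing it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Turn an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        severity = labels.get("severity", "unknown")
        # Fall back to the alert name when a rule has no summary annotation.
        summary = annotations.get("summary", labels.get("alertname", ""))
        lines.append(f"[{alert['status']}] {severity}: {summary}")
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts the POST requests that Alertmanager sends to a webhook receiver."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        for line in summarize(payload):
            print(line)  # replace with a call to your chat or paging tool
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    """Run the receiver; the port is hypothetical and must match the
    webhook URL configured in Alertmanager."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Keeping summarize separate from the HTTP handler means the formatting can be tested against a saved payload without running a server.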

What’s Next

You’ve built a foundation for infrastructure observability, but this is just the beginning. As your monitoring matures, consider:

Custom application metrics: Instrument your code to expose business-level metrics—user signups, transaction processing times, error rates by endpoint.

Advanced exporters: Deploy specialized exporters for databases (MySQL, PostgreSQL), message queues (RabbitMQ, Kafka), and cloud services (AWS, GCP, Azure).

Distributed tracing: Add Jaeger or Tempo to trace requests across microservices, understanding exactly where latency occurs in complex transactions.

Long-term storage: Implement Thanos or Cortex to retain historical metrics well beyond what a single Prometheus server keeps locally, enabling year-over-year comparisons and deep forensic analysis.

The systems you build deserve to be understood. With Prometheus collecting metrics and Grafana visualizing them, you’ve transformed your infrastructure from an opaque black box into a transparent, observable system where every component tells its story in real-time.
