Monitoring Your Infrastructure with Prometheus & Grafana
Transform your infrastructure into an observable system with real-time metrics, beautiful dashboards, and intelligent alerting
Picture this: it’s 3 AM, and your infrastructure is silently degrading. CPU usage creeps upward, memory leaks slowly consume resources, and disk space dwindles. Without monitoring, you’ll discover these problems when users report errors or systems crash. With proper observability, you see these patterns emerging hours or days in advance, turning potential disasters into routine maintenance windows.
Modern infrastructure monitoring isn’t just about collecting data; it’s about turning raw metrics into actionable insights. Prometheus and Grafana split that work: Prometheus scrapes and stores time-series data, while Grafana turns those numbers into dashboards that show the health and behavior of your systems at a glance.
Why Monitoring Matters
The difference between reactive and proactive operations is visibility. Reactive teams respond to outages, explaining what went wrong after the fact. Proactive teams see problems forming, address issues before they impact users, and make data-driven decisions about capacity and performance.
Effective monitoring delivers three critical capabilities:
Real-time awareness: Know what’s happening right now across every component of your infrastructure. CPU, memory, disk, network, application metrics—all flowing into a unified view.
Historical context: Understand trends over time. Is this spike normal for Monday morning? How does today’s traffic compare to last week? Context transforms data points into meaningful patterns.
Intelligent alerting: Get notified about problems that matter, not every minor fluctuation. Well-configured alerts escalate issues before they become emergencies, and stay silent when everything runs smoothly.
Prerequisites
Before diving into Prometheus and Grafana, ensure you have:
- A Linux server or Kubernetes cluster where you’ll install the monitoring stack
- Root or sudo access for installation and configuration
- Basic familiarity with YAML configuration files
- Understanding of your application’s key performance indicators
- Network access between your monitoring server and the systems you’ll monitor
Setting Up Prometheus
Prometheus follows a pull model: it scrapes metrics from HTTP endpoints at regular intervals. Targets need no knowledge of the monitoring server, and a failed scrape is itself a signal, because Prometheus records up = 0 for any target it cannot reach.
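To make the pull model concrete, here is a minimal sketch of a scrape target written with only the Python standard library. The metric name demo_requests_total, the port 8000, and the helpers render_metrics and serve are illustrative assumptions rather than part of any Prometheus library; a production service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these as it works.
REQUEST_COUNTS = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_requests_total Requests handled, by HTTP method.",
        "# TYPE demo_requests_total counter",
    ]
    for method in sorted(counts):
        lines.append(f'demo_requests_total{{method="{method}"}} {counts[method]}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics, which is all a scrape needs."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the assumed port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape job at port 8000 would be enough for Prometheus to start collecting the counter. The target only answers requests; the schedule lives entirely on the Prometheus side.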
Installing Prometheus
Download and install Prometheus on your monitoring server:
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
# Extract the archive
tar xvfz prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64
# Move binaries to system path
sudo mv prometheus promtool /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo mv consoles console_libraries /etc/prometheus/
Configuring Prometheus
Create your main configuration file at /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Rules for aggregation and alerting
rule_files:
  - "alerts.yml"
  - "recording_rules.yml"
# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node exporters for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1.example.com:9100'
          - 'server2.example.com:9100'
          - 'server3.example.com:9100'
  # Application metrics
  - job_name: 'api'
    static_configs:
      - targets:
          - 'api1.example.com:8080'
          - 'api2.example.com:8080'
    metrics_path: '/metrics'
Installing Node Exporter
Node Exporter collects system-level metrics. Install it on every server you want to monitor:
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
# Extract and install
tar xvfz node_exporter-1.6.0.linux-amd64.tar.gz
sudo mv node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
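# Create the unprivileged user the service runs as
sudo useradd --system --no-create-home --shell /bin/false node_exporter
# Create a systemd service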
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
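Once the service is running, a quick way to confirm it is serving data is to fetch its metrics page and look for a known series. The sketch below assumes Node Exporter listens on its default port 9100 on the local machine; parse_samples and fetch_metrics are hypothetical helper names, and the parser is deliberately simplified (it assumes sample lines carry no trailing timestamp, which holds for Node Exporter output).

```python
from urllib.request import urlopen


def parse_samples(text):
    """Map each series (name plus labels) to its value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE lines and blanks carry no sample
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url="http://localhost:9100/metrics"):
    """Download the exporter's metrics page as text (assumed local URL)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

For example, parse_samples(fetch_metrics()).get("node_load1") should return the current one-minute load average if the exporter is healthy.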
Starting Prometheus
Run Prometheus with your configuration:
prometheus --config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
Visit http://your-server:9090 to open the Prometheus web interface, then check Status > Targets to verify that every target is being scraped and reports as UP.
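The same check can be scripted against the Prometheus HTTP API, which is handy in a deployment pipeline. This is a sketch under the assumption that Prometheus is reachable at http://localhost:9090; the function names are hypothetical, while /api/v1/targets and the activeTargets, health, scrapeUrl, and lastError fields come from the Prometheus HTTP API.

```python
import json
from urllib.request import urlopen


def summarize_targets(payload):
    """Return (job, scrape URL, health, last error) for each active target."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        rows.append((
            target["labels"].get("job", ""),
            target["scrapeUrl"],
            target["health"],  # "up", "down", or "unknown"
            target.get("lastError", ""),
        ))
    return rows


def unhealthy(rows):
    """Keep only the targets whose last scrape did not succeed."""
    return [row for row in rows if row[2] != "up"]


def fetch_targets(base_url="http://localhost:9090"):
    """Query the targets endpoint (base URL is an assumption)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Running unhealthy(summarize_targets(fetch_targets())) returns an empty list when every configured target is being scraped successfully.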
Creating Grafana Dashboards
Grafana transforms Prometheus metrics into stunning visualizations that make understanding your infrastructure intuitive and immediate.
Installing Grafana
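# Add the Grafana APT repository and its signing key (Debian/Ubuntu)
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana
# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server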
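Access Grafana at http://your-server:3000 (default credentials: admin/admin). Grafana prompts you to set a new password on first login; do so before exposing the server to any network.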
Adding Prometheus as a Data Source
- Navigate to Configuration > Data Sources
- Click “Add data source”
- Select Prometheus
- Set the URL to http://localhost:9090
- Click “Save & Test”
Building Your First Dashboard
Create a comprehensive system overview dashboard:
CPU Usage Panel: Track processor utilization across servers
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory Usage Panel: Track the percentage of memory in use
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
Disk Usage Panel: Watch storage consumption
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)
Network Traffic Panel: Visualize bandwidth utilization
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
Configure each panel with appropriate thresholds: green for healthy, yellow for warning, red for critical. Use time series graphs for trends, gauges for current values, and stat panels for key metrics.
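The CPU expression is the least obvious of these queries, so it may help to work through its arithmetic by hand. The sketch below approximates what 100 - avg(rate(idle)) * 100 computes for a single instance; cpu_usage_percent is a hypothetical helper, and it ignores the counter-reset handling and window extrapolation that the real rate() function performs.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate 100 - avg(rate(idle counter)) * 100 for one instance.

    idle_start and idle_end map a CPU core label to the value of its
    idle-seconds counter at the start and end of the window.
    """
    # Per-core idle rate: idle seconds gained per second of wall time (0..1).
    rates = [
        (idle_end[cpu] - idle_start[cpu]) / window_seconds
        for cpu in idle_start
    ]
    # Average across cores, then convert "fraction idle" to "percent busy".
    avg_idle = sum(rates) / len(rates)
    return 100 - avg_idle * 100


# Two cores over a 5-minute (300 s) window:
# core 0 was idle for 240 s, core 1 for 180 s.
usage = cpu_usage_percent(
    {"0": 1000.0, "1": 2000.0},
    {"0": 1240.0, "1": 2180.0},
    300,
)
```

Here the cores were idle 80% and 60% of the time, which averages to 70% idle, so usage comes out at roughly 30. Subtracting the idle share is simpler than summing every non-idle mode, which is why the query filters on mode="idle".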
Setting Up Alerts
Alerts transform monitoring from passive observation to active protection. Configure alerting rules in /etc/prometheus/alerts.yml:
groups:
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"
      # Low disk space
      - alert: LowDiskSpace
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100) > 85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is above 85% (current value: {{ $value }}%)"
      # High memory usage
      - alert: HighMemoryUsage
        expr: 100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% (current value: {{ $value }}%)"
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
          description: "The service has been down for more than 1 minute"
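The for: clause in each rule is what keeps brief spikes from paging anyone: an alert stays pending until its expression has matched continuously for that duration. The following simulation is a simplified sketch of that behavior for a single series; alert_states is a hypothetical helper, and it leaves out features such as keep_firing_for.

```python
def alert_states(expr_results, eval_interval, hold_for):
    """Return the alert state after each rule evaluation.

    expr_results: one boolean per evaluation, True when the expression matched.
    eval_interval: seconds between evaluations.
    hold_for: the rule's for: duration, in seconds.
    """
    states = []
    active_since = None  # time at which the current matching streak began
    for i, matched in enumerate(expr_results):
        now = i * eval_interval
        if not matched:
            # Any non-matching evaluation resets the streak.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= hold_for
        states.append("firing" if held_long_enough else "pending")
    return states


# ServiceDown: evaluated every 30 s with for: 1m (60 s).
timeline = alert_states([True, True, True, False, True], 30, 60)
```

In this timeline the alert is pending at 0 s and 30 s, fires at 60 s, drops back to inactive when the target recovers at 90 s, and starts a fresh pending period at 120 s.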
Configure Alertmanager to route notifications to your preferred channels: email, Slack, PagerDuty, or custom webhooks. The key is ensuring critical alerts reach the right people immediately, while lower-severity alerts are grouped and delivered in batches rather than paging anyone.
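If you choose the custom webhook route, the receiving side can apply the same severity split. The sketch below assumes the JSON body that Alertmanager posts to webhook receivers, with an alerts list whose entries carry status, labels, and annotations; the function names and the page-versus-batch policy are illustrative assumptions. Routing inside Alertmanager itself is normally done with its own route configuration, not with code like this.

```python
import json


def split_by_severity(payload):
    """Partition firing alerts into page-now and batch-later summaries."""
    page_now, batch_later = [], []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications need no escalation
        summary = alert.get("annotations", {}).get("summary", "")
        if alert.get("labels", {}).get("severity") == "critical":
            page_now.append(summary)
        else:
            batch_later.append(summary)
    return page_now, batch_later


def handle_webhook_body(raw_body):
    """Decode a webhook request body (bytes or str) and split its alerts."""
    return split_by_severity(json.loads(raw_body))
```

A receiver built this way could forward the first list to an on-call pager and append the second to a digest, matching the severity labels set in alerts.yml.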
What’s Next
You’ve built a foundation for infrastructure observability, but this is just the beginning. As your monitoring matures, consider:
Custom application metrics: Instrument your code to expose business-level metrics—user signups, transaction processing times, error rates by endpoint.
Advanced exporters: Deploy specialized exporters for databases (MySQL, PostgreSQL), message queues (RabbitMQ, Kafka), and cloud services (AWS, GCP, Azure).
Distributed tracing: Add Jaeger or Tempo to trace requests across microservices, understanding exactly where latency occurs in complex transactions.
Long-term storage: Implement Thanos or Cortex for long-term retention of historical metrics, enabling year-over-year comparisons and deep forensic analysis.
The systems you build deserve to be understood. With Prometheus collecting metrics and Grafana visualizing them, you’ve transformed your infrastructure from an opaque black box into a transparent, observable system where every component tells its story in real-time.
Learn, Contribute & Share
This guide has a companion repository with working examples and code samples.