15 Metrics, 8 Alerts: Building a Real-Time Production Dashboard for AI Agents
When your AI agent system orchestrates hundreds of autonomous workers across multiple specialized masters, traditional monitoring approaches break down. You can’t afford to check logs manually when a worker hangs. You can’t wait for user reports when routing cascades degrade. You need production monitoring that’s as intelligent as the agents it watches.
After battling zombie workers, routing failures, and cascading performance issues, we built a comprehensive monitoring system for Cortex. This isn’t theoretical observability—this is battle-tested production infrastructure that keeps a distributed agentic system running 24/7.
The Monitoring Philosophy: Know Everything, Alert on What Matters
Before diving into specific metrics, let’s establish the principle that guides our entire approach: comprehensive collection with intelligent alerting.
Every system event gets captured. Every worker heartbeat is logged. Every routing decision is traced. But we don’t wake engineers for every blip—we alert only when human intervention prevents imminent failure or when patterns indicate systemic issues.
This philosophy emerged from painful experience. Early versions of Cortex generated hundreds of false positives daily. Engineers developed alert fatigue and started ignoring critical warnings. We learned that more alerts doesn’t mean better monitoring—it means worse outcomes.
The 15 Critical Metrics
Our production dashboard tracks these metrics continuously, organized by system layer.
Worker Fleet Health (5 Metrics)
1. Active Worker Count (worker.active_count)
- Type: Gauge
- Target: ≥2 workers always available
- Why it matters: Below 2 workers means no redundancy. A single failure takes down task processing entirely.
- Collection: Heartbeat monitor daemon polls every 30 seconds
- Alert threshold: <2 workers (critical)
2. Worker Failure Rate (worker.failure_rate)
- Type: Counter
- Target: <5 failures per hour
- Why it matters: Persistent failures indicate systemic issues—bad context injection, dependency failures, or resource exhaustion.
- Collection: Aggregated from worker completion events
- Alert threshold: >5/hour (warning), >10/hour (critical)
3. Zombie Worker Count (worker.zombie_count)
- Type: Gauge
- Target: 0 zombies
- Why it matters: Zombie workers consume resources without contributing work. They’re symptomatic of deeper coordination failures.
- Collection: Heartbeat monitor marks workers zombie after 5 minutes without heartbeat (a minimal detection sketch follows this metric group)
- Alert threshold: >10 zombies after auto-cleanup fails (critical)
4. Worker Spawn Time (worker.spawn_time_ms)
- Type: Histogram
- Target: <1000ms average, P95 <2000ms
- Why it matters: Slow spawns cascade into task queue backlogs. If workers can’t spin up quickly, throughput collapses.
- Collection: Tracked from spawn initiation to first heartbeat
- Alert threshold: P95 >3000ms (warning)
5. Task Duration (worker.task_duration_ms)
- Type: Histogram
- Target: P95 <30,000ms (30 seconds)
- Why it matters: Unexplained increases indicate performance degradation—API slowdowns, resource contention, or inefficient routing.
- Collection: End-to-end from task assignment to completion
- Alert threshold: P95 >60,000ms (warning)
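Metrics 1 and 3 fall out of the same heartbeat scan, and the sketch below shows one way that scan could work. It is a minimal illustration, not Cortex's actual daemon: the coordination/heartbeats layout is an assumption, and record_gauge is the collection helper shown later in this post.
# Hypothetical heartbeat scan: derive active and zombie counts from file mtimes
HEARTBEAT_DIR="coordination/heartbeats"   # assumed layout: one file touched per worker
ZOMBIE_AFTER=300                          # seconds without a heartbeat before a worker counts as a zombie
now=$(date +%s); active=0; zombies=0
for hb in "$HEARTBEAT_DIR"/*.heartbeat; do
  [ -e "$hb" ] || continue                                    # no workers registered yet
  last=$(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb")    # mtime on Linux, macOS fallback
  if (( now - last < ZOMBIE_AFTER )); then
    active=$(( active + 1 ))
  else
    zombies=$(( zombies + 1 ))
  fi
done
record_gauge "worker.active_count" "$active" '{"source":"heartbeat-monitor"}' "count"
record_gauge "worker.zombie_count" "$zombies" '{"source":"heartbeat-monitor"}' "count"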
Routing Performance (4 Metrics)
6. Routing Layer Distribution (routing.layer_distribution)
- Type: Counter (by layer)
- Target: Layer 1 (keyword): 30-40%, Layer 2 (semantic): 30-40%, Layer 3 (RAG): 15-25%, Layer 4 (PyTorch): 5-10%
- Why it matters: Imbalanced distribution means routing cascade isn’t optimizing efficiently. Too many Layer 4 hits = wasted compute. Too many Layer 1 hits = oversimplified routing.
- Collection: Every routing decision emits layer used
- Alert threshold: Any layer <10% or >60% of total (info)
7. Routing Latency (routing.latency_ms)
- Type: Histogram (by layer)
- Target: Layer 1 <1ms, Layer 2 10-50ms, Layer 3 50-150ms, Layer 4 100-300ms
- Why it matters: Slow routing adds latency to every task. Layer 1 should be near-instant; slower layers should only activate when needed.
- Collection: Timed per routing decision
- Alert threshold: P95 >300ms (warning)
8. Clarification Rate (routing.clarification_rate)
- Type: Gauge (percentage)
- Target: <3%
- Why it matters: High clarification rates mean routing confidence is too low, and users get frustrated when they're repeatedly asked to rephrase their queries.
- Collection: Percentage of queries routed to CLARIFY agent
- Alert threshold: >5% (warning), >10% (critical)
9. Routing Confidence (routing.confidence_score)
- Type: Histogram
- Target: 80% of decisions >0.8 confidence
- Why it matters: Low confidence means routing is guessing. High confidence means routing matches are strong.
- Collection: Confidence score from each routing decision
- Alert threshold: Average confidence <0.6 for 15 minutes (warning)
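Metric 9's alert condition (average confidence below 0.6 for 15 minutes) is cheap to evaluate directly against raw events. A minimal sketch with jq, assuming routing decisions are logged as JSON lines with epoch-second ts and numeric confidence fields; the file path and field names are assumptions:
# Hypothetical check: average routing confidence over the last 15 minutes
DECISIONS="coordination/observability/metrics/raw/routing-decisions.jsonl"   # assumed path
cutoff=$(( $(date +%s) - 900 ))
avg=$(jq -s --argjson cutoff "$cutoff" '
  [ .[] | select(.ts >= $cutoff) | .confidence ]
  | if length == 0 then 1 else (add / length) end
' "$DECISIONS")
# Warn when the rolling average drops below the 0.6 threshold from metric 9
if awk -v a="$avg" 'BEGIN { exit !(a < 0.6) }'; then
  echo "WARNING: average routing confidence ${avg} over last 15m (threshold: 0.6)"
fi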
System Health (4 Metrics)
10. Event Processing Lag (observability.processing_lag_ms)
- Type: Gauge
- Target: <5000ms
- Why it matters: If event processing falls behind, monitoring data becomes stale. You’re flying blind during incidents.
- Collection: Time delta between event generation and indexing
- Alert threshold: >30,000ms (warning)
11. Heartbeat Miss Rate (heartbeat.miss_rate)
- Type: Gauge (percentage)
- Target: <5%
- Why it matters: Missed heartbeats precede worker failures. Rising miss rates predict imminent zombie workers.
- Collection: Percentage of expected heartbeats not received
- Alert threshold: >10% (warning), >25% (critical)
12. Daemon Health (daemon.health_status)
- Type: Gauge (binary: up/down per daemon)
- Target: All daemons up
- Why it matters: Core daemons (heartbeat monitor, observability hub, metrics aggregator) are single points of failure. If they’re down, monitoring goes dark.
- Collection: Process health checks every 60 seconds
- Alert threshold: Any daemon down (critical)
13. Failure Pattern Count (failure_pattern.active_patterns)
- Type: Counter
- Target: 0 high-severity patterns
- Why it matters: Recurring failure patterns indicate unresolved systemic issues. Same error repeating = tech debt.
- Collection: Pattern detection daemon analyzes failure events
- Alert threshold: >3 high-severity patterns (warning)
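Metric 13's pattern detection does not need to start sophisticated. As a minimal sketch (the failure log path and error_code field are assumptions), grouping today's failures by error code already surfaces recurring patterns:
# Hypothetical pattern scan: error codes recurring more than THRESHOLD times today
FAILURES="coordination/observability/metrics/raw/failures-$(date +%F).jsonl"   # assumed path
THRESHOLD=5
jq -r '.error_code' "$FAILURES" 2>/dev/null \
  | sort | uniq -c | sort -rn \
  | awk -v t="$THRESHOLD" '$1 >= t { printf "pattern: %s (%d occurrences)\n", $2, $1 }'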
Performance Metrics (2 Metrics)
14. Mean Time to Recovery (MTTR) (dora.mttr_minutes)
- Type: Histogram
- Target: <60 minutes (DORA elite: <1 hour; high: <1 day)
- Why it matters: MTTR measures how quickly you recover from incidents. Long MTTR = extended downtime.
- Collection: Time from failure detection to resolution
- Alert threshold: MTTR doubling baseline (warning)
15. Change Failure Rate (dora.change_failure_rate)
- Type: Gauge (percentage)
- Target: <5%
- Why it matters: High failure rates mean deployments break production, which creates fear of shipping.
- Collection: Failed deployments / total deployments
- Alert threshold: >10% (warning), >15% (critical)
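Metric 15 is a single division over the deployment log. A minimal sketch, assuming deployments are appended as JSON lines with a status field of "success" or "failed" (the path and field values are illustrative):
# Hypothetical change failure rate: failed deployments / total deployments
DEPLOYS="coordination/observability/metrics/raw/deployments.jsonl"   # assumed path
rate=$(jq -s '
  (length) as $total
  | ([ .[] | select(.status == "failed") ] | length) as $failed
  | if $total == 0 then 0 else ($failed / $total * 100) end
' "$DEPLOYS")
printf 'change failure rate: %.1f%%\n' "$rate"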
The 8 Essential Alerts
Metrics are data. Alerts are action. Here are the 8 alerts that wake engineers (and why each one matters).
Critical Alerts (Immediate Response)
Alert 1: Zombie Threshold Exceeded
- Trigger: worker.zombie_count > 10
- Severity: High
- SLA: 30 minutes
- Why: 10+ zombies means worker spawning is fundamentally broken. Task processing is grinding to a halt.
- Action: Auto-cleanup daemon kills zombies, attempts respawn. Alert escalates if cleanup fails.
- Example: “Zombie threshold exceeded: 44 zombies detected (threshold: 10)”
Alert 2: Worker Daemon Down
- Trigger: Worker daemon process not found for 120 seconds
- Severity: High
- SLA: 30 minutes
- Why: No daemon = no worker management = no task processing.
- Action: Attempt auto-restart. If restart fails, page on-call.
- Example: “Worker daemon process not found”
Alert 3: Observability Hub Offline
- Trigger: Observability hub daemon unreachable for 120 seconds
- Severity: High
- SLA: 30 minutes
- Why: If monitoring is down, you can’t see other failures. Monitoring blindness is catastrophic.
- Action: Restart observability hub. Flush backlog of unprocessed events.
- Example: “ObservabilityHub daemon heartbeat stale”
Alert 4: Routing Cascade Degraded
- Trigger: routing.latency_ms P95 > 300ms for 5 minutes
- Severity: Warning (escalates to high after 15 minutes)
- SLA: 1 hour
- Why: Slow routing cascades into slow task processing. User experience degrades.
- Action: Check semantic search service health, verify RAG index freshness, inspect PyTorch model load.
- Example: “P95 routing latency exceeds 300ms - investigate cascade performance”
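Alert 4's condition reduces to a percentile over a five-minute window of raw latency samples. A minimal sketch, assuming latencies are logged as JSON lines with ts and value_ms fields (the path and field names are assumptions):
# Hypothetical Alert 4 check: P95 routing latency over the last 5 minutes
LATENCIES="coordination/observability/metrics/raw/routing-latency.jsonl"   # assumed path
cutoff=$(( $(date +%s) - 300 ))
p95=$(jq -s --argjson cutoff "$cutoff" '
  [ .[] | select(.ts >= $cutoff) | .value_ms ] | sort
  | if length == 0 then 0 else .[ (length * 0.95 | floor) ] end
' "$LATENCIES")
if awk -v p="$p95" 'BEGIN { exit !(p > 300) }'; then
  echo "ALERT[warning]: P95 routing latency ${p95}ms exceeds 300ms over last 5m"
fi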
Warning Alerts (Proactive Investigation)
Alert 5: High Clarification Rate
- Trigger: routing.clarification_rate > 5% over 5-minute window
- Severity: Warning
- SLA: 4 hours
- Why: Users are getting frustrated with ambiguous responses. Routing confidence is too low.
- Action: Review recent clarification queries, retrain routing models if pattern emerges.
- Example: “Clarification rate exceeds 5% - routing cascade may need tuning”
Alert 6: Worker Failure Rate Spike
- Trigger: worker.failure_rate > 5/hour for 15 minutes
- Severity: Warning (escalates to high if sustained >1 hour)
- SLA: 4 hours
- Why: Isolated failures are normal. Sustained failures indicate systemic issues.
- Action: Run failure pattern analysis, check for common error codes, review recent deployments.
- Example: “Worker failure rate exceeds 5/hour - check failure patterns”
Alert 7: Heartbeat Miss Rate Rising
- Trigger: heartbeat.miss_rate > 10% over 10-minute window
- Severity: Warning (predicts imminent zombie surge)
- SLA: 4 hours
- Why: Early warning signal. Heartbeats degrade before workers zombie out.
- Action: Investigate network issues, check worker resource consumption, verify heartbeat daemon health.
- Example: “Heartbeat miss rate 15% - workers may become unresponsive”
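Alert 7's miss rate is a ratio of heartbeats received to heartbeats expected. A minimal sketch with a 30-second interval, assuming heartbeat events are logged with a ts field (the path, field, and argument handling are illustrative):
# Hypothetical Alert 7 check: heartbeat miss rate over a 10-minute window
HEARTBEATS="coordination/observability/metrics/raw/heartbeats.jsonl"   # assumed path
ACTIVE_WORKERS=${1:-10}    # current active worker count, passed in by the caller
INTERVAL=30; WINDOW=600
cutoff=$(( $(date +%s) - WINDOW ))
expected=$(( ACTIVE_WORKERS * WINDOW / INTERVAL ))
(( expected > 0 )) || exit 0
received=$(jq -s --argjson cutoff "$cutoff" '[ .[] | select(.ts >= $cutoff) ] | length' "$HEARTBEATS")
miss_rate=$(awk -v e="$expected" -v r="$received" 'BEGIN { printf "%.1f", (e - r) / e * 100 }')
if awk -v m="$miss_rate" 'BEGIN { exit !(m > 10) }'; then
  echo "ALERT[warning]: heartbeat miss rate ${miss_rate}% exceeds 10% over last 10m (expected ${expected}, received ${received})"
fi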
Informational Alerts (Track and Document)
Alert 8: Routing Layer Imbalance
- Trigger: Keyword layer handling <20% or >60% over 15 minutes
- Severity: Info
- SLA: 24 hours
- Why: Not urgent, but indicates routing cascade may need rebalancing.
- Action: Log for weekly review, check if query patterns have shifted.
- Example: “Keyword layer handling unusual percentage of queries - check routing patterns”
Real-Time Dashboard Architecture
Metrics and alerts flow through a three-layer architecture optimized for speed and reliability.
Layer 1: Collection (Zero-Overhead Instrumentation)
Design Principle: Metrics collection must add <5ms overhead per operation.
# Example: Recording worker spawn
record_gauge "worker.active_count" "$WORKER_COUNT" \
    '{"worker_type":"implementation"}' "count"

# Example: Recording routing decision
record_histogram "routing.latency_ms" "$DURATION" \
    '{"layer":"semantic","confidence":"0.87"}' "milliseconds"
Events are written to append-only JSONL files, partitioned by date. No blocking writes, no external dependencies.
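The record_gauge and record_histogram helpers themselves aren't reproduced in this post. As a minimal sketch of what the gauge helper could look like under those constraints (append-only, date-partitioned, no network calls), with the exact event schema being an assumption:
# Hypothetical record_gauge: append one JSON line per observation, nothing else
METRICS_DIR="coordination/observability/metrics/raw"
record_gauge() {
  local name="$1" value="$2" labels="$3" unit="$4"   # value must be numeric, labels must be a JSON object
  local ts file
  ts=$(date +%s)
  file="$METRICS_DIR/metrics-$(date +%F).jsonl"      # partitioned by date
  mkdir -p "$METRICS_DIR"
  # A single printf keeps the append line-atomic enough for line-oriented readers
  printf '{"ts":%s,"metric":"%s","type":"gauge","value":%s,"labels":%s,"unit":"%s"}\n' \
    "$ts" "$name" "$value" "$labels" "$unit" >> "$file"
}
A record_histogram helper would differ only in the type field; percentile math happens later, in the aggregation daemons.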
File Structure:
coordination/observability/metrics/
├── raw/
│ ├── metrics-2025-12-12.jsonl # Today's raw metrics
│ └── metrics-2025-12-11.jsonl # Yesterday (compressed)
├── aggregated/
│ ├── hourly-2025-12-12.jsonl # Hourly rollups
│ └── daily-2025-12-11.jsonl # Daily rollups
└── indices/
├── by-trace-id.json # Fast trace lookup
└── by-worker-id.json # Fast worker lookup
Layer 2: Aggregation (Real-Time Processing)
Three daemons process metrics continuously:
Observability Hub Daemon (observability-hub-daemon.sh)
- Polls all-events.jsonl every 10 seconds
- Builds correlation indices (trace_id → events, worker_id → events)
- Generates real-time summaries
- Processing lag target: <5 seconds
Heartbeat Monitor Daemon (heartbeat-monitor-daemon.sh)
- Polls active workers every 30 seconds
- Detects missed heartbeats (warning @ 60s, critical @ 120s)
- Marks zombies after 300s
- Triggers auto-cleanup for zombies
Metrics Aggregator Daemon (metrics-aggregator-daemon.sh)
- Creates hourly rollups every hour
- Creates daily rollups at midnight
- Cleans up old raw data (30-day retention)
- Keeps aggregated data indefinitely
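To make the aggregator's job concrete (the real daemon isn't reproduced here), a minimal hourly rollup over the raw JSONL could look like this, assuming the event schema from the collection sketch above:
# Hypothetical hourly rollup: count, average, and max per metric for the past hour
RAW="coordination/observability/metrics/raw/metrics-$(date +%F).jsonl"
OUT="coordination/observability/metrics/aggregated/hourly-$(date +%F).jsonl"
cutoff=$(( $(date +%s) - 3600 ))
mkdir -p "$(dirname "$OUT")"
jq -c -s --argjson cutoff "$cutoff" '
  [ .[] | select(.ts >= $cutoff) ]
  | group_by(.metric)[]
  | { metric: .[0].metric,
      window_start: $cutoff,
      count: length,
      avg: ([ .[].value ] | add / length),
      max: ([ .[].value ] | max) }
' "$RAW" >> "$OUT"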
Layer 3: Visualization (Elastic APM + Kibana)
Dashboard queries run against Elastic APM indices. Key visualizations:
Routing Cascade Performance Panel
- Pie chart: Layer distribution (actual vs target)
- Bar chart: Latency by layer (with threshold lines)
- Line graph: Confidence scores over time
- Metric: Clarification rate with alert threshold
Worker Fleet Health Panel
- Gauge: Active worker count (red if <2)
- Timeline: Worker spawn events
- Heatmap: Task duration distribution
- Table: Zombie workers (live update)
System Health Panel
- Status indicators: All daemon health (green/red)
- Line graph: Event processing lag
- Counter: Total events processed today
- Histogram: MTTR distribution
Failure Analysis Panel
- Top 10 error codes
- Failure pattern timeline
- Worker failure rate trend
- Auto-fix success rate
When to Alert vs When to Just Monitor
Not every metric deserves an alert. Here’s our decision framework:
Alert if:
- Imminent user-facing impact (routing down, workers exhausted)
- Data loss risk (observability hub offline, events unprocessed)
- Security concern (unexpected daemon restart, auth failures)
- Sustained degradation (15+ minutes of poor performance)
Monitor but don’t alert if:
- Transient spikes (single worker failure, momentary latency)
- Self-healing issues (zombie cleanup successful, daemon auto-restart)
- Informational trends (routing layer rebalancing, seasonal patterns)
- Below-threshold degradation (P95 latency 250ms when threshold is 300ms)
Real Example: Early versions alerted on every zombie worker, and engineers got 50+ alerts per day. We changed to alerting only when the zombie count exceeds 10 and auto-cleanup fails. Alert volume dropped 90% and response time improved.
Alert Fatigue Prevention: Lessons from Production
Lesson 1: Alerts Must Be Actionable
Bad alert: “Worker failure rate elevated”
Good alert: “Worker failure rate 12/hour (threshold: 5) - Context injection errors detected. Runbook: docs/runbooks/context-injection-failure.md”
Lesson 2: Auto-Remediation Before Alerts
Before alerting on zombie workers, the auto-cleanup daemon attempts to kill and respawn them. 80% of zombies resolve automatically. Alerts fire only when auto-remediation fails.
Lesson 3: Tiered Severity with SLAs
Critical (30min SLA) → High (4hr SLA) → Warning (24hr SLA) → Info (weekly review). This tiering respects on-call engineers' time while maintaining system health.
Lesson 4: Escalation Paths
Warnings escalate to high severity after a sustained duration. This catches slow-burn issues before they become incidents.
Lesson 5: Alert Consolidation
Instead of alerting on 10 individual zombie workers, one alert covers “zombie threshold exceeded” with details. This reduces noise and increases signal.
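As a sketch of what consolidation can look like in code (the function name, threshold handling, and runbook path are placeholders), the whole batch becomes one message:
# Hypothetical consolidated zombie alert: one message for the whole batch,
# not one alert per zombie worker
notify_zombie_batch() {
  local threshold=10
  local zombies=("$@")                      # worker IDs collected by the zombie scan
  (( ${#zombies[@]} > threshold )) || return 0
  printf 'Zombie threshold exceeded: %d zombies detected (threshold: %d). Workers: %s. Runbook: docs/runbooks/zombie-workers.md\n' \
    "${#zombies[@]}" "$threshold" "${zombies[*]}"
}
A scan would call this once per cycle with whatever zombie list it found, so even the 44-zombie incident quoted earlier would produce exactly one page.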
Integration with Existing Tools
Elastic APM Integration
Cortex instruments routing decisions as APM transactions:
// Routing decision becomes an APM transaction
// (the elastic-apm-node agent is started once at process boot)
const apm = require('elastic-apm-node');
apm.startTransaction('route_query', 'routing');
apm.setLabel('routing.layer', 'semantic');
apm.setLabel('routing.confidence', confidence);
apm.setLabel('routing.agent', routedAgent);
// ... routing logic ...
apm.endTransaction();
Benefits:
- Distributed tracing across master → worker → task chains
- Visual service maps showing routing cascade flow
- Built-in latency percentile calculations
- No custom aggregation code needed for basic metrics
Kibana Dashboard Setup
Import config/elastic-apm-dashboard.json for pre-built visualizations. Key queries:
Active Worker Count:
GET apm-*/_search
{
"query": {
"bool": {
"must": [
{"term": {"labels.worker.state": "active"}},
{"range": {"@timestamp": {"gte": "now-5m"}}}
]
}
},
"aggs": {
"worker_count": {"cardinality": {"field": "labels.worker.id"}}
}
}
Routing Latency Timeline:
labels.routing.layer: *
| timechart p50, p95, p99 by labels.routing.layer
Cost Implications
Elastic APM Costs (production scale):
- Events: ~50,000/day
- Storage: ~500MB/day compressed
- Retention: 30 days hot, 90 days warm
- Estimated cost: $150-200/month (Elastic Cloud)
Alternative: File-based observability (current implementation) costs $0 but requires more operational overhead. Trade-off: $200/month vs 10 hours/month engineer time managing custom dashboards.
Our Choice: Hybrid approach. Critical metrics flow through Elastic APM for real-time dashboards. Historical analysis and custom queries use file-based observability. Best of both worlds.
Cost vs Visibility Trade-offs
The Observability Budget
Every metric has a cost:
- Storage cost: ~1KB per event × 50k events/day = 50MB/day = $5/month storage
- Processing cost: Aggregation daemons consume ~2% CPU = negligible on modern hardware
- Query cost: Elastic APM charges per query. Heavy dashboard usage = higher bills.
- Engineer cost: Custom metrics require maintenance. Standard metrics are “free” (built into APM).
Decision Framework: High-traffic paths (routing, worker spawning) get instrumented heavily. Low-traffic paths (daemon restarts, manual interventions) rely on logs.
Sampling for High-Volume Metrics
At scale (>1M events/day), sampling becomes necessary:
- Worker heartbeats: Sample 10% (every 10th heartbeat logged)
- Routing decisions: Sample 100% for first month, 50% after patterns established
- Task completions: Always 100% (critical for billing/SLA tracking)
- Daemon health checks: Sample 10% (reduce noise in indices)
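For the 10% sample rates above, the decision needs to be cheap and deterministic so individual workers aren't systematically over- or under-sampled. A minimal sketch (the sample_heartbeat name, cksum-based offset, and metric name are illustrative):
# Hypothetical 1-in-10 heartbeat sampling: keep every 10th heartbeat per worker,
# offset by a stable hash of the worker ID so workers don't all log on the same tick
SAMPLE_RATE=10
sample_heartbeat() {
  local worker_id="$1" heartbeat_seq="$2"   # heartbeat_seq increases monotonically per worker
  local offset
  offset=$(( $(printf '%s' "$worker_id" | cksum | awk '{print $1}') % SAMPLE_RATE ))
  (( heartbeat_seq % SAMPLE_RATE == offset ))
}
# Only record the metric when the sampler says so
if sample_heartbeat "worker-17" 42; then
  record_gauge "heartbeat.received" 1 '{"worker_id":"worker-17"}' "count"
fi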
Rule of thumb: If a metric doesn’t inform a decision or alert, stop collecting it. Observability for observability’s sake is waste.
The Vendor Lock-In Question
Elastic APM is convenient but proprietary. Our mitigation:
- Dual-write strategy: Critical metrics write to both Elastic APM and local JSONL files (a minimal sketch follows this list)
- Standard formats: Use OpenTelemetry for instrumentation (can swap backends)
- Export capability: Daily jobs export Elastic data to S3 (disaster recovery + cost reduction)
- Gradual migration path: Can move to Prometheus/Grafana if Elastic costs spike
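To make the dual-write strategy concrete: the local JSONL append always happens, and the Elastic write is best-effort so an APM outage never blocks instrumentation. This is a minimal sketch; the endpoint, index name, and API key variable are all assumptions:
# Hypothetical dual-write: local JSONL is the source of truth, Elastic is best-effort
LOCAL_FILE="coordination/observability/metrics/raw/metrics-$(date +%F).jsonl"
ES_URL="${ES_URL:-https://elastic.example.com:9200}"   # assumed endpoint
ES_INDEX="cortex-metrics"                              # assumed index name
dual_write_metric() {
  local event_json="$1"
  # 1. Local append: never blocks on the network
  printf '%s\n' "$event_json" >> "$LOCAL_FILE"
  # 2. Elastic write: short timeout, failure is logged and tolerated
  curl -sS --max-time 2 -X POST "$ES_URL/$ES_INDEX/_doc" \
       -H "Content-Type: application/json" \
       -H "Authorization: ApiKey ${ES_API_KEY:-}" \
       -d "$event_json" >/dev/null \
    || echo "dual-write: Elastic unreachable, event kept locally" >&2
}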
Production Lessons: What We Learned the Hard Way
Lesson 1: Zombie workers are inevitable. Early versions tried to prevent zombies through perfect coordination. Waste of time. Instead, detect and cleanup zombies aggressively. 5-minute zombie threshold works well.
Lesson 2: Routing cascade needs constant tuning. Query patterns shift. New agents get added. Confidence thresholds drift. Weekly reviews of routing layer distribution prevent slow degradation.
Lesson 3: Daemon health is a single point of failure. If the heartbeat monitor dies, zombie workers accumulate undetected. A daemon that monitors the monitoring daemons is essential (systemd/launchd auto-restart).
Lesson 4: Alerts must include runbooks. “Worker failure rate elevated” is useless without next steps. Every alert links to runbook with investigation steps and common fixes.
Lesson 5: Historical data is debugging gold. When investigating a production incident, being able to query “show me all routing decisions with confidence <0.5 in the last hour” is invaluable. Invest in queryable observability.
Wrapping Up: Observability as a Product
The best monitoring system is the one engineers actually use. Our dashboard gets checked 20+ times per day. Alerts get investigated within SLA 95% of the time. When incidents happen, MTTR averages 23 minutes (down from 4 hours pre-monitoring).
This didn’t happen by accident. We treated observability as a product:
- Fast queries (<1 second response time)
- Intuitive visualizations (no training required)
- Actionable alerts (clear next steps)
- Low false-positive rate (<2% of alerts are noise)
If you’re building a distributed AI agent system, don’t bolt on monitoring as an afterthought. Design observability into the architecture from day one. Your future self—debugging a production incident at 2 AM—will thank you.
Next in the series: “Automatic Recovery: How Cortex Heals Itself Without Human Intervention” - deep dive into auto-remediation patterns that reduce MTTR to minutes.