15 Metrics, 8 Alerts: Building a Real-Time Production Dashboard for AI Agents
When your AI agent system orchestrates hundreds of autonomous workers across multiple specialized masters, traditional monitoring approaches break down. You can’t afford to check logs manually when a worker hangs. You can’t wait for user reports when routing cascades degrade. You need production monitoring that’s as intelligent as the agents it watches.
After battling zombie workers, routing failures, and cascading performance issues, we built a comprehensive monitoring system for Cortex. This isn’t theoretical observability—this is battle-tested production infrastructure that keeps a distributed agentic system running 24/7.
The Monitoring Philosophy: Know Everything, Alert on What Matters
Before diving into specific metrics, let’s establish the principle that guides our entire approach: comprehensive collection with intelligent alerting.
Every system event gets captured. Every worker heartbeat is logged. Every routing decision is traced. But we don’t wake engineers for every blip—we alert only when human intervention prevents imminent failure or when patterns indicate systemic issues.
This philosophy emerged from painful experience. Early versions of Cortex generated hundreds of false positives daily. Engineers developed alert fatigue and started ignoring critical warnings. We learned that more alerts doesn’t mean better monitoring—it means worse outcomes.
The 15 Critical Metrics
Our production dashboard tracks these metrics continuously, organized by system layer.
Worker Fleet Health (5 Metrics)
1. Active Worker Count (worker.active_count)
- Type: Gauge
- Target: ≥2 workers always available
- Why it matters: Below 2 workers means no redundancy. A single failure takes down task processing entirely.
- Collection: Heartbeat monitor daemon polls every 30 seconds
- Alert threshold: <2 workers (critical)
2. Worker Failure Rate (worker.failure_rate)
- Type: Counter
- Target: <5 failures per hour
- Why it matters: Persistent failures indicate systemic issues—bad context injection, dependency failures, or resource exhaustion.
- Collection: Aggregated from worker completion events
- Alert threshold: >5/hour (warning), >10/hour (critical)
3. Zombie Worker Count (worker.zombie_count)
- Type: Gauge
- Target: 0 zombies
- Why it matters: Zombie workers consume resources without contributing work. They’re symptomatic of deeper coordination failures.
- Collection: Heartbeat monitor marks workers zombie after 5 minutes without heartbeat (a minimal detection sketch follows this metric group)
- Alert threshold: >10 zombies after auto-cleanup fails (critical)
4. Worker Spawn Time (worker.spawn_time_ms)
- Type: Histogram
- Target: <1000ms average, P95 <2000ms
- Why it matters: Slow spawns cascade into task queue backlogs. If workers can’t spin up quickly, throughput collapses.
- Collection: Tracked from spawn initiation to first heartbeat
- Alert threshold: P95 >3000ms (warning)
5. Task Duration (worker.task_duration_ms)
- Type: Histogram
- Target: P95 <30,000ms (30 seconds)
- Why it matters: Unexplained increases indicate performance degradation—API slowdowns, resource contention, or inefficient routing.
- Collection: End-to-end from task assignment to completion
- Alert threshold: P95 >60,000ms (warning)
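Metrics 1 and 3 fall out of the same heartbeat scan, and the sketch below shows one way that scan could work. It is a minimal illustration, not Cortex's actual daemon: the coordination/heartbeats layout is an assumption, and record_gauge is the collection helper shown later in this post.
# Hypothetical heartbeat scan: derive active and zombie counts from file mtimes
HEARTBEAT_DIR="coordination/heartbeats"   # assumed layout: one file touched per worker
ZOMBIE_AFTER=300                          # seconds without a heartbeat before a worker counts as a zombie
now=$(date +%s); active=0; zombies=0
for hb in "$HEARTBEAT_DIR"/*.heartbeat; do
  [ -e "$hb" ] || continue                                    # no workers registered yet
  last=$(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb")    # mtime on Linux, macOS fallback
  if (( now - last < ZOMBIE_AFTER )); then
    active=$(( active + 1 ))
  else
    zombies=$(( zombies + 1 ))
  fi
done
record_gauge "worker.active_count" "$active" '{"source":"heartbeat-monitor"}' "count"
record_gauge "worker.zombie_count" "$zombies" '{"source":"heartbeat-monitor"}' "count"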
Routing Performance (4 Metrics)
6. Routing Layer Distribution (routing.layer_distribution)
- Type: Counter (by layer)
- Target: Layer 1 (keyword): 30-40%, Layer 2 (semantic): 30-40%, Layer 3 (RAG): 15-25%, Layer 4 (PyTorch): 5-10%
- Why it matters: Imbalanced distribution means routing cascade isn’t optimizing efficiently. Too many Layer 4 hits = wasted compute. Too many Layer 1 hits = oversimplified routing.
- Collection: Every routing decision emits layer used
- Alert threshold: Any layer <10% or >60% of total (info)
7. Routing Latency (routing.latency_ms)
- Type: Histogram (by layer)
- Target: Layer 1 <1ms, Layer 2 10-50ms, Layer 3 50-150ms, Layer 4 100-300ms
- Why it matters: Slow routing adds latency to every task. Layer 1 should be near-instant; slower layers should only activate when needed.
- Collection: Timed per routing decision
- Alert threshold: P95 >300ms (warning)
8. Clarification Rate (routing.clarification_rate)
- Type: Gauge (percentage)
- Target: <3%
- Why it matters: High clarification rates mean routing confidence is too low, and users get frustrated when they're repeatedly asked to rephrase their queries.
- Collection: Percentage of queries routed to CLARIFY agent
- Alert threshold: >5% (warning), >10% (critical)
9. Routing Confidence (routing.confidence_score)
- Type: Histogram
- Target: 80% of decisions >0.8 confidence
- Why it matters: Low confidence means routing is guessing. High confidence means routing matches are strong.
- Collection: Confidence score from each routing decision
- Alert threshold: Average confidence <0.6 for 15 minutes (warning)
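Metric 9's alert condition (average confidence below 0.6 for 15 minutes) is cheap to evaluate directly against raw events. A minimal sketch with jq, assuming routing decisions are logged as JSON lines with epoch-second ts and numeric confidence fields; the file path and field names are assumptions:
# Hypothetical check: average routing confidence over the last 15 minutes
DECISIONS="coordination/observability/metrics/raw/routing-decisions.jsonl"   # assumed path
cutoff=$(( $(date +%s) - 900 ))
avg=$(jq -s --argjson cutoff "$cutoff" '
  [ .[] | select(.ts >= $cutoff) | .confidence ]
  | if length == 0 then 1 else (add / length) end
' "$DECISIONS")
# Warn when the rolling average drops below the 0.6 threshold from metric 9
if awk -v a="$avg" 'BEGIN { exit !(a < 0.6) }'; then
  echo "WARNING: average routing confidence ${avg} over last 15m (threshold: 0.6)"
fi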
System Health (4 Metrics)
10. Event Processing Lag (observability.processing_lag_ms)
- Type: Gauge
- Target: <5000ms
- Why it matters: If event processing falls behind, monitoring data becomes stale. You’re flying blind during incidents.
- Collection: Time delta between event generation and indexing
- Alert threshold: >30,000ms (warning)
11. Heartbeat Miss Rate (heartbeat.miss_rate)
- Type: Gauge (percentage)
- Target: <5%
- Why it matters: Missed heartbeats precede worker failures. Rising miss rates predict imminent zombie workers.
- Collection: Percentage of expected heartbeats not received
- Alert threshold: >10% (warning), >25% (critical)
12. Daemon Health (daemon.health_status)
- Type: Gauge (binary: up/down per daemon)
- Target: All daemons up
- Why it matters: Core daemons (heartbeat monitor, observability hub, metrics aggregator) are single points of failure. If they’re down, monitoring goes dark.
- Collection: Process health checks every 60 seconds
- Alert threshold: Any daemon down (critical)
13. Failure Pattern Count (failure_pattern.active_patterns)
- Type: Counter
- Target: 0 high-severity patterns
- Why it matters: Recurring failure patterns indicate unresolved systemic issues. Same error repeating = tech debt.
- Collection: Pattern detection daemon analyzes failure events
- Alert threshold: >3 high-severity patterns (warning)
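Metric 13's pattern detection does not need to start sophisticated. As a minimal sketch (the failure log path and error_code field are assumptions), grouping today's failures by error code already surfaces recurring patterns:
# Hypothetical pattern scan: error codes recurring more than THRESHOLD times today
FAILURES="coordination/observability/metrics/raw/failures-$(date +%F).jsonl"   # assumed path
THRESHOLD=5
jq -r '.error_code' "$FAILURES" 2>/dev/null \
  | sort | uniq -c | sort -rn \
  | awk -v t="$THRESHOLD" '$1 >= t { printf "pattern: %s (%d occurrences)\n", $2, $1 }'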
Performance Metrics (2 Metrics)
14. Mean Time to Recovery (MTTR) (dora.mttr_minutes)
- Type: Histogram
- Target: <60 minutes (DORA elite: <1 hour; high: <1 day)
- Why it matters: MTTR measures how quickly you recover from incidents. Long MTTR = extended downtime.
- Collection: Time from failure detection to resolution
- Alert threshold: MTTR doubling baseline (warning)
15. Change Failure Rate (dora.change_failure_rate)
- Type: Gauge (percentage)
- Target: <5%
- Why it matters: High failure rates mean deployments break production, which creates fear of shipping.
- Collection: Failed deployments / total deployments
- Alert threshold: >10% (warning), >15% (critical)
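Metric 15 is a single division over the deployment log. A minimal sketch, assuming deployments are appended as JSON lines with a status field of "success" or "failed" (the path and field values are illustrative):
# Hypothetical change failure rate: failed deployments / total deployments
DEPLOYS="coordination/observability/metrics/raw/deployments.jsonl"   # assumed path
rate=$(jq -s '
  (length) as $total
  | ([ .[] | select(.status == "failed") ] | length) as $failed
  | if $total == 0 then 0 else ($failed / $total * 100) end
' "$DEPLOYS")
printf 'change failure rate: %.1f%%\n' "$rate"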
The 8 Essential Alerts
Metrics are data. Alerts are action. Here are the 8 alerts that wake engineers (and why each one matters).
Critical Alerts (Immediate Response)
Alert 1: Zombie Threshold Exceeded
- Trigger: worker.zombie_count > 10
- Severity: High
- SLA: 30 minutes
- Why: 10+ zombies means worker spawning is fundamentally broken. Task processing is grinding to a halt.
- Action: Auto-cleanup daemon kills zombies, attempts respawn. Alert escalates if cleanup fails.
- Example: “Zombie threshold exceeded: 44 zombies detected (threshold: 10)”
Alert 2: Worker Daemon Down
- Trigger: Worker daemon process not found for 120 seconds
- Severity: High
- SLA: 30 minutes
- Why: No daemon = no worker management = no task processing.
- Action: Attempt auto-restart. If restart fails, page on-call.
- Example: “Worker daemon process not found”
Alert 3: Observability Hub Offline
- Trigger: Observability hub daemon unreachable for 120 seconds
- Severity: High
- SLA: 30 minutes
- Why: If monitoring is down, you can’t see other failures. Monitoring blindness is catastrophic.
- Action: Restart observability hub. Flush backlog of unprocessed events.
- Example: “ObservabilityHub daemon heartbeat stale”
Alert 4: Routing Cascade Degraded
- Trigger: routing.latency_ms P95 > 300ms for 5 minutes
- Severity: Warning (escalates to high after 15 minutes)
- SLA: 1 hour
- Why: Slow routing cascades into slow task processing. User experience degrades.
- Action: Check semantic search service health, verify RAG index freshness, inspect PyTorch model load.
- Example: “P95 routing latency exceeds 300ms - investigate cascade performance”
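Alert 4's condition reduces to a percentile over a five-minute window of raw latency samples. A minimal sketch, assuming latencies are logged as JSON lines with ts and value_ms fields (the path and field names are assumptions):
# Hypothetical Alert 4 check: P95 routing latency over the last 5 minutes
LATENCIES="coordination/observability/metrics/raw/routing-latency.jsonl"   # assumed path
cutoff=$(( $(date +%s) - 300 ))
p95=$(jq -s --argjson cutoff "$cutoff" '
  [ .[] | select(.ts >= $cutoff) | .value_ms ] | sort
  | if length == 0 then 0 else .[ (length * 0.95 | floor) ] end
' "$LATENCIES")
if awk -v p="$p95" 'BEGIN { exit !(p > 300) }'; then
  echo "ALERT[warning]: P95 routing latency ${p95}ms exceeds 300ms over last 5m"
fi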
Warning Alerts (Proactive Investigation)
Alert 5: High Clarification Rate
- Trigger: routing.clarification_rate > 5% over 5-minute window
- Severity: Warning
- SLA: 4 hours
- Why: Users are getting frustrated with ambiguous responses. Routing confidence is too low.
- Action: Review recent clarification queries, retrain routing models if pattern emerges.
- Example: “Clarification rate exceeds 5% - routing cascade may need tuning”
Alert 6: Worker Failure Rate Spike
- Trigger: worker.failure_rate > 5/hour for 15 minutes
- Severity: Warning (escalates to high if sustained >1 hour)
- SLA: 4 hours
- Why: Isolated failures are normal. Sustained failures indicate systemic issues.
- Action: Run failure pattern analysis, check for common error codes, review recent deployments.
- Example: “Worker failure rate exceeds 5/hour - check failure patterns”
Alert 7: Heartbeat Miss Rate Rising
- Trigger: heartbeat.miss_rate > 10% over 10-minute window
- Severity: Warning (predicts imminent zombie surge)
- SLA: 4 hours
- Why: Early warning signal. Heartbeats degrade before workers zombie out.
- Action: Investigate network issues, check worker resource consumption, verify heartbeat daemon health.
- Example: “Heartbeat miss rate 15% - workers may become unresponsive”
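Alert 7's miss rate is a ratio of heartbeats received to heartbeats expected. A minimal sketch with a 30-second interval, assuming heartbeat events are logged with a ts field (the path, field, and argument handling are illustrative):
# Hypothetical Alert 7 check: heartbeat miss rate over a 10-minute window
HEARTBEATS="coordination/observability/metrics/raw/heartbeats.jsonl"   # assumed path
ACTIVE_WORKERS=${1:-10}    # current active worker count, passed in by the caller
INTERVAL=30; WINDOW=600
cutoff=$(( $(date +%s) - WINDOW ))
expected=$(( ACTIVE_WORKERS * WINDOW / INTERVAL ))
(( expected > 0 )) || exit 0
received=$(jq -s --argjson cutoff "$cutoff" '[ .[] | select(.ts >= $cutoff) ] | length' "$HEARTBEATS")
miss_rate=$(awk -v e="$expected" -v r="$received" 'BEGIN { printf "%.1f", (e - r) / e * 100 }')
if awk -v m="$miss_rate" 'BEGIN { exit !(m > 10) }'; then
  echo "ALERT[warning]: heartbeat miss rate ${miss_rate}% exceeds 10% over last 10m (expected ${expected}, received ${received})"
fi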
Informational Alerts (Track and Document)
Alert 8: Routing Layer Imbalance
- Trigger: Keyword layer handling <20% or >60% over 15 minutes
- Severity: Info
- SLA: 24 hours
- Why: Not urgent, but indicates routing cascade may need rebalancing.
- Action: Log for weekly review, check if query patterns have shifted.
- Example: “Keyword layer handling unusual percentage of queries - check routing patterns”
Real-Time Dashboard Architecture
Metrics and alerts flow through a three-layer architecture optimized for speed and reliability.
Layer 1: Collection (Zero-Overhead Instrumentation)
Design Principle: Metrics collection must add <5ms overhead per operation.
# Example: Recording worker spawn
record_gauge "worker.active_count" "$WORKER_COUNT" \
    '{"worker_type":"implementation"}' "count"

# Example: Recording routing decision
record_histogram "routing.latency_ms" "$DURATION" \
    '{"layer":"semantic","confidence":"0.87"}' "milliseconds"
Events are written to append-only JSONL files, partitioned by date. No blocking writes, no external dependencies.
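The record_gauge and record_histogram helpers themselves aren't reproduced in this post. As a minimal sketch of what the gauge helper could look like under those constraints (append-only, date-partitioned, no network calls), with the exact event schema being an assumption:
# Hypothetical record_gauge: append one JSON line per observation, nothing else
METRICS_DIR="coordination/observability/metrics/raw"
record_gauge() {
  local name="$1" value="$2" labels="$3" unit="$4"   # value must be numeric, labels must be a JSON object
  local ts file
  ts=$(date +%s)
  file="$METRICS_DIR/metrics-$(date +%F).jsonl"      # partitioned by date
  mkdir -p "$METRICS_DIR"
  # A single printf keeps the append line-atomic enough for line-oriented readers
  printf '{"ts":%s,"metric":"%s","type":"gauge","value":%s,"labels":%s,"unit":"%s"}\n' \
    "$ts" "$name" "$value" "$labels" "$unit" >> "$file"
}
A record_histogram helper would differ only in the type field; percentile math happens later, in the aggregation daemons.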
File Structure:
coordination/observability/metrics/
├── raw/
│ ├── metrics-2025-12-12.jsonl # Today's raw metrics
│ └── metrics-2025-12-11.jsonl # Yesterday (compressed)
├── aggregated/
│ ├── hourly-2025-12-12.jsonl # Hourly rollups
│ └── daily-2025-12-11.jsonl # Daily rollups
└── indices/
├── by-trace-id.json # Fast trace lookup
└── by-worker-id.json # Fast worker lookup
Layer 2: Aggregation (Real-Time Processing)
Three daemons process metrics continuously:
Observability Hub Daemon (observability-hub-daemon.sh)
- Polls all-events.jsonl every 10 seconds
- Builds correlation indices (trace_id → events, worker_id → events)
- Generates real-time summaries
- Processing lag target: <5 seconds
Heartbeat Monitor Daemon (heartbeat-monitor-daemon.sh)
- Polls active workers every 30 seconds
- Detects missed heartbeats (warning @ 60s, critical @ 120s)
- Marks zombies after 300s
- Triggers auto-cleanup for zombies
Metrics Aggregator Daemon (metrics-aggregator-daemon.sh)
- Creates hourly rollups every hour
- Creates daily rollups at midnight
- Cleans up old raw data (30-day retention)
- Keeps aggregated data indefinitely
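To make the aggregator's job concrete (the real daemon isn't reproduced here), a minimal hourly rollup over the raw JSONL could look like this, assuming the event schema from the collection sketch above:
# Hypothetical hourly rollup: count, average, and max per metric for the past hour
RAW="coordination/observability/metrics/raw/metrics-$(date +%F).jsonl"
OUT="coordination/observability/metrics/aggregated/hourly-$(date +%F).jsonl"
cutoff=$(( $(date +%s) - 3600 ))
mkdir -p "$(dirname "$OUT")"
jq -c -s --argjson cutoff "$cutoff" '
  [ .[] | select(.ts >= $cutoff) ]
  | group_by(.metric)[]
  | { metric: .[0].metric,
      window_start: $cutoff,
      count: length,
      avg: ([ .[].value ] | add / length),
      max: ([ .[].value ] | max) }
' "$RAW" >> "$OUT"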
Layer 3: Visualization (Elastic APM + Kibana)
Dashboard queries run against Elastic APM indices. Key visualizations:
Routing Cascade Performance Panel
- Pie chart: Layer distribution (actual vs target)
- Bar chart: Latency by layer (with threshold lines)
- Line graph: Confidence scores over time
- Metric: Clarification rate with alert threshold
Worker Fleet Health Panel
- Gauge: Active worker count (red if <2)
- Timeline: Worker spawn events
- Heatmap: Task duration distribution
- Table: Zombie workers (live update)
System Health Panel
- Status indicators: All daemon health (green/red)
- Line graph: Event processing lag
- Counter: Total events processed today
- Histogram: MTTR distribution
Failure Analysis Panel
- Top 10 error codes
- Failure pattern timeline
- Worker failure rate trend
- Auto-fix success rate
When to Alert vs When to Just Monitor
Not every metric deserves an alert. Here’s our decision framework:
Alert if:
- Imminent user-facing impact (routing down, workers exhausted)
- Data loss risk (observability hub offline, events unprocessed)
- Security concern (unexpected daemon restart, auth failures)
- Sustained degradation (15+ minutes of poor performance)
Monitor but don’t alert if:
- Transient spikes (single worker failure, momentary latency)
- Self-healing issues (zombie cleanup successful, daemon auto-restart)
- Informational trends (routing layer rebalancing, seasonal patterns)
- Below-threshold degradation (P95 latency 250ms when threshold is 300ms)
Real Example: Early versions alerted on every zombie worker, and engineers got 50+ alerts per day. We changed to alerting only when the zombie count exceeds 10 and auto-cleanup fails. Alert volume dropped 90% and response time improved.
Alert Fatigue Prevention: Lessons from Production
Lesson 1: Alerts Must Be Actionable
Bad alert: “Worker failure rate elevated”
Good alert: “Worker failure rate 12/hour (threshold: 5) - Context injection errors detected. Runbook: docs/runbooks/context-injection-failure.md”
Lesson 2: Auto-Remediation Before Alerts
Before alerting on zombie workers, the auto-cleanup daemon attempts to kill and respawn them. 80% of zombies resolve automatically. Alerts fire only when auto-remediation fails.
Lesson 3: Tiered Severity with SLAs
Critical (30min SLA) → High (4hr SLA) → Warning (24hr SLA) → Info (weekly review). This tiering respects on-call engineers' time while maintaining system health.
Lesson 4: Escalation Paths
Warnings escalate to high severity after a sustained duration. This catches slow-burn issues before they become incidents.
Lesson 5: Alert Consolidation
Instead of alerting on 10 individual zombie workers, one alert covers “zombie threshold exceeded” with details. This reduces noise and increases signal.
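As a sketch of what consolidation can look like in code (the function name, threshold handling, and runbook path are placeholders), the whole batch becomes one message:
# Hypothetical consolidated zombie alert: one message for the whole batch,
# not one alert per zombie worker
notify_zombie_batch() {
  local threshold=10
  local zombies=("$@")                      # worker IDs collected by the zombie scan
  (( ${#zombies[@]} > threshold )) || return 0
  printf 'Zombie threshold exceeded: %d zombies detected (threshold: %d). Workers: %s. Runbook: docs/runbooks/zombie-workers.md\n' \
    "${#zombies[@]}" "$threshold" "${zombies[*]}"
}
A scan would call this once per cycle with whatever zombie list it found, so even the 44-zombie incident quoted earlier would produce exactly one page.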
Integration with Existing Tools
Elastic APM Integration
Cortex instruments routing decisions as APM transactions:
// Routing decision becomes an APM transaction
// (the elastic-apm-node agent is started once at process boot)
const apm = require('elastic-apm-node');
apm.startTransaction('route_query', 'routing');
apm.setLabel('routing.layer', 'semantic');
apm.setLabel('routing.confidence', confidence);
apm.setLabel('routing.agent', routedAgent);
// ... routing logic ...
apm.endTransaction();
Benefits:
- Distributed tracing across master → worker → task chains
- Visual service maps showing routing cascade flow
- Built-in latency percentile calculations
- No custom aggregation code needed for basic metrics
Kibana Dashboard Setup
Import config/elastic-apm-dashboard.json for pre-built visualizations. Key queries:
Active Worker Count:
GET apm-*/_search
{
"query": {
"bool": {
"must": [
{"term": {"labels.worker.state": "active"}},
{"range": {"@timestamp": {"gte": "now-5m"}}}
]
}
},
"aggs": {
"worker_count": {"cardinality": {"field": "labels.worker.id"}}
}
}
Routing Latency Timeline:
labels.routing.layer: *
| timechart p50, p95, p99 by labels.routing.layer
Cost Implications
Elastic APM Costs (production scale):
- Events: ~50,000/day
- Storage: ~500MB/day compressed
- Retention: 30 days hot, 90 days warm
- Estimated cost: $150-200/month (Elastic Cloud)
Alternative: File-based observability (current implementation) costs $0 but requires more operational overhead. Trade-off: $200/month vs 10 hours/month engineer time managing custom dashboards.
Our Choice: Hybrid approach. Critical metrics flow through Elastic APM for real-time dashboards. Historical analysis and custom queries use file-based observability. Best of both worlds.
Cost vs Visibility Trade-offs
The Observability Budget
Every metric has a cost:
- Storage cost: ~1KB per event × 50k events/day = 50MB/day = $5/month storage
- Processing cost: Aggregation daemons consume ~2% CPU = negligible on modern hardware
- Query cost: Elastic APM charges per query. Heavy dashboard usage = higher bills.
- Engineer cost: Custom metrics require maintenance. Standard metrics are “free” (built into APM).
Decision Framework: High-traffic paths (routing, worker spawning) get instrumented heavily. Low-traffic paths (daemon restarts, manual interventions) rely on logs.
Sampling for High-Volume Metrics
At scale (>1M events/day), sampling becomes necessary:
- Worker heartbeats: Sample 10% (every 10th heartbeat logged)
- Routing decisions: Sample 100% for first month, 50% after patterns established
- Task completions: Always 100% (critical for billing/SLA tracking)
- Daemon health checks: Sample 10% (reduce noise in indices)
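For the 10% sample rates above, the decision needs to be cheap and deterministic so individual workers aren't systematically over- or under-sampled. A minimal sketch (the sample_heartbeat name, cksum-based offset, and metric name are illustrative):
# Hypothetical 1-in-10 heartbeat sampling: keep every 10th heartbeat per worker,
# offset by a stable hash of the worker ID so workers don't all log on the same tick
SAMPLE_RATE=10
sample_heartbeat() {
  local worker_id="$1" heartbeat_seq="$2"   # heartbeat_seq increases monotonically per worker
  local offset
  offset=$(( $(printf '%s' "$worker_id" | cksum | awk '{print $1}') % SAMPLE_RATE ))
  (( heartbeat_seq % SAMPLE_RATE == offset ))
}
# Only record the metric when the sampler says so
if sample_heartbeat "worker-17" 42; then
  record_gauge "heartbeat.received" 1 '{"worker_id":"worker-17"}' "count"
fi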
Rule of thumb: If a metric doesn’t inform a decision or alert, stop collecting it. Observability for observability’s sake is waste.
The Vendor Lock-In Question
Elastic APM is convenient but proprietary. Our mitigation:
- Dual-write strategy: Critical metrics write to both Elastic APM and local JSONL files (a minimal sketch follows this list)
- Standard formats: Use OpenTelemetry for instrumentation (can swap backends)
- Export capability: Daily jobs export Elastic data to S3 (disaster recovery + cost reduction)
- Gradual migration path: Can move to Prometheus/Grafana if Elastic costs spike
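To make the dual-write strategy concrete: the local JSONL append always happens, and the Elastic write is best-effort so an APM outage never blocks instrumentation. This is a minimal sketch; the endpoint, index name, and API key variable are all assumptions:
# Hypothetical dual-write: local JSONL is the source of truth, Elastic is best-effort
LOCAL_FILE="coordination/observability/metrics/raw/metrics-$(date +%F).jsonl"
ES_URL="${ES_URL:-https://elastic.example.com:9200}"   # assumed endpoint
ES_INDEX="cortex-metrics"                              # assumed index name
dual_write_metric() {
  local event_json="$1"
  # 1. Local append: never blocks on the network
  printf '%s\n' "$event_json" >> "$LOCAL_FILE"
  # 2. Elastic write: short timeout, failure is logged and tolerated
  curl -sS --max-time 2 -X POST "$ES_URL/$ES_INDEX/_doc" \
       -H "Content-Type: application/json" \
       -H "Authorization: ApiKey ${ES_API_KEY:-}" \
       -d "$event_json" >/dev/null \
    || echo "dual-write: Elastic unreachable, event kept locally" >&2
}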
Production Lessons: What We Learned the Hard Way
Lesson 1: Zombie workers are inevitable. Early versions tried to prevent zombies through perfect coordination. Waste of time. Instead, detect and cleanup zombies aggressively. 5-minute zombie threshold works well.
Lesson 2: Routing cascade needs constant tuning. Query patterns shift. New agents get added. Confidence thresholds drift. Weekly reviews of routing layer distribution prevent slow degradation.
Lesson 3: Daemon health is a single point of failure. If the heartbeat monitor dies, zombie workers accumulate undetected. A daemon that monitors the monitoring daemons is essential (systemd/launchd auto-restart).
Lesson 4: Alerts must include runbooks. “Worker failure rate elevated” is useless without next steps. Every alert links to runbook with investigation steps and common fixes.
Lesson 5: Historical data is debugging gold. When investigating a production incident, being able to query “show me all routing decisions with confidence <0.5 in the last hour” is invaluable. Invest in queryable observability.
Wrapping Up: Observability as a Product
The best monitoring system is the one engineers actually use. Our dashboard gets checked 20+ times per day. Alerts get investigated within SLA 95% of the time. When incidents happen, MTTR averages 23 minutes (down from 4 hours pre-monitoring).
This didn’t happen by accident. We treated observability as a product:
- Fast queries (<1 second response time)
- Intuitive visualizations (no training required)
- Actionable alerts (clear next steps)
- Low false-positive rate (<2% of alerts are noise)
If you’re building a distributed AI agent system, don’t bolt on monitoring as an afterthought. Design observability into the architecture from day one. Your future self—debugging a production incident at 2 AM—will thank you.
Next in the series: “Automatic Recovery: How Cortex Heals Itself Without Human Intervention” - deep dive into auto-remediation patterns that reduce MTTR to minutes.