Complete Task Lineage: 18 Event Types That Give You Total Visibility
When you’re running a multi-agent AI system, the question isn’t if something will go wrong—it’s when. And when it does, you need answers fast. Why did this task fail? Which worker processed it? Was it reassigned? Did a handoff happen?
This is where task lineage tracking becomes essential. In Cortex, we built a comprehensive lineage system with 18 distinct event types that capture every state transition, every actor change, and every milestone in a task’s lifecycle. The result? Complete visibility into what happened, who did it, and why—all queryable in under 200ms.
What Is Task Lineage for AI Agents?
Task lineage is the complete audit trail of a task’s journey through your system. For AI agents, this means tracking:
- Task lifecycle: From creation through completion or failure
- Worker execution: Spawning, progress updates, and termination
- State transitions: Blocking, unblocking, escalation, cancellation
- Cross-master handoffs: When tasks move between specialist agents
- Actor accountability: Who (or what) triggered each event
Think of it as Git blame for task execution—every change, every transition, every decision is recorded with full context.
The 18 Event Types: Complete Coverage
Cortex’s lineage system categorizes events into four logical groups, covering every state transition in a task’s lifecycle:
Core Task Lifecycle (6 Events)
These events track the fundamental task journey:
| Event Type | Triggered When | Actor | Key Data |
|---|---|---|---|
| task_created | User or system creates a task | User/System | Priority, metadata |
| task_assigned | Coordinator assigns to a master | Coordinator | Master ID, priority |
| task_started | Master begins execution | Master | Start timestamp |
| task_completed | Task finishes successfully | Master | Deliverables, duration |
| task_failed | Task execution fails | Master/Worker | Error details, stack trace |
| task_cancelled | Task is cancelled | User/System | Cancellation reason |
Example Flow:
task_created → task_assigned → task_started → task_completed
(user) (coordinator) (master) (master)
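Each arrow in that flow appends one lineage record. As a rough sketch (the field names match the schema shown later in this post; the IDs and values are illustrative):

{
  "lineage_id": "lin-20251127-0001",
  "task_id": "task-feature-001",
  "event_type": "task_created",
  "timestamp": "2025-11-27T12:00:00Z",
  "actor": { "type": "user", "id": "user-ryan" },
  "event_data": { "priority": "high" }
}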
Worker Execution (5 Events)
Workers are ephemeral agents spawned to execute specific sub-tasks. These events track their lifecycle:
| Event Type | Triggered When | Key Data |
|---|---|---|
| worker_spawned | Master creates a worker | Worker ID, worker type |
| worker_started | Worker begins execution | Start timestamp |
| worker_progress | Worker reports progress | Progress %, intermediate results |
| worker_completed | Worker finishes successfully | Token usage, deliverables |
| worker_failed | Worker encounters error | Error type, message, recovery hints |
Why Track Workers Separately?
A single task might spawn dozens of workers. Tracking them individually lets you:
- Identify which specific worker failed in a batch
- Measure token consumption per worker type
- Detect performance regressions in specific worker implementations
- Calculate parallel execution efficiency
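For instance, measuring token consumption per worker type is a small aggregation over worker_completed events. A sketch, assuming each event carries event_data.worker_type and event_data.token_usage.total_tokens (adjust the field names to your own schema):

// Sum total tokens per worker type across worker_completed events
function tokensByWorkerType(events) {
  const totals = {};
  for (const e of events) {
    if (e.event_type !== 'worker_completed') continue;
    const type = e.event_data?.worker_type || 'unknown';
    const tokens = e.event_data?.token_usage?.total_tokens || 0;
    totals[type] = (totals[type] || 0) + tokens;
  }
  return totals;
}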
State Transitions (4 Events)
Tasks don’t always follow a linear path. These events capture complications:
| Event Type | Triggered When | Why It Matters |
|---|---|---|
| task_blocked | Task waits for dependency | Reveals bottlenecks, dependency chains |
| task_unblocked | Blocking condition resolves | Measures wait times |
| task_reassigned | Task moves to different master | Tracks load balancing, failures |
| task_escalated | Requires manual intervention | Critical quality gate failures |
Real Debugging Scenario:
A security scan task is stuck “in progress” for 3 hours. Lineage reveals:
{
"event_type": "task_blocked",
"reason": "waiting_for_credential_rotation",
"timestamp": "2025-11-27T14:23:00Z"
}
Without lineage, you’d be blind. With it, you know exactly where to look.
Cross-Master Handoffs (3 Events)
When a task needs expertise from multiple masters (e.g., Development → Security → Documentation), handoffs track the transition:
| Event Type | Triggered When | Data Captured |
|---|---|---|
| handoff_created | Source master initiates handoff | From/to masters, handoff ID |
| handoff_accepted | Target master accepts | Acceptance timestamp |
| handoff_completed | Handoff work finishes | Deliverables from target master |
Handoff Flow Diagram:
Development Master (creates feature)
↓ handoff_created
[Handoff Queue]
↓ handoff_accepted
Security Master (reviews code)
↓ handoff_completed
Documentation Master (updates docs)
Each handoff creates a clear separation of responsibilities with full audit trail.
Event-Driven Architecture Benefits
Cortex’s lineage system uses an append-only JSONL log with several key advantages:
1. Write Performance: ~5ms Per Event
async recordOperation(operation) {
const lineageRecord = {
id: this.generateLineageId(),
session_id: this.sessionId,
timestamp: new Date().toISOString(),
type: operation.type,
source: operation.source,
target: operation.target,
actor: operation.actor,
metadata: {
git_commit: await this.getCurrentGitCommit(),
hostname: require('os').hostname(),
process_id: process.pid
}
};
this.operationBuffer.push(lineageRecord);
if (this.operationBuffer.length >= this.bufferSize) {
await this.flush();
}
}
Events are buffered (default: 100 events) and batch-written to disk. This minimizes I/O overhead while maintaining near-real-time visibility.
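The flush itself is just a single append of the buffered records as JSONL. A minimal sketch of what that step can look like, assuming Node's fs/promises module and a log path configured elsewhere (flushBuffer is an illustrative name, not Cortex's actual method):

const fs = require('fs/promises');

// Append all buffered records as one JSONL write, then clear the buffer
async function flushBuffer(buffer, logPath) {
  if (buffer.length === 0) return;
  const lines = buffer.map(record => JSON.stringify(record)).join('\n') + '\n';
  await fs.appendFile(logPath, lines, 'utf8');
  buffer.length = 0;
}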
2. Schema Flexibility with JSON
Each event type has custom event_data:
{
"event_type": "worker_spawned",
"event_data": {
"worker_id": "worker-scan-001",
"worker_type": "scan-worker"
}
}
{
"event_type": "task_failed",
"event_data": {
"error_details": {
"error_type": "ValidationError",
"error_message": "Missing required field: credentials",
"stack_trace": "..."
}
}
}
This flexibility lets each event capture exactly what’s relevant without forcing a rigid schema.
3. Immutable Audit Trail
JSONL append-only logs mean:
- No lost history: Events are never deleted or modified
- Tamper evidence: Each event has a SHA-256 checksum
- Compliance ready: 7-year retention for security events, 3 years for others
- Easy archival: Rotate to daily files (lineage-2025-11-27.jsonl)
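The per-event SHA-256 checksum can be computed over the serialized record just before it is appended. A small sketch using Node's built-in crypto module (the checksum field name is illustrative):

const crypto = require('crypto');

// Attach a SHA-256 digest of the record's serialized form for tamper evidence
function withChecksum(record) {
  const payload = JSON.stringify(record);
  const checksum = crypto.createHash('sha256').update(payload).digest('hex');
  return { ...record, checksum };
}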
Query Performance: Sub-200ms Target
Logging is only half the story. You need to query that data fast. Cortex achieves sub-200ms queries through:
Index-Based Lookups
// In-memory index maps entities to line offsets
const index = {
entities: {
'task-security-scan-001': {
operations: 47,
last_access: '2025-11-27T14:23:00Z'
}
},
actors: {
'security-master': {
operations: 234,
last_operation: '2025-11-27T14:30:00Z'
}
}
};
Before scanning the entire log, check the index. If the entity doesn’t exist, return immediately.
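In code, that guard is a plain dictionary lookup before any file I/O; a simplified sketch against the index shape shown above:

// Decide whether a log scan is worthwhile for a given entity
function shouldScanLog(index, entityId) {
  const entry = index.entities[entityId];
  if (!entry) return false; // Entity never appeared in the log: return empty immediately
  return entry.operations > 0;
}

// e.g. if (!shouldScanLog(index, 'task-security-scan-001')) return [];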
LRU Cache for Hot Queries
class LRUCache {
  constructor(maxSize = 100) {
    this.cache = new Map();
    this.maxSize = maxSize;
  }

  get(key) {
    if (!this.cache.has(key)) return null;
    // Move to end (most recently used)
    const value = this.cache.get(key);
    this.cache.delete(key);
    this.cache.set(key, value);
    return value;
  }

  set(key, value) {
    // Re-inserting moves an existing key to the most-recently-used position
    if (this.cache.has(key)) this.cache.delete(key);
    this.cache.set(key, value);
    // Evict the least recently used entry (first key in insertion order)
    if (this.cache.size > this.maxSize) {
      this.cache.delete(this.cache.keys().next().value);
    }
  }
}
Frequently queried tasks (e.g., monitoring dashboards checking current tasks) are served from memory.
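Wiring the cache into the query path is then a one-line check before touching disk. A sketch, where scanLogForTask stands in for the streaming read shown in the next section:

const cache = new LRUCache(100);

async function getTaskLineageCached(taskId) {
  const key = `task:${taskId}`;
  const cached = cache.get(key);
  if (cached) return cached; // Warm path: served from memory
  const events = await scanLogForTask(taskId); // Cold path: stream the JSONL log
  cache.set(key, events);
  return events;
}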
Streaming Reads with Early Exit
// fsSync and readline come from Node's standard library;
// LINEAGE_LOG, targetTask, results, and limit are defined by the surrounding query function
const fsSync = require('fs');
const readline = require('readline');

const fileStream = fsSync.createReadStream(LINEAGE_LOG);
const rl = readline.createInterface({ input: fileStream, crlfDelay: Infinity });

for await (const line of rl) {
  if (!line.trim()) continue; // Skip blank lines in the JSONL file
  const record = JSON.parse(line);
  if (record.task_id === targetTask) {
    results.push(record);
    if (results.length >= limit) {
      rl.close();
      fileStream.close();
      break; // Stop reading early
    }
  }
}
No need to load the entire 500MB log file into memory—stream it and stop when you have enough results.
Performance Benchmark Results
| Query Type | Cold Cache | Warm Cache | Target |
|---|---|---|---|
| Single task (100 events) | 45ms | 3ms | 200ms |
| Actor query (500 events) | 112ms | 8ms | 200ms |
| Time range (1000 events) | 187ms | 15ms | 200ms |
All queries meet the 200ms target, even on cold cache.
Real Debugging Scenarios
Scenario 1: Task Stuck “In Progress”
Problem: Task task-deploy-staging-042 shows “in progress” for 2 hours but no activity.
Lineage Query:
./scripts/query-lineage.sh --task task-deploy-staging-042 --timeline
Output:
2025-11-27 12:00:00 task_created (user-ryan)
2025-11-27 12:00:05 task_assigned (coordinator → deployment-master)
2025-11-27 12:00:10 task_started (deployment-master)
2025-11-27 12:00:15 worker_spawned (worker-deploy-001)
2025-11-27 12:00:20 worker_started (worker-deploy-001)
2025-11-27 12:15:30 worker_failed (worker-deploy-001)
└─ error: "Connection timeout to staging cluster"
2025-11-27 12:15:35 task_blocked (reason: "retry_backoff")
Root Cause: Worker failed due to network timeout. Task is in exponential backoff retry. Not stuck—just waiting.
Fix: Check network connectivity to staging cluster or manually unblock with higher timeout.
Scenario 2: Mysterious Task Reassignment
Problem: Task completed by documentation-master but was assigned to development-master.
Lineage Query:
./scripts/query-lineage.sh --task task-feature-001
Key Events:
[
{
"event_type": "task_assigned",
"event_data": { "master_id": "development-master" },
"timestamp": "2025-11-27T10:00:00Z"
},
{
"event_type": "handoff_created",
"event_data": {
"from_master": "development-master",
"to_master": "documentation-master",
"reason": "code_complete_needs_docs"
},
"timestamp": "2025-11-27T10:30:00Z"
},
{
"event_type": "handoff_accepted",
"actor": { "type": "master", "id": "documentation-master" },
"timestamp": "2025-11-27T10:30:05Z"
}
]
Root Cause: Not a reassignment—a handoff. Development completed code, handed off to Documentation for README updates. Working as designed.
Scenario 3: Token Budget Overrun
Problem: Monthly token budget hit limit on the 15th of the month.
Lineage Query:
// Aggregate token usage from worker_completed events
const events = await lineageQuery.queryByTimeRange(
'2025-11-01T00:00:00Z',
'2025-11-15T23:59:59Z'
);
const tokensByMaster = {};
events
.filter(e => e.event_type === 'worker_completed')
.forEach(e => {
const master = e.event_data.master_id;
const tokens = e.event_data.token_usage?.total_tokens || 0;
tokensByMaster[master] = (tokensByMaster[master] || 0) + tokens;
});
console.log(tokensByMaster);
Output:
{
"development-master": 450000,
"security-master": 1200000, // ← Culprit
"documentation-master": 50000
}
Root Cause: Security master’s code review workers used 1.2M tokens—70% of monthly budget. Reviews were running on every commit, including tiny typo fixes.
Fix: Implement smart review triggers—skip reviews for docs-only changes.
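The trigger itself can be a cheap check on the changed file paths before any review worker is spawned. A rough sketch (the shouldReview name and path patterns are illustrative, not Cortex's actual API):

// Skip security review when a change touches only documentation
function shouldReview(changedFiles) {
  const docsOnly = changedFiles.every(
    f => f.endsWith('.md') || f.startsWith('docs/')
  );
  return !docsOnly;
}

// shouldReview(['docs/setup.md', 'README.md']) -> false (skip review)
// shouldReview(['src/auth.js', 'README.md'])   -> true  (run review)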
Performance Overhead Considerations
Write Overhead
Lineage tracking adds ~5-10ms per event to task execution. For a typical task with ten events (created, assigned, started, completed, plus a handful of worker events), that’s 50-100ms in total, which is negligible compared to actual LLM inference time (1-10 seconds).
Mitigation:
- Buffered writes (100 events before flush)
- Async logging (non-blocking)
- Disable in performance-critical paths (rare)
Storage Growth
At ~500 bytes per event:
- 1,000 tasks/day × 10 events/task × 500 bytes = 5MB/day
- Annual: ~1.8GB
- With daily rotation and compression: ~500MB/year
Archival Strategy:
# Rotate daily
mv lineage.jsonl lineage-$(date +%Y-%m-%d).jsonl
gzip lineage-$(date -d '7 days ago' +%Y-%m-%d).jsonl
# Archive to S3 after 30 days
aws s3 cp lineage-2025-10-*.jsonl.gz s3://cortex-archives/lineage/
Query Load
The index file grows with unique entities/actors. At 10,000 tracked entities:
- Index size: ~500KB (easily fits in memory)
- Index load time: ~10ms
- Index TTL refresh: 1 minute (configurable)
For systems tracking millions of entities, consider:
- Sharded indexes (by date range)
- SQLite for index storage (B-tree lookups)
- Read replicas for dashboards
Building Your Own Lineage System
Want to implement task lineage in your own AI agent framework? Here’s the blueprint:
1. Define Your Event Schema
Start with the minimum viable set:
type LineageEvent = {
lineage_id: string;
task_id: string;
event_type: 'created' | 'started' | 'completed' | 'failed';
timestamp: string; // ISO-8601
actor: {
type: 'user' | 'system' | 'agent';
id: string;
};
event_data?: Record<string, any>;
};
Add more event types as your system grows.
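For concreteness, a record conforming to this minimal schema might look like the following (values are illustrative):

const example: LineageEvent = {
  lineage_id: 'lin-0001',
  task_id: 'task-0001',
  event_type: 'completed',
  timestamp: new Date().toISOString(),
  actor: { type: 'agent', id: 'worker-scan-001' },
  event_data: { duration_ms: 1250 }
};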
2. Choose Your Storage Backend
JSONL (Cortex’s approach):
- Pros: Simple, portable, easy to parse, diff-friendly
- Cons: No built-in indexing, manual retention management
- Best for: <1M events, simple queries
SQLite:
- Pros: Indexed queries, transactions, relations
- Cons: Write contention at scale, harder to archive
- Best for: 1M-10M events, complex queries
PostgreSQL + TimescaleDB:
- Pros: Time-series optimization, retention policies, distributed queries
- Cons: Infrastructure overhead
- Best for: 10M+ events, analytics, multi-tenancy
3. Implement Instrumentation Points
Inject lineage tracking at key lifecycle hooks:
class TaskExecutor {
async execute(task) {
// Log task start
await lineage.recordEvent({
task_id: task.id,
event_type: 'task_started',
actor: { type: 'system', id: 'executor' }
});
try {
const result = await this.runTask(task);
// Log completion
await lineage.recordEvent({
task_id: task.id,
event_type: 'task_completed',
event_data: {
duration_ms: Date.now() - task.start_time,
deliverables: result.outputs
}
});
return result;
} catch (error) {
// Log failure
await lineage.recordEvent({
task_id: task.id,
event_type: 'task_failed',
event_data: {
error_type: error.constructor.name,
error_message: error.message
}
});
throw error;
}
}
}
4. Build Query Utilities
Expose queries your users actually need:
class LineageQuery {
// Get all events for a task
async getTaskLineage(taskId) { /* ... */ }
// Find tasks by actor (who did what?)
async getTasksByActor(actorId) { /* ... */ }
// Find failures in time range (what broke recently?)
async getFailures(startTime, endTime) { /* ... */ }
// Aggregate metrics (how many tasks completed today?)
async getMetrics(timeRange) { /* ... */ }
}
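Filling in the first stub is mostly a matter of reusing the streaming read from earlier. A sketch of getTaskLineage against a JSONL backend, taking the log path as a parameter:

const fs = require('fs');
const readline = require('readline');

async function getTaskLineage(taskId, logPath) {
  const events = [];
  const rl = readline.createInterface({
    input: fs.createReadStream(logPath),
    crlfDelay: Infinity
  });
  for await (const line of rl) {
    if (!line.trim()) continue;
    const record = JSON.parse(line);
    if (record.task_id === taskId) events.push(record);
  }
  // Oldest first so the result reads as a timeline
  return events.sort((a, b) => a.timestamp.localeCompare(b.timestamp));
}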
5. Optimize for Your Access Patterns
If you query by task_id most often:
- Index on task_id
- Partition by task creation date
- Cache recent task lineages
If you query by actor:
- Secondary index on actor.id
- Inverted index (actor → task IDs)
If you need time-series analytics:
- Use columnar storage (Parquet)
- Pre-aggregate metrics (daily summaries)
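Pre-aggregation can be as simple as bucketing events by UTC day in an off-peak job; a minimal sketch for daily completion counts:

// Count completed tasks per UTC day from a list of lineage events
function dailyCompletions(events) {
  const byDay = {};
  for (const e of events) {
    if (e.event_type !== 'task_completed') continue;
    const day = e.timestamp.slice(0, 10); // "YYYY-MM-DD" prefix of the ISO-8601 timestamp
    byDay[day] = (byDay[day] || 0) + 1;
  }
  return byDay;
}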
The Bottom Line
Task lineage isn’t optional for production AI agent systems—it’s foundational. When (not if) your agents misbehave, you need to know:
- What happened: Full event timeline
- Who did it: Actor accountability
- Why it happened: State transitions and errors
- How to fix it: Replay, debug, prevent
Cortex’s 18-event lineage system gives you this visibility with minimal overhead (~5ms per event) and fast queries (sub-200ms). Whether you’re debugging a stuck task, tracking token usage, or generating compliance reports, lineage data is your source of truth.
Start simple—track task creation, start, and completion. Add worker events and state transitions as you need them. Before long, you’ll wonder how you ever debugged distributed AI agents without it.
Next in Series: Cortex’s Auto-Learning System: Feedback Loops That Actually Work - How we use lineage data to automatically improve agent performance.