Complete Task Lineage: 18 Event Types That Give You Total Visibility
When you’re running a multi-agent AI system, the question isn’t if something will go wrong—it’s when. And when it does, you need answers fast. Why did this task fail? Which worker processed it? Was it reassigned? Did a handoff happen?
This is where task lineage tracking becomes essential. In Cortex, we built a comprehensive lineage system with 18 distinct event types that capture every state transition, every actor change, and every milestone in a task’s lifecycle. The result? Complete visibility into what happened, who did it, and why—all queryable in under 200ms.
What Is Task Lineage for AI Agents?
Task lineage is the complete audit trail of a task’s journey through your system. For AI agents, this means tracking:
- Task lifecycle: From creation through completion or failure
- Worker execution: Spawning, progress updates, and termination
- State transitions: Blocking, unblocking, escalation, cancellation
- Cross-master handoffs: When tasks move between specialist agents
- Actor accountability: Who (or what) triggered each event
Think of it as Git blame for task execution—every change, every transition, every decision is recorded with full context.
The 18 Event Types: Complete Coverage
Cortex’s lineage system categorizes events into four logical groups, covering every state transition in a task’s lifecycle:
Core Task Lifecycle (6 Events)
These events track the fundamental task journey:
| Event Type | Triggered When | Actor | Key Data |
|---|---|---|---|
| task_created | User or system creates a task | User/System | Priority, metadata |
| task_assigned | Coordinator assigns to a master | Coordinator | Master ID, priority |
| task_started | Master begins execution | Master | Start timestamp |
| task_completed | Task finishes successfully | Master | Deliverables, duration |
| task_failed | Task execution fails | Master/Worker | Error details, stack trace |
| task_cancelled | Task is cancelled | User/System | Cancellation reason |
Example Flow:
task_created → task_assigned → task_started → task_completed
(user) (coordinator) (master) (master)
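Each arrow in that flow appends one lineage record. As a rough sketch (the field names match the schema shown later in this post; the IDs and values are illustrative):

{
  "lineage_id": "lin-20251127-0001",
  "task_id": "task-feature-001",
  "event_type": "task_created",
  "timestamp": "2025-11-27T12:00:00Z",
  "actor": { "type": "user", "id": "user-ryan" },
  "event_data": { "priority": "high" }
}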
Worker Execution (5 Events)
Workers are ephemeral agents spawned to execute specific sub-tasks. These events track their lifecycle:
| Event Type | Triggered When | Key Data |
|---|---|---|
| worker_spawned | Master creates a worker | Worker ID, worker type |
| worker_started | Worker begins execution | Start timestamp |
| worker_progress | Worker reports progress | Progress %, intermediate results |
| worker_completed | Worker finishes successfully | Token usage, deliverables |
| worker_failed | Worker encounters error | Error type, message, recovery hints |
Why Track Workers Separately?
A single task might spawn dozens of workers. Tracking them individually lets you:
- Identify which specific worker failed in a batch
- Measure token consumption per worker type
- Detect performance regressions in specific worker implementations
- Calculate parallel execution efficiency
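For instance, measuring token consumption per worker type is a small aggregation over worker_completed events. A sketch, assuming each event carries event_data.worker_type and event_data.token_usage.total_tokens (adjust the field names to your own schema):

// Sum total tokens per worker type across worker_completed events
function tokensByWorkerType(events) {
  const totals = {};
  for (const e of events) {
    if (e.event_type !== 'worker_completed') continue;
    const type = e.event_data?.worker_type || 'unknown';
    const tokens = e.event_data?.token_usage?.total_tokens || 0;
    totals[type] = (totals[type] || 0) + tokens;
  }
  return totals;
}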
State Transitions (4 Events)
Tasks don’t always follow a linear path. These events capture complications:
| Event Type | Triggered When | Why It Matters |
|---|---|---|
| task_blocked | Task waits for dependency | Reveals bottlenecks, dependency chains |
| task_unblocked | Blocking condition resolves | Measures wait times |
| task_reassigned | Task moves to different master | Tracks load balancing, failures |
| task_escalated | Requires manual intervention | Critical quality gate failures |
Real Debugging Scenario:
A security scan task is stuck “in progress” for 3 hours. Lineage reveals:
{
"event_type": "task_blocked",
"reason": "waiting_for_credential_rotation",
"timestamp": "2025-11-27T14:23:00Z"
}
Without lineage, you’d be blind. With it, you know exactly where to look.
Cross-Master Handoffs (3 Events)
When a task needs expertise from multiple masters (e.g., Development → Security → Documentation), handoffs track the transition:
| Event Type | Triggered When | Data Captured |
|---|---|---|
| handoff_created | Source master initiates handoff | From/to masters, handoff ID |
| handoff_accepted | Target master accepts | Acceptance timestamp |
| handoff_completed | Handoff work finishes | Deliverables from target master |
Handoff Flow Diagram:
Development Master (creates feature)
↓ handoff_created
[Handoff Queue]
↓ handoff_accepted
Security Master (reviews code)
↓ handoff_completed
Documentation Master (updates docs)
Each handoff creates a clear separation of responsibilities with full audit trail.
Event-Driven Architecture Benefits
Cortex’s lineage system uses an append-only JSONL log with several key advantages:
1. Write Performance: ~5ms Per Event
async recordOperation(operation) {
const lineageRecord = {
id: this.generateLineageId(),
session_id: this.sessionId,
timestamp: new Date().toISOString(),
type: operation.type,
source: operation.source,
target: operation.target,
actor: operation.actor,
metadata: {
git_commit: await this.getCurrentGitCommit(),
hostname: require('os').hostname(),
process_id: process.pid
}
};
this.operationBuffer.push(lineageRecord);
if (this.operationBuffer.length >= this.bufferSize) {
await this.flush();
}
}
Events are buffered (default: 100 events) and batch-written to disk. This minimizes I/O overhead while maintaining near-real-time visibility.
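The flush itself is just a single append of the buffered records as JSONL. A minimal sketch of what that step can look like, assuming Node's fs/promises module and a log path configured elsewhere (flushBuffer is an illustrative name, not Cortex's actual method):

const fs = require('fs/promises');

// Append all buffered records as one JSONL write, then clear the buffer
async function flushBuffer(buffer, logPath) {
  if (buffer.length === 0) return;
  const lines = buffer.map(record => JSON.stringify(record)).join('\n') + '\n';
  await fs.appendFile(logPath, lines, 'utf8');
  buffer.length = 0;
}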
2. Schema Flexibility with JSON
Each event type has custom event_data:
{
"event_type": "worker_spawned",
"event_data": {
"worker_id": "worker-scan-001",
"worker_type": "scan-worker"
}
}
{
"event_type": "task_failed",
"event_data": {
"error_details": {
"error_type": "ValidationError",
"error_message": "Missing required field: credentials",
"stack_trace": "..."
}
}
}
This flexibility lets each event capture exactly what’s relevant without forcing a rigid schema.
3. Immutable Audit Trail
JSONL append-only logs mean:
- No lost history: Events are never deleted or modified
- Tamper evidence: Each event has a SHA-256 checksum
- Compliance ready: 7-year retention for security events, 3 years for others
- Easy archival: Rotate to daily files (lineage-2025-11-27.jsonl)
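The per-event SHA-256 checksum can be computed over the serialized record just before it is appended. A small sketch using Node's built-in crypto module (the checksum field name is illustrative):

const crypto = require('crypto');

// Attach a SHA-256 digest of the record's serialized form for tamper evidence
function withChecksum(record) {
  const payload = JSON.stringify(record);
  const checksum = crypto.createHash('sha256').update(payload).digest('hex');
  return { ...record, checksum };
}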
Query Performance: Sub-200ms Target
Logging is only half the story. You need to query that data fast. Cortex achieves sub-200ms queries through:
Index-Based Lookups
// In-memory index maps entities to line offsets
const index = {
entities: {
'task-security-scan-001': {
operations: 47,
last_access: '2025-11-27T14:23:00Z'
}
},
actors: {
'security-master': {
operations: 234,
last_operation: '2025-11-27T14:30:00Z'
}
}
};
Before scanning the entire log, check the index. If the entity doesn’t exist, return immediately.
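In code, that guard is a plain dictionary lookup before any file I/O; a simplified sketch against the index shape shown above:

// Decide whether a log scan is worthwhile for a given entity
function shouldScanLog(index, entityId) {
  const entry = index.entities[entityId];
  if (!entry) return false; // Entity never appeared in the log: return empty immediately
  return entry.operations > 0;
}

// e.g. if (!shouldScanLog(index, 'task-security-scan-001')) return [];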
LRU Cache for Hot Queries
class LRUCache {
  constructor(maxSize = 100) {
    this.cache = new Map();
    this.maxSize = maxSize;
  }

  get(key) {
    if (!this.cache.has(key)) return null;
    // Move to end (most recently used)
    const value = this.cache.get(key);
    this.cache.delete(key);
    this.cache.set(key, value);
    return value;
  }

  set(key, value) {
    // Re-inserting moves an existing key to the most-recently-used position
    if (this.cache.has(key)) this.cache.delete(key);
    this.cache.set(key, value);
    // Evict the least recently used entry (first key in insertion order)
    if (this.cache.size > this.maxSize) {
      this.cache.delete(this.cache.keys().next().value);
    }
  }
}
Frequently queried tasks (e.g., monitoring dashboards checking current tasks) are served from memory.
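Wiring the cache into the query path is then a one-line check before touching disk. A sketch, where scanLogForTask stands in for the streaming read shown in the next section:

const cache = new LRUCache(100);

async function getTaskLineageCached(taskId) {
  const key = `task:${taskId}`;
  const cached = cache.get(key);
  if (cached) return cached; // Warm path: served from memory
  const events = await scanLogForTask(taskId); // Cold path: stream the JSONL log
  cache.set(key, events);
  return events;
}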
Streaming Reads with Early Exit
// fsSync and readline come from Node's standard library;
// LINEAGE_LOG, targetTask, results, and limit are defined by the surrounding query function
const fsSync = require('fs');
const readline = require('readline');

const fileStream = fsSync.createReadStream(LINEAGE_LOG);
const rl = readline.createInterface({ input: fileStream, crlfDelay: Infinity });

for await (const line of rl) {
  if (!line.trim()) continue; // Skip blank lines in the JSONL file
  const record = JSON.parse(line);
  if (record.task_id === targetTask) {
    results.push(record);
    if (results.length >= limit) {
      rl.close();
      fileStream.close();
      break; // Stop reading early
    }
  }
}
No need to load the entire 500MB log file into memory—stream it and stop when you have enough results.
Performance Benchmark Results
| Query Type | Cold Cache | Warm Cache | Target |
|---|---|---|---|
| Single task (100 events) | 45ms | 3ms | 200ms |
| Actor query (500 events) | 112ms | 8ms | 200ms |
| Time range (1000 events) | 187ms | 15ms | 200ms |
All queries meet the 200ms target, even on cold cache.
Real Debugging Scenarios
Scenario 1: Task Stuck “In Progress”
Problem: Task task-deploy-staging-042 shows “in progress” for 2 hours but no activity.
Lineage Query:
./scripts/query-lineage.sh --task task-deploy-staging-042 --timeline
Output:
2025-11-27 12:00:00 task_created (user-ryan)
2025-11-27 12:00:05 task_assigned (coordinator → deployment-master)
2025-11-27 12:00:10 task_started (deployment-master)
2025-11-27 12:00:15 worker_spawned (worker-deploy-001)
2025-11-27 12:00:20 worker_started (worker-deploy-001)
2025-11-27 12:15:30 worker_failed (worker-deploy-001)
└─ error: "Connection timeout to staging cluster"
2025-11-27 12:15:35 task_blocked (reason: "retry_backoff")
Root Cause: Worker failed due to network timeout. Task is in exponential backoff retry. Not stuck—just waiting.
Fix: Check network connectivity to staging cluster or manually unblock with higher timeout.
Scenario 2: Mysterious Task Reassignment
Problem: Task completed by documentation-master but was assigned to development-master.
Lineage Query:
./scripts/query-lineage.sh --task task-feature-001
Key Events:
[
{
"event_type": "task_assigned",
"event_data": { "master_id": "development-master" },
"timestamp": "2025-11-27T10:00:00Z"
},
{
"event_type": "handoff_created",
"event_data": {
"from_master": "development-master",
"to_master": "documentation-master",
"reason": "code_complete_needs_docs"
},
"timestamp": "2025-11-27T10:30:00Z"
},
{
"event_type": "handoff_accepted",
"actor": { "type": "master", "id": "documentation-master" },
"timestamp": "2025-11-27T10:30:05Z"
}
]
Root Cause: Not a reassignment—a handoff. Development completed code, handed off to Documentation for README updates. Working as designed.
Scenario 3: Token Budget Overrun
Problem: Monthly token budget hit limit on the 15th of the month.
Lineage Query:
// Aggregate token usage from worker_completed events
const events = await lineageQuery.queryByTimeRange(
'2025-11-01T00:00:00Z',
'2025-11-15T23:59:59Z'
);
const tokensByMaster = {};
events
.filter(e => e.event_type === 'worker_completed')
.forEach(e => {
const master = e.event_data.master_id;
const tokens = e.event_data.token_usage?.total_tokens || 0;
tokensByMaster[master] = (tokensByMaster[master] || 0) + tokens;
});
console.log(tokensByMaster);
Output:
{
"development-master": 450000,
"security-master": 1200000, // ← Culprit
"documentation-master": 50000
}
Root Cause: Security master’s code review workers used 1.2M tokens—70% of monthly budget. Reviews were running on every commit, including tiny typo fixes.
Fix: Implement smart review triggers—skip reviews for docs-only changes.
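The trigger itself can be a cheap check on the changed file paths before any review worker is spawned. A rough sketch (the shouldReview name and path patterns are illustrative, not Cortex's actual API):

// Skip security review when a change touches only documentation
function shouldReview(changedFiles) {
  const docsOnly = changedFiles.every(
    f => f.endsWith('.md') || f.startsWith('docs/')
  );
  return !docsOnly;
}

// shouldReview(['docs/setup.md', 'README.md']) -> false (skip review)
// shouldReview(['src/auth.js', 'README.md'])   -> true  (run review)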
Performance Overhead Considerations
Write Overhead
Lineage tracking adds ~5-10ms per event to task execution. For a typical task with ten events (created, assigned, started, completed, plus a handful of worker events), that’s 50-100ms in total, which is negligible compared to actual LLM inference time (1-10 seconds).
Mitigation:
- Buffered writes (100 events before flush)
- Async logging (non-blocking)
- Disable in performance-critical paths (rare)
Storage Growth
At ~500 bytes per event:
- 1,000 tasks/day × 10 events/task × 500 bytes = 5MB/day
- Annual: ~1.8GB
- With daily rotation and compression: ~500MB/year
Archival Strategy:
# Rotate daily
mv lineage.jsonl lineage-$(date +%Y-%m-%d).jsonl
gzip lineage-$(date -d '7 days ago' +%Y-%m-%d).jsonl
# Archive to S3 after 30 days
aws s3 cp lineage-2025-10-*.jsonl.gz s3://cortex-archives/lineage/
Query Load
The index file grows with unique entities/actors. At 10,000 tracked entities:
- Index size: ~500KB (easily fits in memory)
- Index load time: ~10ms
- Index TTL refresh: 1 minute (configurable)
For systems tracking millions of entities, consider:
- Sharded indexes (by date range)
- SQLite for index storage (B-tree lookups)
- Read replicas for dashboards
Building Your Own Lineage System
Want to implement task lineage in your own AI agent framework? Here’s the blueprint:
1. Define Your Event Schema
Start with the minimum viable set:
type LineageEvent = {
lineage_id: string;
task_id: string;
event_type: 'created' | 'started' | 'completed' | 'failed';
timestamp: string; // ISO-8601
actor: {
type: 'user' | 'system' | 'agent';
id: string;
};
event_data?: Record<string, any>;
};
Add more event types as your system grows.
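For concreteness, a record conforming to this minimal schema might look like the following (values are illustrative):

const example: LineageEvent = {
  lineage_id: 'lin-0001',
  task_id: 'task-0001',
  event_type: 'completed',
  timestamp: new Date().toISOString(),
  actor: { type: 'agent', id: 'worker-scan-001' },
  event_data: { duration_ms: 1250 }
};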
2. Choose Your Storage Backend
JSONL (Cortex’s approach):
- Pros: Simple, portable, easy to parse, diff-friendly
- Cons: No built-in indexing, manual retention management
- Best for: <1M events, simple queries
SQLite:
- Pros: Indexed queries, transactions, relations
- Cons: Write contention at scale, harder to archive
- Best for: 1M-10M events, complex queries
PostgreSQL + TimescaleDB:
- Pros: Time-series optimization, retention policies, distributed queries
- Cons: Infrastructure overhead
- Best for: 10M+ events, analytics, multi-tenancy
3. Implement Instrumentation Points
Inject lineage tracking at key lifecycle hooks:
class TaskExecutor {
async execute(task) {
// Log task start
await lineage.recordEvent({
task_id: task.id,
event_type: 'task_started',
actor: { type: 'system', id: 'executor' }
});
try {
const result = await this.runTask(task);
// Log completion
await lineage.recordEvent({
task_id: task.id,
event_type: 'task_completed',
event_data: {
duration_ms: Date.now() - task.start_time,
deliverables: result.outputs
}
});
return result;
} catch (error) {
// Log failure
await lineage.recordEvent({
task_id: task.id,
event_type: 'task_failed',
event_data: {
error_type: error.constructor.name,
error_message: error.message
}
});
throw error;
}
}
}
4. Build Query Utilities
Expose queries your users actually need:
class LineageQuery {
// Get all events for a task
async getTaskLineage(taskId) { /* ... */ }
// Find tasks by actor (who did what?)
async getTasksByActor(actorId) { /* ... */ }
// Find failures in time range (what broke recently?)
async getFailures(startTime, endTime) { /* ... */ }
// Aggregate metrics (how many tasks completed today?)
async getMetrics(timeRange) { /* ... */ }
}
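Filling in the first stub is mostly a matter of reusing the streaming read from earlier. A sketch of getTaskLineage against a JSONL backend, taking the log path as a parameter:

const fs = require('fs');
const readline = require('readline');

async function getTaskLineage(taskId, logPath) {
  const events = [];
  const rl = readline.createInterface({
    input: fs.createReadStream(logPath),
    crlfDelay: Infinity
  });
  for await (const line of rl) {
    if (!line.trim()) continue;
    const record = JSON.parse(line);
    if (record.task_id === taskId) events.push(record);
  }
  // Oldest first so the result reads as a timeline
  return events.sort((a, b) => a.timestamp.localeCompare(b.timestamp));
}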
5. Optimize for Your Access Patterns
If you query by task_id most often:
- Index on task_id
- Partition by task creation date
- Cache recent task lineages
If you query by actor:
- Secondary index on actor.id
- Inverted index (actor → task IDs)
If you need time-series analytics:
- Use columnar storage (Parquet)
- Pre-aggregate metrics (daily summaries)
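Pre-aggregation can be as simple as bucketing events by UTC day in an off-peak job; a minimal sketch for daily completion counts:

// Count completed tasks per UTC day from a list of lineage events
function dailyCompletions(events) {
  const byDay = {};
  for (const e of events) {
    if (e.event_type !== 'task_completed') continue;
    const day = e.timestamp.slice(0, 10); // "YYYY-MM-DD" prefix of the ISO-8601 timestamp
    byDay[day] = (byDay[day] || 0) + 1;
  }
  return byDay;
}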
The Bottom Line
Task lineage isn’t optional for production AI agent systems—it’s foundational. When (not if) your agents misbehave, you need to know:
- What happened: Full event timeline
- Who did it: Actor accountability
- Why it happened: State transitions and errors
- How to fix it: Replay, debug, prevent
Cortex’s 18-event lineage system gives you this visibility with minimal overhead (~5ms per event) and fast queries (sub-200ms). Whether you’re debugging a stuck task, tracking token usage, or generating compliance reports, lineage data is your source of truth.
Start simple—track task creation, start, and completion. Add worker events and state transitions as you need them. Before long, you’ll wonder how you ever debugged distributed AI agents without it.
Next in Series: Cortex’s Auto-Learning System: Feedback Loops That Actually Work - How we use lineage data to automatically improve agent performance.