Complete Task Lineage: 18 Event Types That Give You Total Visibility

Ryan Dahlberg
December 16, 2025 · 11 min read

When you’re running a multi-agent AI system, the question isn’t if something will go wrong—it’s when. And when it does, you need answers fast. Why did this task fail? Which worker processed it? Was it reassigned? Did a handoff happen?

This is where task lineage tracking becomes essential. In Cortex, we built a comprehensive lineage system with 18 distinct event types that capture every state transition, every actor change, and every milestone in a task’s lifecycle. The result? Complete visibility into what happened, who did it, and why—all queryable in under 200ms.

What Is Task Lineage for AI Agents?

Task lineage is the complete audit trail of a task’s journey through your system. For AI agents, this means tracking:

  • Task lifecycle: From creation through completion or failure
  • Worker execution: Spawning, progress updates, and termination
  • State transitions: Blocking, unblocking, escalation, cancellation
  • Cross-master handoffs: When tasks move between specialist agents
  • Actor accountability: Who (or what) triggered each event

Think of it as Git blame for task execution—every change, every transition, every decision is recorded with full context.

The 18 Event Types: Complete Coverage

Cortex’s lineage system categorizes events into four logical groups, covering every possible state transition:

Core Task Lifecycle (6 Events)

These events track the fundamental task journey:

Event Type      | Triggered When                   | Actor         | Key Data
----------------|----------------------------------|---------------|----------------------------
task_created    | User or system creates a task    | User/System   | Priority, metadata
task_assigned   | Coordinator assigns to a master  | Coordinator   | Master ID, priority
task_started    | Master begins execution          | Master        | Start timestamp
task_completed  | Task finishes successfully       | Master        | Deliverables, duration
task_failed     | Task execution fails             | Master/Worker | Error details, stack trace
task_cancelled  | Task is cancelled                | User/System   | Cancellation reason

Example Flow:

task_created → task_assigned → task_started → task_completed
    (user)        (coordinator)     (master)        (master)

Worker Execution (5 Events)

Workers are ephemeral agents spawned to execute specific sub-tasks. These events track their lifecycle:

Event Type        | Triggered When                | Key Data
------------------|-------------------------------|--------------------------------------
worker_spawned    | Master creates a worker       | Worker ID, worker type
worker_started    | Worker begins execution       | Start timestamp
worker_progress   | Worker reports progress       | Progress %, intermediate results
worker_completed  | Worker finishes successfully  | Token usage, deliverables
worker_failed     | Worker encounters error       | Error type, message, recovery hints

Why Track Workers Separately?

A single task might spawn dozens of workers. Tracking them individually lets you:

  • Identify which specific worker failed in a batch
  • Measure token consumption per worker type
  • Detect performance regressions in specific worker implementations
  • Calculate parallel execution efficiency
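
As a rough sketch of the second point above, per-worker-type token totals can be reduced from worker_completed events. The field names here (event_data.worker_type, event_data.token_usage) follow the examples later in this post and are illustrative rather than a fixed schema:

// Sketch: sum token usage per worker type from worker_completed events.
// Assumes worker_completed events carry event_data.worker_type and
// event_data.token_usage.total_tokens (illustrative field names).
function tokensByWorkerType(events: any[]): Record<string, number> {
  const totals: Record<string, number> = {};

  for (const e of events) {
    if (e.event_type !== 'worker_completed') continue;
    const type = e.event_data?.worker_type || 'unknown';
    totals[type] = (totals[type] || 0) + (e.event_data?.token_usage?.total_tokens || 0);
  }

  return totals;
}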

State Transitions (4 Events)

Tasks don’t always follow a linear path. These events capture complications:

Event Type       | Triggered When                  | Why It Matters
-----------------|---------------------------------|-----------------------------------------
task_blocked     | Task waits for dependency       | Reveals bottlenecks, dependency chains
task_unblocked   | Blocking condition resolves     | Measures wait times
task_reassigned  | Task moves to different master  | Tracks load balancing, failures
task_escalated   | Requires manual intervention    | Critical quality gate failures

Real Debugging Scenario:

A security scan task is stuck “in progress” for 3 hours. Lineage reveals:

{
  "event_type": "task_blocked",
  "reason": "waiting_for_credential_rotation",
  "timestamp": "2025-11-27T14:23:00Z"
}

Without lineage, you’d be blind. With it, you know exactly where to look.

Cross-Master Handoffs (3 Events)

When a task needs expertise from multiple masters (e.g., Development → Security → Documentation), handoffs track the transition:

Event Type         | Triggered When                   | Data Captured
-------------------|----------------------------------|---------------------------------
handoff_created    | Source master initiates handoff  | From/to masters, handoff ID
handoff_accepted   | Target master accepts            | Acceptance timestamp
handoff_completed  | Handoff work finishes            | Deliverables from target master

Handoff Flow Diagram:

Development Master (creates feature)
         ↓ handoff_created
    [Handoff Queue]
         ↓ handoff_accepted
Security Master (reviews code)
         ↓ handoff_completed
Documentation Master (updates docs)

Each handoff creates a clear separation of responsibilities with a full audit trail.
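
To make the event data concrete, here is a hypothetical handoff_created record as the Development master in the diagram might emit it. Field names follow the table above; the exact Cortex payload may differ:

// Hypothetical handoff_created event; values and the handoff_id are illustrative.
const handoffEvent = {
  event_type: 'handoff_created',
  task_id: 'task-feature-001',
  actor: { type: 'master', id: 'development-master' },
  timestamp: new Date().toISOString(),
  event_data: {
    handoff_id: 'handoff-0042',
    from_master: 'development-master',
    to_master: 'security-master',
    reason: 'code_complete_needs_security_review'
  }
};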

Event-Driven Architecture Benefits

Cortex’s lineage system uses an append-only JSONL log with several key advantages:

1. Write Performance: ~5ms Per Event

async recordOperation(operation) {
  const lineageRecord = {
    id: this.generateLineageId(),
    session_id: this.sessionId,
    timestamp: new Date().toISOString(),
    type: operation.type,
    source: operation.source,
    target: operation.target,
    actor: operation.actor,
    metadata: {
      git_commit: await this.getCurrentGitCommit(),
      hostname: require('os').hostname(),
      process_id: process.pid
    }
  };

  this.operationBuffer.push(lineageRecord);

  if (this.operationBuffer.length >= this.bufferSize) {
    await this.flush();
  }
}

Events are buffered (default: 100 events) and batch-written to disk. This minimizes I/O overhead while maintaining near-real-time visibility.
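
The flush step itself is not shown above. A minimal version, assuming the recorder appends to a LINEAGE_LOG path with Node's fs/promises, might look like this:

const { appendFile } = require('node:fs/promises');

// Minimal flush sketch: serialize the buffered records as JSONL and append
// them in a single write. LINEAGE_LOG is a placeholder path.
const LINEAGE_LOG = 'lineage.jsonl';

async function flushBuffer(buffer: object[]): Promise<void> {
  if (buffer.length === 0) return;

  const lines = buffer.map(r => JSON.stringify(r)).join('\n') + '\n';
  await appendFile(LINEAGE_LOG, lines, 'utf8');
}

Inside the recorder, flush() would call this with this.operationBuffer and then reset the buffer.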

2. Schema Flexibility with JSON

Each event type has custom event_data:

{
  "event_type": "worker_spawned",
  "event_data": {
    "worker_id": "worker-scan-001",
    "worker_type": "scan-worker"
  }
}

{
  "event_type": "task_failed",
  "event_data": {
    "error_details": {
      "error_type": "ValidationError",
      "error_message": "Missing required field: credentials",
      "stack_trace": "..."
    }
  }
}

This flexibility lets each event capture exactly what’s relevant without forcing a rigid schema.

3. Immutable Audit Trail

JSONL append-only logs mean:

  • No lost history: Events are never deleted or modified
  • Tamper evidence: Each event has a SHA-256 checksum
  • Compliance ready: 7-year retention for security events, 3 years for others
  • Easy archival: Rotate to daily files (lineage-2025-11-27.jsonl)
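
The exact checksum scheme is not spelled out in this post. One straightforward way to derive a per-event checksum, hashing the serialized record with Node's crypto module, is sketched below; the field layout is an assumption:

const crypto = require('node:crypto');

// Sketch: SHA-256 over the serialized record, excluding any existing checksum
// field, so a later edit to the stored line becomes detectable.
function checksumRecord(record: Record<string, unknown>): string {
  const { checksum: _omit, ...rest } = record;
  return crypto.createHash('sha256').update(JSON.stringify(rest)).digest('hex');
}

// record.checksum = checksumRecord(record);  // attach before appending to the JSONL log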

Query Performance: Sub-200ms Target

Logging is only half the story. You need to query that data fast. Cortex achieves sub-200ms queries through:

Index-Based Lookups

// In-memory index (simplified view) of the entities and actors seen in the log
const index = {
  entities: {
    'task-security-scan-001': {
      operations: 47,
      last_access: '2025-11-27T14:23:00Z'
    }
  },
  actors: {
    'security-master': {
      operations: 234,
      last_operation: '2025-11-27T14:30:00Z'
    }
  }
};

Before scanning the entire log, check the index. If the entity doesn’t exist, return immediately.
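
A sketch of that early-exit path, with loadIndex and scanLog passed in because Cortex's internals are not shown here:

type LineageIndex = { entities: Record<string, { operations: number; last_access: string }> };

// Sketch: consult the in-memory index first and only fall back to a log scan
// for entities that actually exist. loadIndex and scanLog are hypothetical helpers.
async function getTaskLineage(
  taskId: string,
  loadIndex: () => Promise<LineageIndex>,
  scanLog: (id: string) => Promise<object[]>
): Promise<object[]> {
  const index = await loadIndex();

  if (!index.entities[taskId]) return [];  // unknown entity: return immediately

  return scanLog(taskId);                  // known entity: stream the JSONL log
}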

LRU Cache for Hot Queries

class LRUCache {
  constructor(maxSize = 100) {
    this.cache = new Map();
    this.maxSize = maxSize;
  }

  get(key) {
    if (!this.cache.has(key)) return null;

    // Move to end (most recently used)
    const value = this.cache.get(key);
    this.cache.delete(key);
    this.cache.set(key, value);
    return value;
  }

  set(key, value) {
    // Evict the least recently used entry (the oldest key in insertion order)
    if (!this.cache.has(key) && this.cache.size >= this.maxSize) {
      const oldestKey = this.cache.keys().next().value;
      this.cache.delete(oldestKey);
    }

    this.cache.delete(key);
    this.cache.set(key, value);
  }
}

Frequently queried tasks (e.g., monitoring dashboards checking current tasks) are served from memory.
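
In the query path, the cache wraps the log read roughly like this; loadTaskEvents stands in for the streaming read shown in the next section, and the key format is illustrative:

// Sketch: serve repeated task-lineage queries from the LRU cache above.
const queryCache = new LRUCache(100);

async function getTaskLineageCached(
  taskId: string,
  loadTaskEvents: (id: string) => Promise<object[]>
) {
  const key = `task:${taskId}`;  // illustrative cache key format

  const cached = queryCache.get(key);
  if (cached) return cached;

  const events = await loadTaskEvents(taskId);
  queryCache.set(key, events);
  return events;
}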

Streaming Reads with Early Exit

const fsSync = require('fs');
const readline = require('readline');

const fileStream = fsSync.createReadStream(LINEAGE_LOG);
const rl = readline.createInterface({ input: fileStream });
const results = [];

for await (const line of rl) {
  const record = JSON.parse(line);

  if (record.task_id === targetTask) {
    results.push(record);

    if (results.length >= limit) {
      rl.close();
      fileStream.close();
      break; // Stop reading early
    }
  }
}

No need to load the entire 500MB log file into memory—stream it and stop when you have enough results.

Performance Benchmark Results

Query Type                 | Cold Cache | Warm Cache | Target
---------------------------|------------|------------|--------
Single task (100 events)   | 45ms       | 3ms        | 200ms
Actor query (500 events)   | 112ms      | 8ms        | 200ms
Time range (1000 events)   | 187ms      | 15ms       | 200ms

All queries meet the 200ms target, even on cold cache.

Real Debugging Scenarios

Scenario 1: Task Stuck “In Progress”

Problem: Task task-deploy-staging-042 shows “in progress” for 2 hours but no activity.

Lineage Query:

./scripts/query-lineage.sh --task task-deploy-staging-042 --timeline

Output:

2025-11-27 12:00:00  task_created        (user-ryan)
2025-11-27 12:00:05  task_assigned       (coordinator → deployment-master)
2025-11-27 12:00:10  task_started        (deployment-master)
2025-11-27 12:00:15  worker_spawned      (worker-deploy-001)
2025-11-27 12:00:20  worker_started      (worker-deploy-001)
2025-11-27 12:15:30  worker_failed       (worker-deploy-001)
   └─ error: "Connection timeout to staging cluster"
2025-11-27 12:15:35  task_blocked        (reason: "retry_backoff")

Root Cause: Worker failed due to network timeout. Task is in exponential backoff retry. Not stuck—just waiting.

Fix: Check network connectivity to staging cluster or manually unblock with higher timeout.

Scenario 2: Mysterious Task Reassignment

Problem: A task that was assigned to development-master ended up being completed by documentation-master.

Lineage Query:

./scripts/query-lineage.sh --task task-feature-001

Key Events:

[
  {
    "event_type": "task_assigned",
    "event_data": { "master_id": "development-master" },
    "timestamp": "2025-11-27T10:00:00Z"
  },
  {
    "event_type": "handoff_created",
    "event_data": {
      "from_master": "development-master",
      "to_master": "documentation-master",
      "reason": "code_complete_needs_docs"
    },
    "timestamp": "2025-11-27T10:30:00Z"
  },
  {
    "event_type": "handoff_accepted",
    "actor": { "type": "master", "id": "documentation-master" },
    "timestamp": "2025-11-27T10:30:05Z"
  }
]

Root Cause: Not a reassignment—a handoff. Development completed code, handed off to Documentation for README updates. Working as designed.

Scenario 3: Token Budget Overrun

Problem: The monthly token budget hit its limit on the 15th.

Lineage Query:

// Aggregate token usage from worker_completed events
const events = await lineageQuery.queryByTimeRange(
  '2025-11-01T00:00:00Z',
  '2025-11-15T23:59:59Z'
);

const tokensByMaster = {};
events
  .filter(e => e.event_type === 'worker_completed')
  .forEach(e => {
    const master = e.event_data.master_id;
    const tokens = e.event_data.token_usage?.total_tokens || 0;
    tokensByMaster[master] = (tokensByMaster[master] || 0) + tokens;
  });

console.log(tokensByMaster);

Output:

{
  "development-master": 450000,
  "security-master": 1200000,  // ← Culprit
  "documentation-master": 50000
}

Root Cause: Security master’s code review workers used 1.2M tokens—70% of monthly budget. Reviews were running on every commit, including tiny typo fixes.

Fix: Implement smart review triggers—skip reviews for docs-only changes.

Performance Overhead Considerations

Write Overhead

Lineage tracking adds ~5-10ms per event to task execution. For a typical task with 10 events (created, assigned, started, 5 worker events, completed), that’s 50-100ms total—negligible compared to actual LLM inference time (1-10 seconds).

Mitigation:

  • Buffered writes (100 events before flush)
  • Async logging (non-blocking)
  • Disable in performance-critical paths (rare)

Storage Growth

At ~500 bytes per event:

  • 1,000 tasks/day × 10 events/task × 500 bytes = 5MB/day
  • Annual: ~1.8GB
  • With daily rotation and compression: ~500MB/year

Archival Strategy:

# Rotate daily
mv lineage.jsonl lineage-$(date +%Y-%m-%d).jsonl
gzip lineage-$(date -d '7 days ago' +%Y-%m-%d).jsonl

# Archive to S3 after 30 days
aws s3 cp lineage-2025-10-*.jsonl.gz s3://cortex-archives/lineage/

Query Load

The index file grows with unique entities/actors. At 10,000 tracked entities:

  • Index size: ~500KB (easily fits in memory)
  • Index load time: ~10ms
  • Index TTL refresh: 1 minute (configurable)

For systems tracking millions of entities, consider:

  • Sharded indexes (by date range)
  • SQLite for index storage (B-tree lookups)
  • Read replicas for dashboards

Building Your Own Lineage System

Want to implement task lineage in your own AI agent framework? Here’s the blueprint:

1. Define Your Event Schema

Start with the minimum viable set:

type LineageEvent = {
  lineage_id: string;
  task_id: string;
  event_type: 'created' | 'started' | 'completed' | 'failed';
  timestamp: string; // ISO-8601
  actor: {
    type: 'user' | 'system' | 'agent';
    id: string;
  };
  event_data?: Record<string, any>;
};

Add more event types as your system grows.
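
A concrete event conforming to that type might look like this (values are illustrative):

// Illustrative event matching the minimal schema above.
const event: LineageEvent = {
  lineage_id: 'lin-000123',
  task_id: 'task-deploy-staging-042',
  event_type: 'failed',
  timestamp: new Date().toISOString(),
  actor: { type: 'agent', id: 'worker-deploy-001' },
  event_data: {
    error_type: 'TimeoutError',
    error_message: 'Connection timeout to staging cluster'
  }
};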

2. Choose Your Storage Backend

JSONL (Cortex’s approach):

  • Pros: Simple, portable, easy to parse, diff-friendly
  • Cons: No built-in indexing, manual retention management
  • Best for: <1M events, simple queries

SQLite:

  • Pros: Indexed queries, transactions, relations
  • Cons: Write contention at scale, harder to archive
  • Best for: 1M-10M events, complex queries

PostgreSQL + TimescaleDB:

  • Pros: Time-series optimization, retention policies, distributed queries
  • Cons: Infrastructure overhead
  • Best for: 10M+ events, analytics, multi-tenancy
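
If you go the SQLite route, a minimal schema might look like the sketch below. It assumes the better-sqlite3 package; table, column, and index names are illustrative:

const Database = require('better-sqlite3');

// Minimal SQLite sketch for lineage storage. Indexes cover the two most common
// lookups: by task and by actor, each ordered by timestamp.
const db = new Database('lineage.db');

db.exec(`
  CREATE TABLE IF NOT EXISTS lineage_events (
    lineage_id  TEXT PRIMARY KEY,
    task_id     TEXT NOT NULL,
    event_type  TEXT NOT NULL,
    actor_id    TEXT NOT NULL,
    timestamp   TEXT NOT NULL,
    event_data  TEXT
  );
  CREATE INDEX IF NOT EXISTS idx_events_task  ON lineage_events (task_id, timestamp);
  CREATE INDEX IF NOT EXISTS idx_events_actor ON lineage_events (actor_id, timestamp);
`);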

3. Implement Instrumentation Points

Inject lineage tracking at key lifecycle hooks:

class TaskExecutor {
  async execute(task) {
    // Log task start
    await lineage.recordEvent({
      task_id: task.id,
      event_type: 'task_started',
      actor: { type: 'system', id: 'executor' }
    });

    try {
      const result = await this.runTask(task);

      // Log completion
      await lineage.recordEvent({
        task_id: task.id,
        event_type: 'task_completed',
        event_data: {
          duration_ms: Date.now() - task.start_time,
          deliverables: result.outputs
        }
      });

      return result;
    } catch (error) {
      // Log failure
      await lineage.recordEvent({
        task_id: task.id,
        event_type: 'task_failed',
        event_data: {
          error_type: error.constructor.name,
          error_message: error.message
        }
      });

      throw error;
    }
  }
}

4. Build Query Utilities

Expose queries your users actually need:

class LineageQuery {
  // Get all events for a task
  async getTaskLineage(taskId) { /* ... */ }

  // Find tasks by actor (who did what?)
  async getTasksByActor(actorId) { /* ... */ }

  // Find failures in time range (what broke recently?)
  async getFailures(startTime, endTime) { /* ... */ }

  // Aggregate metrics (how many tasks completed today?)
  async getMetrics(timeRange) { /* ... */ }
}
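
As one example, getFailures can stream the JSONL log and keep only failure events inside the window. A sketch assuming the log format above; LINEAGE_LOG is a placeholder path:

const fs = require('fs');
const readline = require('readline');

const LINEAGE_LOG = 'lineage.jsonl';  // placeholder path

// Sketch of getFailures(): stream the log and keep task_failed / worker_failed
// events in the window. ISO-8601 timestamps compare correctly as strings.
async function getFailures(startTime: string, endTime: string) {
  const rl = readline.createInterface({ input: fs.createReadStream(LINEAGE_LOG) });
  const failures: any[] = [];

  for await (const line of rl) {
    const record = JSON.parse(line);
    const isFailure = record.event_type === 'task_failed' || record.event_type === 'worker_failed';

    if (isFailure && record.timestamp >= startTime && record.timestamp <= endTime) {
      failures.push(record);
    }
  }

  return failures;
}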

5. Optimize for Your Access Patterns

If you query by task_id most often:

  • Index on task_id
  • Partition by task creation date
  • Cache recent task lineages

If you query by actor:

  • Secondary index on actor.id
  • Inverted index (actor → task IDs)

If you need time-series analytics:

  • Use columnar storage (Parquet)
  • Pre-aggregate metrics (daily summaries)
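
For the actor-centric case, the inverted index is just a map from actor ID to the set of task IDs they touched, built once from the event stream (field names follow the schema sketched earlier):

// Sketch: build an inverted index (actor ID → task IDs) from lineage events.
function buildActorIndex(events: { actor: { id: string }; task_id: string }[]) {
  const actorToTasks = new Map<string, Set<string>>();

  for (const e of events) {
    if (!actorToTasks.has(e.actor.id)) actorToTasks.set(e.actor.id, new Set());
    actorToTasks.get(e.actor.id)!.add(e.task_id);
  }

  return actorToTasks;
}

// Usage: actorToTasks.get('security-master') returns the task IDs to look up next.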

The Bottom Line

Task lineage isn’t optional for production AI agent systems—it’s foundational. When (not if) your agents misbehave, you need to know:

  • What happened: Full event timeline
  • Who did it: Actor accountability
  • Why it happened: State transitions and errors
  • How to fix it: Replay, debug, prevent

Cortex’s 18-event lineage system gives you this visibility with minimal overhead (~5ms per event) and fast queries (sub-200ms). Whether you’re debugging a stuck task, tracking token usage, or generating compliance reports, lineage data is your source of truth.

Start simple—track task creation, start, and completion. Add worker events and state transitions as you need them. Before long, you’ll wonder how you ever debugged distributed AI agents without it.


Next in Series: Cortex’s Auto-Learning System: Feedback Loops That Actually Work - How we use lineage data to automatically improve agent performance.

#Cortex #Observability #Debugging #Architecture