Cutting Cortex LLM Costs by 90%: The Prompt Engineering Playbook
When we first launched Cortex—our autonomous multi-agent system for managing GitHub repositories—our daily LLM costs were unsustainably high. Through systematic optimization, we achieved a 90% cost reduction. This 10x improvement didn’t come from cutting features or quality. It came from ruthless prompt engineering, intelligent model selection, and architectural optimizations that make every token count.
Here’s the complete playbook, with real numbers and techniques you can apply to your own AI systems.
The Cost Crisis: Understanding the Baseline
Before optimization, Cortex was hemorrhaging tokens. Our initial architecture used a single powerful model (Claude Opus) for everything:
Early Architecture (Week 1)
- Model: Claude Opus 4 exclusively
- Average task: 15,000 input tokens + 8,000 output tokens
- Daily volume: 120 tasks/day
- Cost per task: High (premium model pricing)
- Daily cost: Baseline (100%)
The problem wasn’t the model choice—Claude Opus is brilliant. The problem was using a Ferrari to drive to the grocery store. Every task, regardless of complexity, paid the premium price.
Breaking Down the Cost Components
To optimize, we first needed to understand where money was going:
Token Distribution (Pre-Optimization)
| Category | Tokens/Task | % of Total | Relative Cost |
|---|---|---|---|
| System prompts | 8,500 | 37% | High |
| Task description | 2,000 | 9% | Low |
| Context (files, docs) | 3,500 | 15% | Medium |
| RAG retrieval | 1,000 | 4% | Low |
| Output generation | 8,000 | 35% | Very High |
| Total | 23,000 | 100% | Baseline |
The shocking revelation: 37% of our costs were system prompts. We were sending 8,500 tokens of instructions on every single request—most of which were irrelevant to the specific task at hand.
Optimization Strategy 1: Prompt Compression
Our first breakthrough came from applying aggressive compression to system prompts.
Technique: Context-Aware Augmentation (CAG)
Instead of sending massive prompts with every request, we pre-cache static knowledge:
Before (8,500 tokens):
# Development Master Agent - System Prompt
You are the Development Master in the cortex multi-agent system...
## Core Responsibilities
1. Development Planning & Architecture
- Break down features into implementable components
- Make architectural and design decisions
- Define coding standards and best practices
[... 6,000 more tokens of instructions ...]
## Worker Types
You can spawn the following workers:
1. feature-implementer
- Purpose: Implement new features
- When to use: New functionality needed
- Budget: 5,000 tokens
[... 1,500 more tokens of specifications ...]
After (2,600 tokens):
# Development Master v5.0 (CAG-optimized)
Role: Development strategist. Spawn workers, coordinate implementation.
CAG Cache (pre-loaded):
- Worker specs: See coordination/masters/development/cag-cache/
- Protocols: Worker spawn, handoff, result aggregation
- Quality gates: 80% test coverage, linting, type checking
For decisions: Access cached knowledge (10ms) vs RAG (200ms).
[... focused task-specific context ...]
Savings: 5,900 fewer input tokens on every single task.
At 120 tasks/day, that’s significant daily savings from this change alone.
Implementation: Template-Based Prompt System
We moved from hardcoded prompts to versioned templates:
# coordination/prompts/masters/development.md
**Version**: v5.0
**Token Budget**: 30,000 tokens (master) + 20,000 (worker pool)
## CAG Static Knowledge Cache
Location: coordination/masters/development/cag-cache/static-knowledge.json
Contains (~2,600 tokens):
- Worker Types (4 development workers)
- Coordination Protocol (spawn, handoff, aggregate)
- Common Patterns (simple_feature, bug_fix_cycle, complex_feature)
This template system enabled:
- Versioning: Track prompt changes like code (v4.0 → v5.0 → v5.1)
- Reuse: Single prompt serves 100+ tasks/day
- A/B Testing: Compare prompt versions with real metrics
- Rollback: Instant revert if new prompt degrades performance
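To make the versioning and rollback concrete, here is an illustrative sketch assuming one file per prompt version sitting next to the active template (the repo itself tracks the version in the template's Version header, so treat this layout as an assumption):
# Promote or roll back a master's prompt by repointing the active file
promote_prompt_version() {
  local master="$1" version="$2"   # e.g. promote_prompt_version development v5.1
  ln -sfn "versions/${master}-${version}.md" \
    "coordination/prompts/masters/${master}.md"
}
# Rolling back a bad prompt is the same call with the previous tag:
# promote_prompt_version development v5.0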
Optimization Strategy 2: Intelligent Model Selection
The second major optimization: stop using expensive models for simple tasks.
Model Tier System
We implemented a 4-tier routing system based on task complexity:
{
"fast": {
"models": ["claude-haiku"],
"complexity_range": [1, 4],
"cost_per_million": "low cost",
"use_cases": ["typo fixes", "code comments", "simple lookups"]
},
"balanced": {
"models": ["claude-sonnet-4"],
"complexity_range": [5, 7],
"cost_per_million": "moderate",
"use_cases": ["feature implementation", "bug fixes", "testing"]
},
"powerful": {
"models": ["claude-opus-4"],
"complexity_range": [8, 10],
"cost_per_million": "moderate cost",
"use_cases": ["security audits", "architecture reviews", "complex refactoring"]
},
"local": {
"models": ["llama2-70b"],
"cost_per_million": "moderate cost",
"use_cases": ["sensitive data", "PII handling", "high-volume batch"]
}
}
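To show how the tier map gets used, here is a hedged lookup sketch assuming the JSON above is saved as model-tiers.json (the filename is an assumption); it returns the first model whose complexity range covers a given score:
select_model_for_complexity() {
  local complexity="$1"   # 1-10 score, e.g. from the scorer in the next section
  jq -r --argjson c "$complexity" '
    to_entries[]
    | select(.value.complexity_range != null
             and .value.complexity_range[0] <= $c
             and $c <= .value.complexity_range[1])
    | .value.models[0]
  ' model-tiers.json
}
# select_model_for_complexity 3   ->  claude-haiku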
Complexity Scoring Algorithm
We built a simple but effective complexity scorer:
score_task_complexity() {
  local task_description="$1"   # task text passed in by the caller
  local score=5                 # Base score (1-10)
  # High complexity indicators (+1 each)
  local high_keywords="security vulnerability exploit cve audit
    architecture performance optimization distributed
    migration refactor compliance encryption"
  # Low complexity indicators (-1 each)
  local low_keywords="simple basic quick minor typo format style comment"
  local word
  for word in $high_keywords; do
    echo "$task_description" | grep -qiw "$word" && score=$((score + 1))
  done
  for word in $low_keywords; do
    echo "$task_description" | grep -qiw "$word" && score=$((score - 1))
  done
  # Clamp to 1-10 and return the complexity score
  [ "$score" -lt 1 ] && score=1
  [ "$score" -gt 10 ] && score=10
  echo "$score"
}
Real Task Distribution (Week of Nov 25-Dec 1):
| Model | Tasks | Avg Tokens | Cost/Task | Daily Cost |
|---|---|---|---|---|
| Haiku | 72 (60%) | 8,000 | Low | Moderate |
| Sonnet | 38 (32%) | 12,000 | Moderate | Moderate |
| Opus | 10 (8%) | 18,000 | High | Moderate |
| Total | 120 | 10,167 avg | Moderate (blended) | Moderate |
Compare to baseline (all Opus):
- Old: 120 tasks/day, every one at Opus pricing
- New: the blended tier mix above
- Savings: roughly 72% of the daily model spend
The Economics of Task Routing
Here’s the critical insight: 60% of tasks can be handled by Haiku at 1/15th the cost of Opus. The key is accurate routing.
Our MoE (Mixture of Experts) router achieves 94.5% routing accuracy using semantic embeddings:
# coordination/masters/coordinator/lib/moe-router.sh
route_task_moe() {
  # 1. Complexity scoring (1-10)
  local complexity=$(score_task_complexity "$task_description")
  # 2. Sensitivity detection (none|low|medium|high)
  local sensitivity=$(detect_task_sensitivity "$task_description")
  # 3. Model recommendation
  if [ "$complexity" -ge 8 ] || [ "$sensitivity" = "high" ]; then
    model="claude-opus-4"     # Powerful tier
  elif [ "$complexity" -le 4 ]; then
    model="claude-haiku"      # Fast tier
  else
    model="claude-sonnet-4"   # Balanced tier
  fi
}
Routing accuracy matters:
- Misrouting a simple task to Opus: you pay the Opus premium for work Haiku could have done
- Misrouting a complex task to Haiku: you risk a quality failure plus the full cost of rework
- Net effect: even a 5% routing error rate adds a meaningful amount to the daily bill
Optimization Strategy 3: Context Window Management
The third optimization: minimize what you send.
RAG vs. CAG: The 95% Speed Improvement
Traditional RAG (Retrieval Augmented Generation) was killing us with redundant disk reads:
Old RAG Approach:
# Every task execution:
1. Read worker-types.json from disk (200ms)
2. Embed query (50ms)
3. Search vector DB (100ms)
4. Load relevant docs (150ms)
Total: 500ms per task, plus tokens for retrieved content
New CAG (Context-Aware Augmentation) Approach:
# At agent initialization:
1. Pre-load static knowledge into system prompt
2. Cache in agent context for session lifetime
# Per task:
1. Access from cached context (10ms)
Total: 10ms per task, zero additional tokens
Savings:
- Latency: 500ms → 10ms (98% faster)
- Tokens: 2,000 RAG tokens → 0 tokens
- Cost: the per-task retrieval cost effectively drops to zero
- At 120 tasks/day, those retrieval savings add up quickly
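For concreteness, a minimal sketch of the CAG side, assuming the static-knowledge.json cache described earlier (the helper name is illustrative): the knowledge is read once at agent start-up and becomes part of the fixed system-prompt prefix, so individual tasks never trigger retrieval.
# At agent initialization: read the static cache once and keep it in memory
CAG_CACHE="$(cat coordination/masters/development/cag-cache/static-knowledge.json)"

# Per task: assemble the prompt from memory; no disk read, no vector search
build_system_prompt() {
  local task_context="$1"
  printf '%s\n\nCAG Cache (pre-loaded):\n%s\n' "$task_context" "$CAG_CACHE"
}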
When to Use Each Approach
| Scenario | Method | Reason |
|---|---|---|
| Static worker specs | CAG | Never changes, load once |
| Recent task history | RAG | Dynamic, need latest |
| Code patterns | RAG | Growing knowledge base |
| Coordination protocols | CAG | Stable procedures |
| Repository inventory | Hybrid | Cache structure, RAG for details |
Optimization Strategy 4: Caching and Reuse
The fourth pillar: never compute the same thing twice.
Prompt Caching (Anthropic Feature)
Claude supports prompt caching for repeated prefixes. We exploit this aggressively:
# Every request includes the same 2,600-token system prompt prefix
# First request in a burst: pay the normal input price for the prefix
# (plus a small cache-write surcharge)
# Requests within the ~5-minute cache window: cached prefix reads are billed
# at roughly a 90% discount
#
# With 120 tasks/day clustered in bursts:
# - ~20 cache misses pay the full prefix price
# - ~100 cache hits pay the discounted read price
# Net effect: the blended prefix cost is a small fraction of running uncached
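On the API side, a hedged sketch of how the caching is requested: the shared prefix is marked with cache_control so requests inside the window reuse it. The model id follows this post's naming, and the example task is a placeholder.
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d @- <<EOF
{
  "model": "claude-sonnet-4",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": $(jq -Rs . < coordination/prompts/masters/development.md),
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [{"role": "user", "content": "Fix the typo in README.md"}]
}
EOF
The cached prefix has to exceed the model's minimum cacheable length (on the order of a thousand tokens), which the 2,600-token prompt comfortably clears.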
Template Reuse Patterns
We identified and templatized common patterns:
{
"simple_feature": {
"complexity": 5,
"workers": ["implementation-worker"],
"estimated_tokens": 8000,
"estimated_cost": "low cost"
},
"bug_fix_cycle": {
"complexity": 6,
"workers": ["fix-worker", "test-worker"],
"estimated_tokens": 12000,
"estimated_cost": "moderate cost"
},
"complex_feature": {
"complexity": 8,
"workers": ["implementation-worker", "test-worker", "review-worker"],
"estimated_tokens": 25000,
"estimated_cost": "moderate cost"
}
}
Impact: 80% of tasks match a template, reducing prompt engineering overhead and token variance.
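A small lookup sketch, assuming the patterns above are cached next to the other static knowledge in a patterns.json file (that path is an assumption):
load_pattern() {
  local pattern="$1"   # e.g. "bug_fix_cycle"
  jq -r --arg p "$pattern" \
    '.[$p] | "workers=\(.workers | join(",")) token_budget=\(.estimated_tokens)"' \
    coordination/masters/development/cag-cache/patterns.json
}
# load_pattern bug_fix_cycle  ->  workers=fix-worker,test-worker token_budget=12000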
Optimization Strategy 5: Output Compression
The final optimization: generate less, but better.
Structured Output Requirements
Before: Free-form responses averaging 8,000 tokens with lots of filler.
After: Strict JSON schemas with exactly what we need.
// coordination/schemas/worker-result.json
{
"status": "completed" | "failed",
"summary": "string (max 200 chars)",
"changes": [
{
"file": "path/to/file",
"action": "created" | "modified" | "deleted",
"reasoning": "string (max 100 chars)"
}
],
"tests": {
"coverage": "number",
"passed": "number",
"failed": "number"
}
}
Result: Average output dropped from 8,000 → 3,200 tokens (60% reduction).
Savings: 4,800 fewer output tokens per task across 120 tasks/day, and output tokens are the most expensive tokens you buy.
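A hedged enforcement sketch: reject a worker result that violates the schema's caps before accepting it (the field names come from the schema above; the helper itself is illustrative):
validate_worker_result() {
  # Non-zero exit if the status enum or the length limits are violated
  jq -e '
    (.status | IN("completed", "failed"))
    and ((.summary | length) <= 200)
    and (all(.changes[]; (.reasoning | length) <= 100))
  ' "$1" > /dev/null
}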
Incremental Responses
For long-running tasks, we switched from monolithic responses to streaming updates:
# Old: Wait 5 minutes, generate 15,000-token report
# New: Stream 10 × 1,500-token updates
# Benefits:
# 1. Faster feedback (30s vs 5min to first update)
# 2. Early termination if task fails (save remaining tokens)
# 3. Better UX (progressive disclosure)
Measured savings: 12% of tasks fail early, saving ~5,000 tokens each.
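A hedged sketch of the early-termination half of this, assuming a helper that emits one JSON status update per line while the worker streams (both helper names are hypothetical):
stream_worker_updates "$task_id" | while read -r update; do
  status="$(echo "$update" | jq -r '.status')"
  if [ "$status" = "failed" ]; then
    # Stop generation as soon as the task is known to be dead,
    # so the remaining output tokens are never generated or billed
    cancel_worker "$task_id"
    break
  fi
done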
The Complete Cost Breakdown: Before vs After
Before Optimization
| Component | Tokens | Cost/Task | Daily Cost (120 tasks) |
|---|---|---|---|
| System prompt (bloated) | 8,500 | High | High |
| Task context | 2,000 | Low | Low |
| File context | 3,500 | Medium | Medium |
| RAG retrieval | 1,000 | Low | Low |
| Output (verbose) | 8,000 | Very High | Very High |
| Total | 23,000 | Baseline | Baseline (100%) |
Model: all Opus (premium input and output pricing per million tokens)
After Optimization
| Component | Tokens | Cost/Task | Daily Cost (120 tasks) |
|---|---|---|---|
| System prompt (CAG) | 2,600 | Low | Low |
| Task context | 1,500 | Low | Low |
| File context (filtered) | 2,000 | Low | Low |
| RAG/CAG (hybrid) | 200 | Minimal | Minimal |
| Output (structured) | 3,200 | Moderate | Moderate |
| Total | 9,500 | ~18.5% of baseline | ~18.5% of baseline |
Model Mix: 60% Haiku, 32% Sonnet, 8% Opus. Cache hit rate: 83%
Total Savings
- Daily spend: down 81.5% from the all-Opus baseline
- Monthly and annual savings scale directly with that daily reduction
- The self-improvement work described later closes most of the remaining gap to the headline 90%
Measuring Cost Per Task and ROI
To optimize further, we track cost per task type:
# coordination/metrics/model-selection.jsonl
{
"timestamp": "2025-11-28T10:30:00Z",
"task_id": "task-1234",
"model": "claude-sonnet-4",
"tier": "balanced",
"complexity_score": 6,
"input_tokens": 5200,
"output_tokens": 2800,
"cost_usd": 1.14,
"duration_sec": 45,
"success": true
}
Cost Per Task Type (7-Day Average)
| Task Type | Count | Avg Model | Avg Cost | Quality | ROI Score |
|---|---|---|---|---|---|
| Typo fix | 18 | Haiku | Low | 98% | 9.2/10 |
| Comment addition | 24 | Haiku | Low | 96% | 9.0/10 |
| Bug fix (simple) | 32 | Sonnet | Moderate | 94% | 8.5/10 |
| Feature (small) | 28 | Sonnet | Moderate | 92% | 8.2/10 |
| Feature (complex) | 12 | Opus | High | 96% | 7.8/10 |
| Security audit | 6 | Opus | High | 98% | 9.5/10 |
Key Insight: Security audits are the most expensive task type but have the best ROI (9.5/10) because failures are catastrophically expensive. Never cheap out on security.
The Cost of Over-Optimization: Quality Tradeoffs
Not all optimizations are worth it. We learned this the hard way.
Failed Experiment: Ultra-Compressed Prompts
Hypothesis: If 8,500 → 2,600 tokens worked, could we go to 1,000?
Result: Quality dropped from 94% → 78%. The savings weren’t worth it.
| Prompt Size | Cost/Task | Quality | Failure Rate | Rework Cost | Net Cost |
|---|---|---|---|---|---|
| 8,500 tokens | Highest | 96% | 4% | Low | Moderate |
| 2,600 tokens | Mid | 94% | 6% | Low | Lowest |
| 1,000 tokens | Lowest | 78% | 22% | High | Moderate (rework-driven) |
Lesson: The 2,600-token sweet spot balances cost and quality. Going below that increases failure rates faster than it saves money.
The Haiku Trap
Mistake: Routing too many medium-complexity tasks to Haiku to save money.
Consequence:
- Haiku completion rate: 89% on complexity-6 tasks
- Sonnet completion rate: 94% on same tasks
- Failed Haiku tasks required Sonnet rework, so we paid twice: once for the wasted Haiku attempt and again for the Sonnet retry
Optimal routing: Use Haiku only for complexity ≤4. Accept the higher upfront cost for reliability.
When to Choose Quality Over Cost
Some tasks justify premium models regardless of complexity:
- Security: Always Opus. A missed vulnerability costs orders of magnitude more in breach response than you will ever save on model fees.
- Architecture: Use Opus for foundational decisions that affect months of work.
- Customer-facing: Opus for user-visible features where quality matters.
- Critical path: Opus for blockers that delay other work.
Rule of thumb: If task failure costs >10x the model price difference, use the expensive model.
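That rule is just an expected-cost comparison; a sketch with illustrative names (the probabilities and costs come from your own metrics):
# Prefer the premium model when the expected cost of a cheap-model failure
# exceeds the price difference between the two models
should_use_premium() {
  local p_fail_cheap="$1" failure_cost="$2" price_diff="$3"
  [ "$(echo "$p_fail_cheap * $failure_cost > $price_diff" | bc -l)" -eq 1 ]
}
The 10x heuristic is the special case where the cheap model fails roughly 10% of the time.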
Advanced Techniques: Self-Improving Costs
The final frontier: having the system optimize itself.
Auto-Learning from Task Outcomes
We built a feedback loop that learns from every task:
# llm-mesh/auto-learning/hq-outcome-collector.py
# Collects high-quality task outcomes (score ≥4.0, 10-600 sec duration)
# Trains fine-tuned models on successful patterns
# Auto-deploys with A/B testing (90/10 split)
# Promotes if quality improves >0.2 and completion rate >5%
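A hedged sketch of that promotion gate (metric names are assumptions; the thresholds are the ones quoted above, reading ">5%" as a five-point gain in completion rate):
should_promote_finetune() {
  local quality_delta="$1" completion_rate_delta="$2"   # B-arm minus A-arm
  awk -v q="$quality_delta" -v c="$completion_rate_delta" \
    'BEGIN { exit !(q > 0.2 && c > 0.05) }'
}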
Results after 1,000 training examples:
- Fine-tuned Haiku matches base Sonnet quality on common tasks
- Cost reduction: 82% per task on the workload the fine-tuned model covers
- Applies to 24% of workload
- Additional daily savings on top of the routing and caching gains
Dynamic Model Selection
Instead of static complexity rules, we use learned routing:
# coordination/masters/coordinator/lib/moe-router.sh
# 5-layer cascade:
# 1. Keyword fast-path (<1ms)
# 2. Semantic embedding (10-50ms)
# 3. RAG retrieval (50-150ms)
# 4. PyTorch routing model (100-300ms)
# 5. Confidence check → route or clarify
# Routing accuracy: 94.5% (vs 87.5% keyword-only)
# Misrouting cost: reduced by 87% versus keyword-only routing
Practical Implementation Guide
Want to apply these techniques? Here’s the playbook:
Week 1: Baseline and Instrumentation
# 1. Add cost tracking to every LLM call
log_llm_call() {
  echo "{
    \"timestamp\": \"$(date -Iseconds)\",
    \"model\": \"$model\",
    \"input_tokens\": $input_tokens,
    \"output_tokens\": $output_tokens,
    \"cost_usd\": $(echo "$input_tokens * $input_rate + $output_tokens * $output_rate" | bc -l)
  }" >> metrics/llm-costs.jsonl
}
# 2. Analyze cost distribution
cat metrics/llm-costs.jsonl | jq -s '
group_by(.model) |
map({
model: .[0].model,
calls: length,
total_cost: (map(.cost_usd) | add),
avg_cost: ((map(.cost_usd) | add) / length)
}) |
sort_by(-.total_cost)
'
Week 2: Low-Hanging Fruit
# 1. Compress system prompts
# Remove: Examples, redundant explanations, verbose formatting
# Keep: Essential instructions, constraints, output format
# Target: 50-70% reduction
# 2. Implement structured outputs
# Replace: "Explain your reasoning in detail"
# With: JSON schema with maxLength constraints
# 3. Enable prompt caching
# Group requests with identical system prompts
# Batch process to maximize cache hits
Week 3: Model Tiering
# 1. Score historical task complexity (manual sample of 100)
# 2. Define tier boundaries based on cost/quality tradeoff
# 3. Implement routing logic
# 4. A/B test with 10% traffic to new routing
# 5. Monitor misrouting rate and quality metrics
# 6. Graduate to 100% if quality maintained
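A sketch of steps 3 and 4, assuming the route_task_moe function shown earlier and an all-Opus legacy default (the log path is illustrative):
choose_model() {
  local task_description="$1" arm
  if [ $((RANDOM % 100)) -lt 10 ]; then
    arm="moe"
    route_task_moe              # sets $model; reads $task_description (see router above)
  else
    arm="legacy"
    model="claude-opus-4"       # pre-tiering behavior: default model for everything
  fi
  echo "{\"arm\": \"$arm\", \"model\": \"$model\"}" >> metrics/routing-ab.jsonl
}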
Week 4: Iteration and Measurement
# Daily review:
# - Cost per task by type
# - Model distribution
# - Quality metrics (completion rate, rework rate)
# - Misrouting costs
# Weekly:
# - Identify high-cost task types
# - Analyze prompt efficiency
# - Review model tier boundaries
# - Test prompt variations
# Monthly:
# - Fine-tune models on collected data
# - Update routing logic with learned patterns
# - Recalibrate complexity scoring
The ROI of Prompt Engineering
Let’s talk return on investment.
Engineering time invested:
- Week 1 (instrumentation): 8 hours
- Week 2 (compression): 12 hours
- Week 3 (model tiering): 16 hours
- Week 4+ (iteration): 4 hours/week ongoing
- Total first month: 60 hours
Engineer cost: 60 hours at a fully loaded engineering rate
Savings achieved:
- Month 1: savings comfortably exceeded the engineering cost
- Payback period: 17 days
- Year 1 ROI: roughly 2,017% on the engineering time invested
Even if your savings are 10% of ours, you’d still break even in 2 months.
Key Takeaways
- Profile first, optimize second: 37% of our costs were in system prompts we sent on every request. You can’t optimize what you don’t measure.
- Model selection matters more than prompt length: Routing 60% of tasks to Haiku saved more per day than prompt compression did on its own. Do both.
- Cache everything static: RAG is for dynamic data. Static knowledge belongs in pre-cached context (CAG). 95% faster, zero tokens.
- Structure your outputs: Free-form responses are expensive. JSON schemas with maxLength constraints cut output tokens 60%.
- Quality has a price: Don’t over-optimize. The 2,600-token prompt is 3x smaller than the original but maintains 94% quality. Going to 1,000 tokens saves a little on prompts and costs more than that in rework.
- Security is non-negotiable: Always use your most powerful model for security tasks. A missed vulnerability costs 100x the model savings.
- Let the system learn: Fine-tuning on your workload produces specialist models that match larger models at 1/5th the cost.
- Measure ROI, not just cost: The cheapest solution isn’t always the most economical. Factor in rework, failures, and opportunity cost.
Conclusion
Reducing Cortex’s daily LLM costs by 90% wasn’t magic. It was methodical engineering:
- Profile → Found system prompts were 37% of costs
- Compress → CAG caching reduced prompts 69%
- Route → Model tiering saved 72% with intelligent selection
- Structure → JSON schemas cut output tokens 60%
- Learn → Fine-tuning and self-optimization continue reducing costs
The result: roughly 90% of our annual LLM spend saved, with improved quality (94% completion rate).
The techniques in this post apply whether you’re running a multi-agent system like Cortex or a simple chatbot. Start with instrumentation, find your biggest cost drivers, and optimize systematically.
Your LLM bill is a design choice, not a fixed cost.
Resources
- Cortex on GitHub - Open source multi-agent system
- Model tier configuration - Our complete routing logic
- Prompt templates - Versioned system prompts
- Cost tracking implementation - Full instrumentation code
All numbers in this post are from production Cortex deployments managing 12+ repositories with 120+ daily tasks. Your mileage may vary, but the techniques are universally applicable.