
Cutting Cortex LLM Costs by 90%: The Prompt Engineering Playbook

Ryan Dahlberg
December 29, 2025 · 17 min read

When we first launched Cortex—our autonomous multi-agent system for managing GitHub repositories—our daily LLM costs were unsustainably high. Through systematic optimization, we achieved a 90% cost reduction. This 10x improvement didn’t come from cutting features or quality. It came from ruthless prompt engineering, intelligent model selection, and architectural optimizations that make every token count.

Here’s the complete playbook, with real numbers and techniques you can apply to your own AI systems.

The Cost Crisis: Understanding the Baseline

Before optimization, Cortex was hemorrhaging tokens. Our initial architecture used a single powerful model (Claude Opus) for everything:

Early Architecture (Week 1)

  • Model: Claude Opus 4 exclusively
  • Average task: 15,000 input tokens + 8,000 output tokens
  • Daily volume: 120 tasks/day
  • Cost per task: High (premium model pricing)
  • Daily cost: Baseline (100%)

The problem wasn’t the model choice—Claude Opus is brilliant. The problem was using a Ferrari to drive to the grocery store. Every task, regardless of complexity, paid the premium price.

Breaking Down the Cost Components

To optimize, we first needed to understand where money was going:

Token Distribution (Pre-Optimization)

| Category | Tokens/Task | % of Total | Relative Cost |
|---|---|---|---|
| System prompts | 8,500 | 37% | High |
| Task description | 2,000 | 9% | Low |
| Context (files, docs) | 3,500 | 15% | Medium |
| RAG retrieval | 1,000 | 4% | Low |
| Output generation | 8,000 | 35% | Very High |
| Total | 23,000 | 100% | Baseline |

The shocking revelation: 37% of our costs were system prompts. We were sending 8,500 tokens of instructions on every single request—most of which were irrelevant to the specific task at hand.

Optimization Strategy 1: Prompt Compression

Our first breakthrough came from applying aggressive compression to system prompts.

Technique: Context-Aware Augmentation (CAG)

Instead of sending massive prompts with every request, we pre-cache static knowledge:

Before (8,500 tokens):

# Development Master Agent - System Prompt

You are the Development Master in the cortex multi-agent system...

## Core Responsibilities
1. Development Planning & Architecture
   - Break down features into implementable components
   - Make architectural and design decisions
   - Define coding standards and best practices
   [... 6,000 more tokens of instructions ...]

## Worker Types
You can spawn the following workers:
1. feature-implementer
   - Purpose: Implement new features
   - When to use: New functionality needed
   - Budget: 5,000 tokens
   [... 1,500 more tokens of specifications ...]

After (2,600 tokens):

# Development Master v5.0 (CAG-optimized)

Role: Development strategist. Spawn workers, coordinate implementation.

CAG Cache (pre-loaded):
- Worker specs: See coordination/masters/development/cag-cache/
- Protocols: Worker spawn, handoff, result aggregation
- Quality gates: 80% test coverage, linting, type checking

For decisions: Access cached knowledge (10ms) vs RAG (200ms).
[... focused task-specific context ...]

Savings: 5,900 tokens per task × low cost/1K tokens = low cost saved per task

At 120 tasks/day, that’s significant daily savings from this change alone.

Implementation: Template-Based Prompt System

We moved from hardcoded prompts to versioned templates:

# coordination/prompts/masters/development.md
**Version**: v5.0
**Token Budget**: 30,000 tokens (master) + 20,000 (worker pool)

## CAG Static Knowledge Cache
Location: coordination/masters/development/cag-cache/static-knowledge.json
Contains (~2,600 tokens):
- Worker Types (4 development workers)
- Coordination Protocol (spawn, handoff, aggregate)
- Common Patterns (simple_feature, bug_fix_cycle, complex_feature)

This template system enabled:

  • Versioning: Track prompt changes like code (v4.0 → v5.0 → v5.1)
  • Reuse: Single prompt serves 100+ tasks/day
  • A/B Testing: Compare prompt versions with real metrics
  • Rollback: Instant revert if new prompt degrades performance
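
To make the template-plus-cache split concrete, here is a minimal sketch of how a master prompt could be assembled at spawn time. The paths match the layout referenced above, but build_master_prompt itself is illustrative, not the production implementation.

# Sketch: assemble a master prompt from the versioned template plus the CAG cache.
# build_master_prompt is a hypothetical helper; paths follow the layout shown above.
build_master_prompt() {
  local template="coordination/prompts/masters/development.md"
  local cache="coordination/masters/development/cag-cache/static-knowledge.json"

  # Versioned template first, then the pre-compressed static knowledge.
  cat "$template"
  echo ""
  echo "## CAG Static Knowledge (pre-loaded, ~2,600 tokens)"
  jq -c . "$cache"
}

# Built once per agent session, then reused for every task in that session.
SYSTEM_PROMPT="$(build_master_prompt)"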

Optimization Strategy 2: Intelligent Model Selection

The second major optimization: stop using expensive models for simple tasks.

Model Tier System

We implemented a 4-tier routing system based on task complexity:

{
  "fast": {
    "models": ["claude-haiku"],
    "complexity_range": [1, 4],
    "cost_per_million": "low cost",
    "use_cases": ["typo fixes", "code comments", "simple lookups"]
  },
  "balanced": {
    "models": ["claude-sonnet-4"],
    "complexity_range": [5, 7],
    "cost_per_million": "moderate",
    "use_cases": ["feature implementation", "bug fixes", "testing"]
  },
  "powerful": {
    "models": ["claude-opus-4"],
    "complexity_range": [8, 10],
    "cost_per_million": "moderate cost",
    "use_cases": ["security audits", "architecture reviews", "complex refactoring"]
  },
  "local": {
    "models": ["llama2-70b"],
    "cost_per_million": "moderate cost",
    "use_cases": ["sensitive data", "PII handling", "high-volume batch"]
  }
}
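
Given a tier config like the one above (assume it lives at a path such as coordination/config/model-tiers.json, which is illustrative), mapping a complexity score to a tier is a small jq lookup:

# Sketch: map a complexity score (1-10) to a tier using the JSON config above.
# The config path is a placeholder.
tier_for_complexity() {
  local score="$1"
  jq -r --argjson s "$score" '
    to_entries[]
    | select(.value.complexity_range != null)
    | select(.value.complexity_range[0] <= $s and $s <= .value.complexity_range[1])
    | .key
  ' coordination/config/model-tiers.json | head -n1
}

tier_for_complexity 3   # -> fast
tier_for_complexity 9   # -> powerful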

Complexity Scoring Algorithm

We built a simple but effective complexity scorer:

score_task_complexity() {
  local task_description="$1"
  local score=5  # Base score (1-10)

  # High complexity indicators (+1 each)
  local high_keywords="security vulnerability exploit cve audit
                       architecture performance optimization distributed
                       migration refactor compliance encryption"

  # Low complexity indicators (-1 each)
  local low_keywords="simple basic quick minor typo format style comment"

  # +1 for each high-complexity keyword found, -1 for each low-complexity keyword
  local kw
  for kw in $high_keywords; do
    echo "$task_description" | grep -qiw "$kw" && score=$((score + 1))
  done
  for kw in $low_keywords; do
    echo "$task_description" | grep -qiw "$kw" && score=$((score - 1))
  done

  # Clamp to 1-10 and return the complexity score
  [ "$score" -lt 1 ] && score=1
  [ "$score" -gt 10 ] && score=10
  echo "$score"
}
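
Two quick calls against the sketch above show how keyword hits move the score:

score_task_complexity "Fix typo in README"                          # ~4: "typo" lowers the score -> fast tier
score_task_complexity "Audit encryption handling for CVE exposure"  # ~8: audit, encryption, cve -> powerful tier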

Real Task Distribution (Week of Nov 25-Dec 1):

| Model | Tasks | Avg Tokens | Cost/Task | Daily Cost |
|---|---|---|---|---|
| Haiku | 72 (60%) | 8,000 | low cost | moderate cost |
| Sonnet | 38 (32%) | 12,000 | moderate cost | moderate cost |
| Opus | 10 (8%) | 18,000 | moderate cost | moderate cost |
| Total | 120 | 10,167 avg | moderate avg | moderate cost |

Compare to baseline (all Opus):

  • Old: 120 tasks × moderate cost = moderate cost/day
  • New: moderate cost/day
  • Savings: moderate cost/day (72%)

The Economics of Task Routing

Here’s the critical insight: 60% of tasks can be handled by Haiku at 1/15th the cost of Opus. The key is accurate routing.

Our MoE (Mixture of Experts) router achieves 94.5% routing accuracy using semantic embeddings:

# coordination/masters/coordinator/lib/moe-router.sh
route_task_moe() {
  local task_description="$1"

  # 1. Complexity scoring (1-10)
  local complexity=$(score_task_complexity "$task_description")

  # 2. Sensitivity detection (none|low|medium|high)
  local sensitivity=$(detect_task_sensitivity "$task_description")

  # 3. Model recommendation
  local model
  if [ "$complexity" -ge 8 ] || [ "$sensitivity" = "high" ]; then
    model="claude-opus-4"    # Powerful tier
  elif [ "$complexity" -le 4 ]; then
    model="claude-haiku"     # Fast tier
  else
    model="claude-sonnet-4"  # Balanced tier
  fi

  echo "$model"
}
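
The router also calls detect_task_sensitivity, which isn't shown in the post. Here is a minimal keyword-based sketch of what such a check could look like; the production version is almost certainly more sophisticated.

# Sketch only: a keyword heuristic for the sensitivity check used by route_task_moe.
detect_task_sensitivity() {
  local task_description="$1"
  if echo "$task_description" | grep -qiwE "pii|password|secret|credential|api[_ ]?key"; then
    echo "high"
  elif echo "$task_description" | grep -qiwE "customer|billing|email"; then
    echo "medium"
  else
    echo "none"
  fi
}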

Routing accuracy matters:

  • Misrouting a simple task to Opus: the premium is wasted (moderate cost per task)
  • Misrouting a complex task to Haiku: risk of quality failure (moderate cost+ in rework)
  • Net cost of a 5% routing error: ~moderate cost/day

Optimization Strategy 3: Context Window Management

The third optimization: minimize what you send.

RAG vs. CAG: The 95% Speed Improvement

Traditional RAG (Retrieval Augmented Generation) was killing us with redundant disk reads:

Old RAG Approach:

# Every task execution:
1. Read worker-types.json from disk (200ms)
2. Embed query (50ms)
3. Search vector DB (100ms)
4. Load relevant docs (150ms)
Total: 500ms per task, plus tokens for retrieved content

New CAG (Context-Aware Augmentation) Approach:

# At agent initialization:
1. Pre-load static knowledge into system prompt
2. Cache in agent context for session lifetime

# Per task:
1. Access from cached context (10ms)
Total: 10ms per task, zero additional tokens

Savings:

  • Latency: 500ms → 10ms (98% faster)
  • Tokens: 2,000 RAG tokens → 0 tokens
  • Cost: the per-task retrieval cost drops to effectively zero
  • At 120 tasks/day: moderate cost/day saved

When to Use Each Approach

| Scenario | Method | Reason |
|---|---|---|
| Static worker specs | CAG | Never changes, load once |
| Recent task history | RAG | Dynamic, need latest |
| Code patterns | RAG | Growing knowledge base |
| Coordination protocols | CAG | Stable procedures |
| Repository inventory | Hybrid | Cache structure, RAG for details |

Optimization Strategy 4: Caching and Reuse

The fourth pillar: never compute the same thing twice.

Prompt Caching (Anthropic Feature)

Claude supports prompt caching for repeated prefixes. We exploit this aggressively:

# Every request includes the same 2,600-token system prompt
# First request: Pay full price (low cost)
# Next 5 minutes: Pay only for cache hit (low cost = 90% discount)
#
# With 120 tasks/day clustered in bursts:
# - 20 cache misses × low cost = low cost
# - 100 cache hits × low cost = low cost
# Total: moderate cost vs moderate cost uncached
# Savings: moderate cost/day
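
Mechanically, enabling this means marking the stable prompt prefix with cache_control on each request. A minimal curl sketch follows; the model ID, prompt file, and environment variable are placeholders, so adapt them to your setup.

# Sketch: mark the shared system prompt as cacheable so repeat requests hit the cache.
# ANTHROPIC_API_KEY, system-prompt.md, and the model ID are placeholders.
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d "$(jq -n --rawfile sys system-prompt.md '{
    model: "claude-sonnet-4",
    max_tokens: 1024,
    system: [{type: "text", text: $sys, cache_control: {type: "ephemeral"}}],
    messages: [{role: "user", content: "Fix the typo in README.md"}]
  }')"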

Template Reuse Patterns

We identified and templatized common patterns:

{
  "simple_feature": {
    "complexity": 5,
    "workers": ["implementation-worker"],
    "estimated_tokens": 8000,
    "estimated_cost": "low cost"
  },
  "bug_fix_cycle": {
    "complexity": 6,
    "workers": ["fix-worker", "test-worker"],
    "estimated_tokens": 12000,
    "estimated_cost": "moderate cost"
  },
  "complex_feature": {
    "complexity": 8,
    "workers": ["implementation-worker", "test-worker", "review-worker"],
    "estimated_tokens": 25000,
    "estimated_cost": "moderate cost"
  }
}

Impact: 80% of tasks match a template, reducing prompt engineering overhead and token variance.
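
A sketch of how an incoming task could be matched against these templates, assuming the JSON above is stored at a path like coordination/templates/task-patterns.json (the path and helper name are illustrative):

# Sketch: look up a task template and read its expected workers and token budget.
lookup_template() {
  local pattern="$1"   # e.g. "bug_fix_cycle"
  jq -r --arg p "$pattern" '
    .[$p] | "complexity=\(.complexity) workers=\(.workers | join(",")) est_tokens=\(.estimated_tokens)"
  ' coordination/templates/task-patterns.json
}

lookup_template "bug_fix_cycle"
# complexity=6 workers=fix-worker,test-worker est_tokens=12000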

Optimization Strategy 5: Output Compression

The final optimization: generate less, but better.

Structured Output Requirements

Before: Free-form responses averaging 8,000 tokens with lots of filler.

After: Strict JSON schemas with exactly what we need.

// coordination/schemas/worker-result.json
{
  "status": "completed" | "failed",
  "summary": "string (max 200 chars)",
  "changes": [
    {
      "file": "path/to/file",
      "action": "created" | "modified" | "deleted",
      "reasoning": "string (max 100 chars)"
    }
  ],
  "tests": {
    "coverage": "number",
    "passed": "number",
    "failed": "number"
  }
}

Result: Average output dropped from 8,000 → 3,200 tokens (60% reduction).

Savings: 4,800 tokens × low cost/1K × 120 tasks = moderate cost/day
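
A lightweight way to keep those limits honest is to validate worker output before accepting it. A sketch using jq, with field names taken from the schema above (the helper itself is illustrative):

# Sketch: reject worker results that violate the schema's size limits.
validate_worker_result() {
  local result_file="$1"
  jq -e '
    (.status == "completed" or .status == "failed")
    and (.summary | length <= 200)
    and (all(.changes[]; .reasoning | length <= 100))
  ' "$result_file" > /dev/null || { echo "schema violation: $result_file" >&2; return 1; }
}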

Incremental Responses

For long-running tasks, we switched from monolithic responses to streaming updates:

# Old: Wait 5 minutes, generate 15,000-token report
# New: Stream 10 × 1,500-token updates

# Benefits:
# 1. Faster feedback (30s vs 5min to first update)
# 2. Early termination if task fails (save remaining tokens)
# 3. Better UX (progressive disclosure)

Measured savings: 12% of tasks fail early, saving ~5,000 tokens each = moderate cost/day

The Complete Cost Breakdown: Before vs After

Before Optimization

| Component | Tokens | Cost/Task | Daily Cost (120 tasks) |
|---|---|---|---|
| System prompt (bloated) | 8,500 | moderate cost | moderate cost |
| Task context | 2,000 | low cost | moderate cost |
| File context | 3,500 | low cost | moderate cost |
| RAG retrieval | 1,000 | low cost | moderate cost |
| Output (verbose) | 8,000 | moderate cost | moderate cost |
| Total | 23,000 | moderate cost | moderate cost |

Model: All Opus (moderate cost input / moderate cost output, per million tokens)

After Optimization

| Component | Tokens | Cost/Task | Daily Cost (120 tasks) |
|---|---|---|---|
| System prompt (CAG) | 2,600 | low cost | moderate cost |
| Task context | 1,500 | low cost | moderate cost |
| File context (filtered) | 2,000 | low cost | moderate cost |
| RAG/CAG (hybrid) | 200 | low cost | moderate cost |
| Output (structured) | 3,200 | low cost | moderate cost |
| Total | 9,500 | low cost | moderate cost |

Model Mix: 60% Haiku (low cost/moderate cost), 32% Sonnet (moderate cost/moderate cost), 8% Opus (moderate cost/moderate cost)
Cache hit rate: 83%

Total Savings

  • Before: moderate cost/day
  • After: moderate cost/day
  • Reduction: moderate cost/day (81.5%)
  • Monthly: moderate cost → moderate cost (moderate cost saved)
  • Annual: moderate cost → moderate cost (moderate cost saved)

Measuring Cost Per Task and ROI

To optimize further, we track cost per task type:

# coordination/metrics/model-selection.jsonl
{
  "timestamp": "2025-11-28T10:30:00Z",
  "task_id": "task-1234",
  "model": "claude-sonnet-4",
  "tier": "balanced",
  "complexity_score": 6,
  "input_tokens": 5200,
  "output_tokens": 2800,
  "cost_usd": 1.14,
  "duration_sec": 45,
  "success": true
}
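
The per-type averages below come from aggregating this log. A query in the same spirit, grouping by tier since the sample record above doesn't include a task_type field:

# Sketch: rolling averages per tier from the model-selection log.
jq -s '
  group_by(.tier) |
  map({
    tier: .[0].tier,
    tasks: length,
    avg_cost: ((map(.cost_usd) | add) / length),
    success_rate: ((map(select(.success)) | length) / length)
  })
' coordination/metrics/model-selection.jsonl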

Cost Per Task Type (7-Day Average)

| Task Type | Count | Avg Model | Avg Cost | Quality | ROI Score |
|---|---|---|---|---|---|
| Typo fix | 18 | Haiku | low cost | 98% | 9.2/10 |
| Comment addition | 24 | Haiku | low cost | 96% | 9.0/10 |
| Bug fix (simple) | 32 | Sonnet | moderate cost | 94% | 8.5/10 |
| Feature (small) | 28 | Sonnet | moderate cost | 92% | 8.2/10 |
| Feature (complex) | 12 | Opus | moderate cost | 96% | 7.8/10 |
| Security audit | 6 | Opus | moderate cost | 98% | 9.5/10 |

Key Insight: Security audits have the highest cost (moderate cost) but the best ROI (9.5/10) because failures are catastrophically expensive. Never cheap out on security.

The Cost of Over-Optimization: Quality Tradeoffs

Not all optimizations are worth it. We learned this the hard way.

Failed Experiment: Ultra-Compressed Prompts

Hypothesis: If 8,500 → 2,600 tokens worked, could we go to 1,000?

Result: Quality dropped from 94% → 78%. The savings weren’t worth it.

| Prompt Size | Cost/Task | Quality | Failure Rate | Rework Cost | Net Cost |
|---|---|---|---|---|---|
| 8,500 tokens | moderate cost | 96% | 4% | low cost | moderate cost |
| 2,600 tokens | low cost | 94% | 6% | low cost | low cost |
| 1,000 tokens | low cost | 78% | 22% | low cost | moderate cost |

Lesson: The 2,600-token sweet spot balances cost and quality. Going below that increases failure rates faster than it saves money.

The Haiku Trap

Mistake: Routing too many medium-complexity tasks to Haiku to save money.

Consequence:

  • Haiku completion rate: 89% on complexity-6 tasks
  • Sonnet completion rate: 94% on the same tasks
  • Failed Haiku tasks needed Sonnet rework: moderate cost (Sonnet rerun) + low cost (wasted Haiku spend) = moderate cost total

Optimal routing: Use Haiku only for complexity ≤4. Accept the higher upfront cost for reliability.

When to Choose Quality Over Cost

Some tasks justify premium models regardless of complexity:

  1. Security: Always Opus. A missed vulnerability costs far more in breach response than the model premium ever will.
  2. Architecture: Use Opus for foundational decisions that affect months of work.
  3. Customer-facing: Opus for user-visible features where quality matters.
  4. Critical path: Opus for blockers that delay other work.

Rule of thumb: If task failure costs >10x the model price difference, use the expensive model.
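
The arithmetic behind that rule of thumb is just expected cost: model price plus failure rate times rework. A sketch with purely illustrative numbers (they are inputs you measure for your own workload, not Cortex figures):

# Sketch: expected cost = model cost + failure_rate * rework cost.
expected_cost() {
  local model_cost="$1" failure_rate="$2" rework_cost="$3"
  echo "$model_cost + $failure_rate * $rework_cost" | bc -l
}

# If the cheap tier fails often enough, the "expensive" tier wins:
expected_cost 0.10 0.25 6.00   # cheap tier:    1.60
expected_cost 1.00 0.05 6.00   # stronger tier: 1.30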

Advanced Techniques: Self-Improving Costs

The final frontier: having the system optimize itself.

Auto-Learning from Task Outcomes

We built a feedback loop that learns from every task:

# llm-mesh/auto-learning/hq-outcome-collector.py
# Collects high-quality task outcomes (score ≥4.0, 10-600 sec duration)
# Trains fine-tuned models on successful patterns
# Auto-deploys with A/B testing (90/10 split)
# Promotes if quality improves by >0.2 and completion rate by >5%

Results after 1,000 training examples:

  • Fine-tuned Haiku matches base Sonnet quality on common tasks
  • Cost reduction: moderate cost → low cost (82% savings)
  • Applies to 24% of workload
  • Additional savings: moderate cost/day

Dynamic Model Selection

Instead of static complexity rules, we use learned routing:

# coordination/masters/coordinator/lib/moe-router.sh
# 5-layer cascade:
# 1. Keyword fast-path (<1ms)
# 2. Semantic embedding (10-50ms)
# 3. RAG retrieval (50-150ms)
# 4. PyTorch routing model (100-300ms)
# 5. Confidence check → route or clarify

# Routing accuracy: 94.5% (vs 87.5% keyword-only)
# Misrouting cost: moderate cost/day → moderate cost/day (87% reduction)

Practical Implementation Guide

Want to apply these techniques? Here’s the playbook:

Week 1: Baseline and Instrumentation

# 1. Add cost tracking to every LLM call
log_llm_call() {
  # Assumes the caller sets $model, $input_tokens, $output_tokens,
  # and the per-token rates $input_rate / $output_rate
  echo "{
    \"timestamp\": \"$(date -Iseconds)\",
    \"model\": \"$model\",
    \"input_tokens\": $input_tokens,
    \"output_tokens\": $output_tokens,
    \"cost_usd\": $(echo "$input_tokens * $input_rate + $output_tokens * $output_rate" | bc -l)
  }" >> metrics/llm-costs.jsonl
}

# 2. Analyze cost distribution
cat metrics/llm-costs.jsonl | jq -s '
  group_by(.model) |
  map({
    model: .[0].model,
    calls: length,
    total_cost: (map(.cost_usd) | add),
    avg_cost: (map(.cost_usd) | add) / length
  }) |
  sort_by(-.total_cost)
'

Week 2: Low-Hanging Fruit

# 1. Compress system prompts
# Remove: Examples, redundant explanations, verbose formatting
# Keep: Essential instructions, constraints, output format
# Target: 50-70% reduction

# 2. Implement structured outputs
# Replace: "Explain your reasoning in detail"
# With: JSON schema with maxLength constraints

# 3. Enable prompt caching
# Group requests with identical system prompts
# Batch process to maximize cache hits

Week 3: Model Tiering

# 1. Score historical task complexity (manual sample of 100)
# 2. Define tier boundaries based on cost/quality tradeoff
# 3. Implement routing logic
# 4. A/B test with 10% traffic to new routing
# 5. Monitor misrouting rate and quality metrics
# 6. Graduate to 100% if quality maintained
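
For step 4, the 10% split can be as simple as a random gate in the dispatcher. A sketch, where route_task_legacy stands in for whatever routing you use today:

# Sketch: send ~10% of traffic through the new router and log which path was used.
dispatch_task() {
  local task_description="$1"
  local model router
  if [ $((RANDOM % 100)) -lt 10 ]; then
    model="$(route_task_moe "$task_description")"
    router="moe"
  else
    model="$(route_task_legacy "$task_description")"
    router="legacy"
  fi
  echo "{\"router\": \"$router\", \"model\": \"$model\"}" >> metrics/routing-ab.jsonl
  echo "$model"
}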

Week 4: Iteration and Measurement

# Daily review:
# - Cost per task by type
# - Model distribution
# - Quality metrics (completion rate, rework rate)
# - Misrouting costs

# Weekly:
# - Identify high-cost task types
# - Analyze prompt efficiency
# - Review model tier boundaries
# - Test prompt variations

# Monthly:
# - Fine-tune models on collected data
# - Update routing logic with learned patterns
# - Recalibrate complexity scoring

The ROI of Prompt Engineering

Let’s talk return on investment.

Engineering time invested:

  • Week 1 (instrumentation): 8 hours
  • Week 2 (compression): 12 hours
  • Week 3 (model tiering): 16 hours
  • Week 4+ (iteration): 4 hours/week ongoing
  • Total first month: 60 hours

Engineer cost: moderate cost/hour × 60 = moderate cost

Savings achieved:

  • Month 1: moderate cost
  • Payback period: 17 days
  • Year 1 ROI: (annual savings minus engineering investment) / engineering investment ≈ 2,017%

Even if your savings are only 10% of ours, you'd still break even within about six months.

Key Takeaways

  1. Profile first, optimize second: 37% of our costs were in system prompts we sent on every request. You can’t optimize what you don’t measure.

  2. Model selection matters more than prompt length: Routing 60% of tasks to Haiku saved moderate cost/day. Compressing prompts saved moderate cost/day. Do both.

  3. Cache everything static: RAG is for dynamic data. Static knowledge belongs in pre-cached context (CAG). 95% faster, zero tokens.

  4. Structure your outputs: Free-form responses are expensive. JSON schemas with maxLength constraints cut output tokens 60%.

  5. Quality has a price: Don't over-optimize. The 2,600-token prompt is 3x smaller than the original but maintains 94% quality. Going to 1,000 tokens saves a little more per call but costs far more in rework.

  6. Security is non-negotiable: Always use your most powerful model for security tasks. A missed vulnerability costs 100x the model savings.

  7. Let the system learn: Fine-tuning on your workload produces specialist models that match larger models at 1/5th the cost.

  8. Measure ROI, not just cost: The cheapest solution isn’t always the most economical. Factor in rework, failures, and opportunity cost.

Conclusion

Reducing Cortex’s LLM costs from moderate cost/day to moderate cost/day wasn’t magic—it was methodical engineering:

  • Profile → Found system prompts were 37% of costs
  • Compress → CAG caching reduced prompts 69%
  • Route → Model tiering saved 72% with intelligent selection
  • Structure → JSON schemas cut output tokens 60%
  • Learn → Fine-tuning and self-optimization continue reducing costs

The result: moderate cost/year saved with improved quality (94% completion rate).

The techniques in this post apply whether you’re running a multi-agent system like Cortex or a simple chatbot. Start with instrumentation, find your biggest cost drivers, and optimize systematically.

Your LLM bill is a design choice, not a fixed cost.



All numbers in this post are from production Cortex deployments managing 12+ repositories with 120+ daily tasks. Your mileage may vary, but the techniques are universally applicable.

#Cortex #Cost Optimization #Prompt Engineering #Economics