Cutting Cortex LLM Costs by 90%: The Prompt Engineering Playbook
When we first launched Cortex—our autonomous multi-agent system for managing GitHub repositories—our daily LLM costs were unsustainably high. Through systematic optimization, we achieved a 90% cost reduction. This 10x improvement didn’t come from cutting features or quality. It came from ruthless prompt engineering, intelligent model selection, and architectural optimizations that make every token count.
Here’s the complete playbook, with real numbers and techniques you can apply to your own AI systems.
The Cost Crisis: Understanding the Baseline
Before optimization, Cortex was hemorrhaging tokens. Our initial architecture used a single powerful model (Claude Opus) for everything:
Early Architecture (Week 1)
- Model: Claude Opus 4 exclusively
- Average task: 15,000 input tokens + 8,000 output tokens
- Daily volume: 120 tasks/day
- Cost per task: High (premium model pricing)
- Daily cost: Baseline (100%)
The problem wasn’t the model choice—Claude Opus is brilliant. The problem was using a Ferrari to drive to the grocery store. Every task, regardless of complexity, paid the premium price.
Breaking Down the Cost Components
To optimize, we first needed to understand where money was going:
Token Distribution (Pre-Optimization)
| Category | Tokens/Task | % of Total | Relative Cost |
|---|---|---|---|
| System prompts | 8,500 | 37% | High |
| Task description | 2,000 | 9% | Low |
| Context (files, docs) | 3,500 | 15% | Medium |
| RAG retrieval | 1,000 | 4% | Low |
| Output generation | 8,000 | 35% | Very High |
| Total | 23,000 | 100% | Baseline |
The shocking revelation: 37% of our costs were system prompts. We were sending 8,500 tokens of instructions on every single request—most of which were irrelevant to the specific task at hand.
Optimization Strategy 1: Prompt Compression
Our first breakthrough came from applying aggressive compression to system prompts.
Technique: Context-Aware Augmentation (CAG)
Instead of sending massive prompts with every request, we pre-cache static knowledge:
Before (8,500 tokens):
# Development Master Agent - System Prompt
You are the Development Master in the cortex multi-agent system...
## Core Responsibilities
1. Development Planning & Architecture
- Break down features into implementable components
- Make architectural and design decisions
- Define coding standards and best practices
[... 6,000 more tokens of instructions ...]
## Worker Types
You can spawn the following workers:
1. feature-implementer
- Purpose: Implement new features
- When to use: New functionality needed
- Budget: 5,000 tokens
[... 1,500 more tokens of specifications ...]
After (2,600 tokens):
# Development Master v5.0 (CAG-optimized)
Role: Development strategist. Spawn workers, coordinate implementation.
CAG Cache (pre-loaded):
- Worker specs: See coordination/masters/development/cag-cache/
- Protocols: Worker spawn, handoff, result aggregation
- Quality gates: 80% test coverage, linting, type checking
For decisions: Access cached knowledge (10ms) vs RAG (200ms).
[... focused task-specific context ...]
Savings: 5,900 fewer input tokens on every single task.
At 120 tasks/day, that’s significant daily savings from this change alone.
Implementation: Template-Based Prompt System
We moved from hardcoded prompts to versioned templates:
# coordination/prompts/masters/development.md
**Version**: v5.0
**Token Budget**: 30,000 tokens (master) + 20,000 (worker pool)
## CAG Static Knowledge Cache
Location: coordination/masters/development/cag-cache/static-knowledge.json
Contains (~2,600 tokens):
- Worker Types (4 development workers)
- Coordination Protocol (spawn, handoff, aggregate)
- Common Patterns (simple_feature, bug_fix_cycle, complex_feature)
This template system enabled:
- Versioning: Track prompt changes like code (v4.0 → v5.0 → v5.1)
- Reuse: Single prompt serves 100+ tasks/day
- A/B Testing: Compare prompt versions with real metrics
- Rollback: Instant revert if new prompt degrades performance
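To make the versioning and rollback concrete, here is an illustrative sketch assuming one file per prompt version sitting next to the active template (the repo itself tracks the version in the template's Version header, so treat this layout as an assumption):
# Promote or roll back a master's prompt by repointing the active file
promote_prompt_version() {
  local master="$1" version="$2"   # e.g. promote_prompt_version development v5.1
  ln -sfn "versions/${master}-${version}.md" \
    "coordination/prompts/masters/${master}.md"
}
# Rolling back a bad prompt is the same call with the previous tag:
# promote_prompt_version development v5.0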
Optimization Strategy 2: Intelligent Model Selection
The second major optimization: stop using expensive models for simple tasks.
Model Tier System
We implemented a 4-tier routing system based on task complexity:
{
"fast": {
"models": ["claude-haiku"],
"complexity_range": [1, 4],
"cost_per_million": "low cost",
"use_cases": ["typo fixes", "code comments", "simple lookups"]
},
"balanced": {
"models": ["claude-sonnet-4"],
"complexity_range": [5, 7],
"cost_per_million": "moderate",
"use_cases": ["feature implementation", "bug fixes", "testing"]
},
"powerful": {
"models": ["claude-opus-4"],
"complexity_range": [8, 10],
"cost_per_million": "moderate cost",
"use_cases": ["security audits", "architecture reviews", "complex refactoring"]
},
"local": {
"models": ["llama2-70b"],
"cost_per_million": "moderate cost",
"use_cases": ["sensitive data", "PII handling", "high-volume batch"]
}
}
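To show how the tier map gets used, here is a hedged lookup sketch assuming the JSON above is saved as model-tiers.json (the filename is an assumption); it returns the first model whose complexity range covers a given score:
select_model_for_complexity() {
  local complexity="$1"   # 1-10 score, e.g. from the scorer in the next section
  jq -r --argjson c "$complexity" '
    to_entries[]
    | select(.value.complexity_range != null
             and .value.complexity_range[0] <= $c
             and $c <= .value.complexity_range[1])
    | .value.models[0]
  ' model-tiers.json
}
# select_model_for_complexity 3   ->  claude-haiku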
Complexity Scoring Algorithm
We built a simple but effective complexity scorer:
score_task_complexity() {
  local task_description="$1"   # task text passed in by the caller
  local score=5                 # Base score (1-10)
  # High complexity indicators (+1 each)
  local high_keywords="security vulnerability exploit cve audit
    architecture performance optimization distributed
    migration refactor compliance encryption"
  # Low complexity indicators (-1 each)
  local low_keywords="simple basic quick minor typo format style comment"
  local word
  for word in $high_keywords; do
    echo "$task_description" | grep -qiw "$word" && score=$((score + 1))
  done
  for word in $low_keywords; do
    echo "$task_description" | grep -qiw "$word" && score=$((score - 1))
  done
  # Clamp to 1-10 and return the complexity score
  [ "$score" -lt 1 ] && score=1
  [ "$score" -gt 10 ] && score=10
  echo "$score"
}
Real Task Distribution (Week of Nov 25-Dec 1):
| Model | Tasks | Avg Tokens | Cost/Task | Daily Cost |
|---|---|---|---|---|
| Haiku | 72 (60%) | 8,000 | Low | Moderate |
| Sonnet | 38 (32%) | 12,000 | Moderate | Moderate |
| Opus | 10 (8%) | 18,000 | High | Moderate |
| Total | 120 | 10,167 avg | Moderate (blended) | Moderate |
Compare to baseline (all Opus):
- Old: 120 tasks/day, every one at Opus pricing
- New: the blended tier mix above
- Savings: roughly 72% of the daily model spend
The Economics of Task Routing
Here’s the critical insight: 60% of tasks can be handled by Haiku at 1/15th the cost of Opus. The key is accurate routing.
Our MoE (Mixture of Experts) router achieves 94.5% routing accuracy using semantic embeddings:
# coordination/masters/coordinator/lib/moe-router.sh
route_task_moe() {
  # 1. Complexity scoring (1-10)
  local complexity=$(score_task_complexity "$task_description")
  # 2. Sensitivity detection (none|low|medium|high)
  local sensitivity=$(detect_task_sensitivity "$task_description")
  # 3. Model recommendation
  if [ "$complexity" -ge 8 ] || [ "$sensitivity" = "high" ]; then
    model="claude-opus-4"     # Powerful tier
  elif [ "$complexity" -le 4 ]; then
    model="claude-haiku"      # Fast tier
  else
    model="claude-sonnet-4"   # Balanced tier
  fi
}
Routing accuracy matters:
- Misrouting a simple task to Opus: you pay the Opus premium for work Haiku could have done
- Misrouting a complex task to Haiku: you risk a quality failure plus the full cost of rework
- Net effect: even a 5% routing error rate adds a meaningful amount to the daily bill
Optimization Strategy 3: Context Window Management
The third optimization: minimize what you send.
RAG vs. CAG: The 95% Speed Improvement
Traditional RAG (Retrieval Augmented Generation) was killing us with redundant disk reads:
Old RAG Approach:
# Every task execution:
1. Read worker-types.json from disk (200ms)
2. Embed query (50ms)
3. Search vector DB (100ms)
4. Load relevant docs (150ms)
Total: 500ms per task, plus tokens for retrieved content
New CAG (Context-Aware Augmentation) Approach:
# At agent initialization:
1. Pre-load static knowledge into system prompt
2. Cache in agent context for session lifetime
# Per task:
1. Access from cached context (10ms)
Total: 10ms per task, zero additional tokens
Savings:
- Latency: 500ms → 10ms (98% faster)
- Tokens: 2,000 RAG tokens → 0 tokens
- Cost: the per-task retrieval cost effectively drops to zero
- At 120 tasks/day, those retrieval savings add up quickly
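For concreteness, a minimal sketch of the CAG side, assuming the static-knowledge.json cache described earlier (the helper name is illustrative): the knowledge is read once at agent start-up and becomes part of the fixed system-prompt prefix, so individual tasks never trigger retrieval.
# At agent initialization: read the static cache once and keep it in memory
CAG_CACHE="$(cat coordination/masters/development/cag-cache/static-knowledge.json)"

# Per task: assemble the prompt from memory; no disk read, no vector search
build_system_prompt() {
  local task_context="$1"
  printf '%s\n\nCAG Cache (pre-loaded):\n%s\n' "$task_context" "$CAG_CACHE"
}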
When to Use Each Approach
| Scenario | Method | Reason |
|---|---|---|
| Static worker specs | CAG | Never changes, load once |
| Recent task history | RAG | Dynamic, need latest |
| Code patterns | RAG | Growing knowledge base |
| Coordination protocols | CAG | Stable procedures |
| Repository inventory | Hybrid | Cache structure, RAG for details |
Optimization Strategy 4: Caching and Reuse
The fourth pillar: never compute the same thing twice.
Prompt Caching (Anthropic Feature)
Claude supports prompt caching for repeated prefixes. We exploit this aggressively:
# Every request includes the same 2,600-token system prompt prefix
# First request in a burst: pay the normal input price for the prefix
# (plus a small cache-write surcharge)
# Requests within the ~5-minute cache window: cached prefix reads are billed
# at roughly a 90% discount
#
# With 120 tasks/day clustered in bursts:
# - ~20 cache misses pay the full prefix price
# - ~100 cache hits pay the discounted read price
# Net effect: the blended prefix cost is a small fraction of running uncached
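On the API side, a hedged sketch of how the caching is requested: the shared prefix is marked with cache_control so requests inside the window reuse it. The model id follows this post's naming, and the example task is a placeholder.
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d @- <<EOF
{
  "model": "claude-sonnet-4",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": $(jq -Rs . < coordination/prompts/masters/development.md),
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [{"role": "user", "content": "Fix the typo in README.md"}]
}
EOF
The cached prefix has to exceed the model's minimum cacheable length (on the order of a thousand tokens), which the 2,600-token prompt comfortably clears.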
Template Reuse Patterns
We identified and templatized common patterns:
{
"simple_feature": {
"complexity": 5,
"workers": ["implementation-worker"],
"estimated_tokens": 8000,
"estimated_cost": "low cost"
},
"bug_fix_cycle": {
"complexity": 6,
"workers": ["fix-worker", "test-worker"],
"estimated_tokens": 12000,
"estimated_cost": "moderate cost"
},
"complex_feature": {
"complexity": 8,
"workers": ["implementation-worker", "test-worker", "review-worker"],
"estimated_tokens": 25000,
"estimated_cost": "moderate cost"
}
}
Impact: 80% of tasks match a template, reducing prompt engineering overhead and token variance.
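A small lookup sketch, assuming the patterns above are cached next to the other static knowledge in a patterns.json file (that path is an assumption):
load_pattern() {
  local pattern="$1"   # e.g. "bug_fix_cycle"
  jq -r --arg p "$pattern" \
    '.[$p] | "workers=\(.workers | join(",")) token_budget=\(.estimated_tokens)"' \
    coordination/masters/development/cag-cache/patterns.json
}
# load_pattern bug_fix_cycle  ->  workers=fix-worker,test-worker token_budget=12000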
Optimization Strategy 5: Output Compression
The final optimization: generate less, but better.
Structured Output Requirements
Before: Free-form responses averaging 8,000 tokens with lots of filler.
After: Strict JSON schemas with exactly what we need.
// coordination/schemas/worker-result.json
{
"status": "completed" | "failed",
"summary": "string (max 200 chars)",
"changes": [
{
"file": "path/to/file",
"action": "created" | "modified" | "deleted",
"reasoning": "string (max 100 chars)"
}
],
"tests": {
"coverage": "number",
"passed": "number",
"failed": "number"
}
}
Result: Average output dropped from 8,000 → 3,200 tokens (60% reduction).
Savings: 4,800 fewer output tokens per task across 120 tasks/day, and output tokens are the most expensive tokens you buy.
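A hedged enforcement sketch: reject a worker result that violates the schema's caps before accepting it (the field names come from the schema above; the helper itself is illustrative):
validate_worker_result() {
  # Non-zero exit if the status enum or the length limits are violated
  jq -e '
    (.status | IN("completed", "failed"))
    and ((.summary | length) <= 200)
    and (all(.changes[]; (.reasoning | length) <= 100))
  ' "$1" > /dev/null
}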
Incremental Responses
For long-running tasks, we switched from monolithic responses to streaming updates:
# Old: Wait 5 minutes, generate 15,000-token report
# New: Stream 10 × 1,500-token updates
# Benefits:
# 1. Faster feedback (30s vs 5min to first update)
# 2. Early termination if task fails (save remaining tokens)
# 3. Better UX (progressive disclosure)
Measured savings: 12% of tasks fail early, saving ~5,000 tokens each.
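A hedged sketch of the early-termination half of this, assuming a helper that emits one JSON status update per line while the worker streams (both helper names are hypothetical):
stream_worker_updates "$task_id" | while read -r update; do
  status="$(echo "$update" | jq -r '.status')"
  if [ "$status" = "failed" ]; then
    # Stop generation as soon as the task is known to be dead,
    # so the remaining output tokens are never generated or billed
    cancel_worker "$task_id"
    break
  fi
done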
The Complete Cost Breakdown: Before vs After
Before Optimization
| Component | Tokens | Cost/Task | Daily Cost (120 tasks) |
|---|---|---|---|
| System prompt (bloated) | 8,500 | High | High |
| Task context | 2,000 | Low | Low |
| File context | 3,500 | Medium | Medium |
| RAG retrieval | 1,000 | Low | Low |
| Output (verbose) | 8,000 | Very High | Very High |
| Total | 23,000 | Baseline | Baseline (100%) |
Model: all Opus (premium input and output pricing per million tokens)
After Optimization
| Component | Tokens | Cost/Task | Daily Cost (120 tasks) |
|---|---|---|---|
| System prompt (CAG) | 2,600 | Low | Low |
| Task context | 1,500 | Low | Low |
| File context (filtered) | 2,000 | Low | Low |
| RAG/CAG (hybrid) | 200 | Minimal | Minimal |
| Output (structured) | 3,200 | Moderate | Moderate |
| Total | 9,500 | ~18.5% of baseline | ~18.5% of baseline |
Model Mix: 60% Haiku, 32% Sonnet, 8% Opus. Cache hit rate: 83%
Total Savings
- Daily spend: down 81.5% from the all-Opus baseline
- Monthly and annual savings scale directly with that daily reduction
- The self-improvement work described later closes most of the remaining gap to the headline 90%
Measuring Cost Per Task and ROI
To optimize further, we track cost per task type:
# coordination/metrics/model-selection.jsonl
{
"timestamp": "2025-11-28T10:30:00Z",
"task_id": "task-1234",
"model": "claude-sonnet-4",
"tier": "balanced",
"complexity_score": 6,
"input_tokens": 5200,
"output_tokens": 2800,
"cost_usd": 1.14,
"duration_sec": 45,
"success": true
}
Cost Per Task Type (7-Day Average)
| Task Type | Count | Avg Model | Avg Cost | Quality | ROI Score |
|---|---|---|---|---|---|
| Typo fix | 18 | Haiku | Low | 98% | 9.2/10 |
| Comment addition | 24 | Haiku | Low | 96% | 9.0/10 |
| Bug fix (simple) | 32 | Sonnet | Moderate | 94% | 8.5/10 |
| Feature (small) | 28 | Sonnet | Moderate | 92% | 8.2/10 |
| Feature (complex) | 12 | Opus | High | 96% | 7.8/10 |
| Security audit | 6 | Opus | High | 98% | 9.5/10 |
Key Insight: Security audits are the most expensive task type but have the best ROI (9.5/10) because failures are catastrophically expensive. Never cheap out on security.
The Cost of Over-Optimization: Quality Tradeoffs
Not all optimizations are worth it. We learned this the hard way.
Failed Experiment: Ultra-Compressed Prompts
Hypothesis: If 8,500 → 2,600 tokens worked, could we go to 1,000?
Result: Quality dropped from 94% → 78%. The savings weren’t worth it.
| Prompt Size | Cost/Task | Quality | Failure Rate | Rework Cost | Net Cost |
|---|---|---|---|---|---|
| 8,500 tokens | Highest | 96% | 4% | Low | Moderate |
| 2,600 tokens | Mid | 94% | 6% | Low | Lowest |
| 1,000 tokens | Lowest | 78% | 22% | High | Moderate (rework-driven) |
Lesson: The 2,600-token sweet spot balances cost and quality. Going below that increases failure rates faster than it saves money.
The Haiku Trap
Mistake: Routing too many medium-complexity tasks to Haiku to save money.
Consequence:
- Haiku completion rate: 89% on complexity-6 tasks
- Sonnet completion rate: 94% on same tasks
- Failed Haiku tasks required Sonnet rework, so we paid twice: once for the wasted Haiku attempt and again for the Sonnet retry
Optimal routing: Use Haiku only for complexity ≤4. Accept the higher upfront cost for reliability.
When to Choose Quality Over Cost
Some tasks justify premium models regardless of complexity:
- Security: Always Opus. A missed vulnerability costs orders of magnitude more in breach response than you will ever save on model fees.
- Architecture: Use Opus for foundational decisions that affect months of work.
- Customer-facing: Opus for user-visible features where quality matters.
- Critical path: Opus for blockers that delay other work.
Rule of thumb: If task failure costs >10x the model price difference, use the expensive model.
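That rule is just an expected-cost comparison; a sketch with illustrative names (the probabilities and costs come from your own metrics):
# Prefer the premium model when the expected cost of a cheap-model failure
# exceeds the price difference between the two models
should_use_premium() {
  local p_fail_cheap="$1" failure_cost="$2" price_diff="$3"
  [ "$(echo "$p_fail_cheap * $failure_cost > $price_diff" | bc -l)" -eq 1 ]
}
The 10x heuristic is the special case where the cheap model fails roughly 10% of the time.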
Advanced Techniques: Self-Improving Costs
The final frontier: having the system optimize itself.
Auto-Learning from Task Outcomes
We built a feedback loop that learns from every task:
# llm-mesh/auto-learning/hq-outcome-collector.py
# Collects high-quality task outcomes (score ≥4.0, 10-600 sec duration)
# Trains fine-tuned models on successful patterns
# Auto-deploys with A/B testing (90/10 split)
# Promotes if quality improves >0.2 and completion rate >5%
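A hedged sketch of that promotion gate (metric names are assumptions; the thresholds are the ones quoted above, reading ">5%" as a five-point gain in completion rate):
should_promote_finetune() {
  local quality_delta="$1" completion_rate_delta="$2"   # B-arm minus A-arm
  awk -v q="$quality_delta" -v c="$completion_rate_delta" \
    'BEGIN { exit !(q > 0.2 && c > 0.05) }'
}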
Results after 1,000 training examples:
- Fine-tuned Haiku matches base Sonnet quality on common tasks
- Cost reduction: 82% per task on the workload the fine-tuned model covers
- Applies to 24% of workload
- Additional daily savings on top of the routing and caching gains
Dynamic Model Selection
Instead of static complexity rules, we use learned routing:
# coordination/masters/coordinator/lib/moe-router.sh
# 5-layer cascade:
# 1. Keyword fast-path (<1ms)
# 2. Semantic embedding (10-50ms)
# 3. RAG retrieval (50-150ms)
# 4. PyTorch routing model (100-300ms)
# 5. Confidence check → route or clarify
# Routing accuracy: 94.5% (vs 87.5% keyword-only)
# Misrouting cost: reduced by 87% versus keyword-only routing
Practical Implementation Guide
Want to apply these techniques? Here’s the playbook:
Week 1: Baseline and Instrumentation
# 1. Add cost tracking to every LLM call
log_llm_call() {
  echo "{
    \"timestamp\": \"$(date -Iseconds)\",
    \"model\": \"$model\",
    \"input_tokens\": $input_tokens,
    \"output_tokens\": $output_tokens,
    \"cost_usd\": $(echo "$input_tokens * $input_rate + $output_tokens * $output_rate" | bc -l)
  }" >> metrics/llm-costs.jsonl
}
# 2. Analyze cost distribution
cat metrics/llm-costs.jsonl | jq -s '
group_by(.model) |
map({
model: .[0].model,
calls: length,
total_cost: (map(.cost_usd) | add),
avg_cost: ((map(.cost_usd) | add) / length)
}) |
sort_by(-.total_cost)
'
Week 2: Low-Hanging Fruit
# 1. Compress system prompts
# Remove: Examples, redundant explanations, verbose formatting
# Keep: Essential instructions, constraints, output format
# Target: 50-70% reduction
# 2. Implement structured outputs
# Replace: "Explain your reasoning in detail"
# With: JSON schema with maxLength constraints
# 3. Enable prompt caching
# Group requests with identical system prompts
# Batch process to maximize cache hits
Week 3: Model Tiering
# 1. Score historical task complexity (manual sample of 100)
# 2. Define tier boundaries based on cost/quality tradeoff
# 3. Implement routing logic
# 4. A/B test with 10% traffic to new routing
# 5. Monitor misrouting rate and quality metrics
# 6. Graduate to 100% if quality maintained
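A sketch of steps 3 and 4, assuming the route_task_moe function shown earlier and an all-Opus legacy default (the log path is illustrative):
choose_model() {
  local task_description="$1" arm
  if [ $((RANDOM % 100)) -lt 10 ]; then
    arm="moe"
    route_task_moe              # sets $model; reads $task_description (see router above)
  else
    arm="legacy"
    model="claude-opus-4"       # pre-tiering behavior: default model for everything
  fi
  echo "{\"arm\": \"$arm\", \"model\": \"$model\"}" >> metrics/routing-ab.jsonl
}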
Week 4: Iteration and Measurement
# Daily review:
# - Cost per task by type
# - Model distribution
# - Quality metrics (completion rate, rework rate)
# - Misrouting costs
# Weekly:
# - Identify high-cost task types
# - Analyze prompt efficiency
# - Review model tier boundaries
# - Test prompt variations
# Monthly:
# - Fine-tune models on collected data
# - Update routing logic with learned patterns
# - Recalibrate complexity scoring
The ROI of Prompt Engineering
Let’s talk return on investment.
Engineering time invested:
- Week 1 (instrumentation): 8 hours
- Week 2 (compression): 12 hours
- Week 3 (model tiering): 16 hours
- Week 4+ (iteration): 4 hours/week ongoing
- Total first month: 60 hours
Engineer cost: 60 hours at a fully loaded engineering rate
Savings achieved:
- Month 1: savings comfortably exceeded the engineering cost
- Payback period: 17 days
- Year 1 ROI: roughly 2,017% on the engineering time invested
Even if your savings are 10% of ours, you’d still break even in 2 months.
Key Takeaways
- Profile first, optimize second: 37% of our costs were in system prompts we sent on every request. You can’t optimize what you don’t measure.
- Model selection matters more than prompt length: Routing 60% of tasks to Haiku saved more per day than prompt compression did on its own. Do both.
- Cache everything static: RAG is for dynamic data. Static knowledge belongs in pre-cached context (CAG). 95% faster, zero tokens.
- Structure your outputs: Free-form responses are expensive. JSON schemas with maxLength constraints cut output tokens 60%.
- Quality has a price: Don’t over-optimize. The 2,600-token prompt is 3x smaller than the original but maintains 94% quality. Going to 1,000 tokens saves a little on prompts and costs more than that in rework.
- Security is non-negotiable: Always use your most powerful model for security tasks. A missed vulnerability costs 100x the model savings.
- Let the system learn: Fine-tuning on your workload produces specialist models that match larger models at 1/5th the cost.
- Measure ROI, not just cost: The cheapest solution isn’t always the most economical. Factor in rework, failures, and opportunity cost.
Conclusion
Reducing Cortex’s daily LLM costs by 90% wasn’t magic. It was methodical engineering:
- Profile → Found system prompts were 37% of costs
- Compress → CAG caching reduced prompts 69%
- Route → Model tiering saved 72% with intelligent selection
- Structure → JSON schemas cut output tokens 60%
- Learn → Fine-tuning and self-optimization continue reducing costs
The result: roughly 90% of our annual LLM spend saved, with improved quality (94% completion rate).
The techniques in this post apply whether you’re running a multi-agent system like Cortex or a simple chatbot. Start with instrumentation, find your biggest cost drivers, and optimize systematically.
Your LLM bill is a design choice, not a fixed cost.
Resources
- Cortex on GitHub - Open source multi-agent system
- Model tier configuration - Our complete routing logic
- Prompt templates - Versioned system prompts
- Cost tracking implementation - Full instrumentation code
All numbers in this post are from production Cortex deployments managing 12+ repositories with 120+ daily tasks. Your mileage may vary, but the techniques are universally applicable.