Optimization Strategies

Model routing strategies — when to use Haiku, Sonnet, or Opus for cost efficiency

11 min read

You’re using Opus for every task — even formatting JSON and generating boilerplate. Sonnet handles those at 5x lower cost with identical quality, and model routing is the single biggest lever for cutting your API bill without sacrificing quality where it matters.

Sonnet costs one-fifth as much as Opus per token: $3/$15 per million input/output tokens versus $15/$75. Routing each task to the right model and effort level can cut your bill by 60-80% at scale.

For budget caps and team spending policies, see Budget Controls. For cache-level optimization, see Cache & Prompt Optimization below.

Model Selection Guide


| Task Type | Model | Why |
| --- | --- | --- |
| Code formatting, linting fixes | claude-sonnet-4-6 | Mechanical transforms; Sonnet handles these equally well |
| Test generation | claude-sonnet-4-6 | Pattern-following work with clear templates |
| Summarization, documentation | claude-sonnet-4-6 | Text synthesis where speed matters more than depth |
| Complex refactoring | claude-opus-4-6 | Cross-file reasoning with architectural implications |
| Bug diagnosis | claude-opus-4-6 | Requires deep reasoning about state and side effects |
| Architecture decisions | claude-opus-4-6 | Trade-off analysis requiring broad context understanding |

Switch models with the --model flag:

# Sonnet for routine tasks — ~$0.005/call
claude -p "Add JSDoc comments to this file" --model claude-sonnet-4-6
# Opus for complex reasoning — ~$0.016/call minimum
claude -p "Find the race condition in this auth flow" --model claude-opus-4-6
Try This

Compare model costs on the same task:

claude -p "Format this JSON: {a:1,b:2}" --output-format json | jq '{model: "opus", cost: .total_cost_usd}'

claude -p "Format this JSON: {a:1,b:2}" --model claude-sonnet-4-6 --output-format json | jq '{model: "sonnet", cost: .total_cost_usd}'

Is the Opus output better for this task? Or did you just pay 5x more for the same result?

Two-Pass Strategy

In CI pipelines, default to Sonnet and escalate to Opus only when Sonnet’s output fails validation or quality checks. A two-pass strategy — Sonnet drafts, Opus reviews only when needed — can cut pipeline costs by 60-80%.

#!/bin/bash
# two-pass.sh — Sonnet first, Opus only if needed

# Pass 1: Sonnet draft
RESULT=$(claude -p "Review this PR for issues" \
  --model claude-sonnet-4-6 \
  --output-format json \
  --max-budget-usd 0.25)
REVIEW=$(echo "$RESULT" | jq -r '.result')

# Check if Sonnet flagged anything complex
if echo "$REVIEW" | grep -qi "complex\|unclear\|needs deeper\|architecture"; then
  # Pass 2: Opus deep review
  RESULT=$(claude -p "Provide a thorough review of this PR, \
focusing on architectural concerns and edge cases" \
    --model claude-opus-4-6 \
    --output-format json \
    --max-budget-usd 1.00)
  REVIEW=$(echo "$RESULT" | jq -r '.result')
fi

echo "$REVIEW"

Cost Comparison

Two-Pass vs Single-Pass Cost

| Strategy | 100 PRs/day | Monthly Cost |
| --- | --- | --- |
| All Opus | 100 x $0.05 = $5/day | $110 |
| All Sonnet | 100 x $0.01 = $1/day | $22 |
| Two-pass (20% escalation) | 100 x $0.01 + 20 x $0.05 = $2/day | $44 |

The two-pass strategy costs 60% less than all-Opus while maintaining Opus-quality reviews on the PRs that need them.

Tip

In a wrapper script, set the default model based on the task category. Route formatting, linting, and boilerplate generation to Sonnet automatically. Reserve Opus for commands that explicitly request it or for tasks that require multi-file reasoning.
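A minimal sketch of such a wrapper, assuming a hypothetical set of category names and a cheap-by-default fallback (adapt the taxonomy to your own commands):

```bash
#!/bin/bash
# Hypothetical task-category router. Category names and the default are
# assumptions, not part of the CLI itself.
route_model() {
  case "$1" in
    format|lint|boilerplate|tests|docs) echo "claude-sonnet-4-6" ;;
    refactor|debug|architecture)        echo "claude-opus-4-6" ;;
    *)                                  echo "claude-sonnet-4-6" ;;  # default cheap
  esac
}

# Usage: MODEL=$(route_model lint)
#        claude -p "Fix the lint errors in src/auth.ts" --model "$MODEL"
```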

Effort Levels

The --effort flag controls how much thinking Claude does before responding. Lower effort means fewer output tokens, which directly reduces cost.

Effort Level Comparison

| Effort | Behavior | Best For | Relative Cost |
| --- | --- | --- | --- |
| low | Minimal thinking, short responses | Lookups, simple transforms, yes/no questions | ~0.5x |
| medium | Balanced thinking and output | Standard code generation, explanations | ~1x (baseline) |
| high | Extended reasoning, thorough output | Complex debugging, multi-step analysis | ~1.5-2x |
| max | Maximum reasoning depth | Hardest problems, architecture reviews | ~2-3x |

At scale, effort tuning adds up. If 70% of your team’s calls are simple lookups or formatting tasks, routing them through --effort low can cut total output token costs nearly in half for those calls:

# Low effort for simple tasks
claude -p "What is the return type of this function?" --effort low
# Default effort for standard work
claude -p "Write unit tests for the auth module"
# Max effort for the hardest problems
claude -p "Diagnose why this distributed lock fails under contention" --effort max
Note

Effort affects output token count, not input tokens. The system prompt overhead is the same regardless of effort level. For calls dominated by input cost (large context, many files), effort tuning has minimal impact. It matters most for calls that generate long responses.

Model + Effort Matrix

Combining model selection with effort levels gives you a fine-grained cost control matrix:

Model + Effort Cost Matrix

| Configuration | Use Case | Relative Cost |
| --- | --- | --- |
| Sonnet + low effort | Formatting, simple lookups | ~0.1x (cheapest) |
| Sonnet + medium effort | Standard code gen, docs | ~0.2x |
| Opus + low effort | Quick expert opinions | ~0.5x |
| Opus + medium effort | Standard complex tasks | ~1x (baseline) |
| Opus + max effort | Hardest reasoning tasks | ~2-3x (most expensive) |

The cheapest configuration (Sonnet + low effort) runs at roughly 1/20th to 1/30th the cost of the most expensive (Opus + max effort). For a team processing 1,000 calls/day, routing 70% to the cheapest tier can save thousands per month.


Cache & Prompt Optimization

Two back-to-back calls with identical system prompts should share a cache. They don’t — because your second call used a slightly different flag combination, invalidating the cache and paying full input token price twice.

Caching is automatic in Claude CLI, but how you structure your workload determines whether you get 90% savings or pay full price on every call. This section covers the caching strategies, batching patterns, and session management techniques that minimize your token costs at scale.

For budget caps and team policies, see Budget Controls.

How Caching Works

Every Claude CLI call includes a system prompt of 30,000-40,000+ tokens (tool definitions, safety rules, CLAUDE.md files). When two calls share the same system prompt, the second call reads from cache at one-tenth the input price instead of paying full price.

Cache pricing for Opus:

  • Full input: $15 per million tokens
  • Cache write: $18.75 per million tokens (1.25x full price — first call pays this)
  • Cache read: $1.50 per million tokens (0.1x full price — subsequent calls)

The savings compound: after the first call writes the cache, every subsequent call with the same system prompt gets a 90% discount on input tokens.

Cache Economics — Same Prompt, Three Calls: Call 1 pays the cache write at full price (~$0.17 in this example); Calls 2 and 3 reuse the cache at ~$0.02 each, a 90% discount on input tokens.
Try This

Measure your cache hit rate. Run the same prompt twice:

claude -p "Explain caching" --output-format json | jq '{cache_read: .usage.cache_read_input_tokens, cache_create: .usage.cache_creation_input_tokens, cost: .total_cost_usd}'

Run it again immediately. How much did cache_read_input_tokens increase on the second call? That increase represents tokens you’re reading at 90% off.

Batch Similar Prompts

Calls that share the same system prompt benefit from cache reads. Run similar tasks together rather than interleaving different workloads:

# Good: batch similar tasks — second call gets cache hits
claude -p "Review src/auth.ts for security issues" --output-format json
claude -p "Review src/auth.ts for performance issues" --output-format json
# Wasteful: interleave different system contexts
claude -p "Review src/auth.ts" --output-format json
claude -p "Format this JSON" --model claude-sonnet-4-6 --output-format json
claude -p "Review src/db.ts" --output-format json

The interleaved pattern forces cache eviction between calls. Grouping similar tasks keeps the cache hot.

Batch Processing Script

For large-scale batch operations, process all files of the same type together:

#!/bin/bash
# batch-review.sh — process all files in a single batch for cache efficiency

BATCH_DIR="src/api"
BUDGET_PER_FILE=0.25
TOTAL_COST=0

echo "Reviewing all TypeScript files in $BATCH_DIR..."

for file in "$BATCH_DIR"/*.ts; do
  RESULT=$(claude -p "Review $file for security issues" \
    --output-format json \
    --max-budget-usd "$BUDGET_PER_FILE" \
    --no-session-persistence \
    --permission-mode bypassPermissions)
  COST=$(echo "$RESULT" | jq -r '.total_cost_usd // 0')
  CACHE_READ=$(echo "$RESULT" | jq -r '.usage.cache_read_input_tokens // 0')
  CACHE_WRITE=$(echo "$RESULT" | jq -r '.usage.cache_creation_input_tokens // 0')
  echo "$file: \$$COST (cache read: $CACHE_READ, cache write: $CACHE_WRITE)"
  TOTAL_COST=$(echo "$TOTAL_COST + $COST" | bc)
done

echo "Total batch cost: \$$TOTAL_COST"

The first file in the batch pays the cache write cost. Subsequent files get cache reads at 10% of the price.

Maintain Sessions

Resuming a session is cheaper than starting fresh. Every resumed turn reads the accumulated context from cache at the 0.1x rate instead of re-ingesting it:

# Start a session
RESULT=$(claude -p "Analyze the codebase structure" --output-format json)
SESSION=$(echo "$RESULT" | jq -r '.session_id')
# Resume — context is cached, subsequent turns are cheaper
claude -p "Now focus on the API layer" --resume "$SESSION" --output-format json
claude -p "Suggest improvements" --resume "$SESSION" --output-format json

When to Resume vs. Start Fresh

Resume vs. Fresh Session

| Scenario | Best Approach | Why |
| --- | --- | --- |
| Follow-up questions on same topic | Resume | Cached context saves 90% on input tokens |
| Same task, different file | New session | Old context is irrelevant noise — wastes tokens |
| Long-running analysis (10+ turns) | Resume with budget | Accumulated context grows — set a budget to cap the total |
| Parallel independent tasks | Separate sessions | No shared context — each session runs independently |

The Cache Math

Understanding the numbers helps you make better decisions about session management:

Cache Impact on a 10-Turn Opus Session

| Scenario | Calculation | Cost |
| --- | --- | --- |
| Fresh call each time | 10 x 15K input tokens at $15/M | $2.25 |
| All cached | 1 cache write + 9 cache reads | $0.65 |
| Savings | | 71% |

For a team of 10 developers each running 50 calls per day, the difference between cached and uncached workloads is roughly $240/day — over $5,000/month in savings from session management alone.

Disable Tools When Not Needed

Tool descriptions inflate the system prompt. For pure text tasks, stripping them out reduces input tokens:

claude -p "Summarize this document" --tools ""

This removes tool definitions from the system prompt, trimming the per-call input token count. The savings are modest per call but compound across thousands of daily invocations.

Cache TTL Awareness

Gotcha

Cache entries have TTLs — 5 minutes for short-lived caches, 1 hour for extended caches. If you batch tasks with long gaps between them, the cache may expire and you pay full write cost again. Keep batch runs tight to maximize cache hits.
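For batch scripts with variable gaps, a small guard can flag when a cache is probably cold. This is a hypothetical helper (the TTL values are the 5-minute / 1-hour figures above; the function name is an assumption):

```bash
#!/bin/bash
# Hypothetical guard: given seconds elapsed since the last call, report
# whether the cache is likely still warm. Default TTL is the 5-minute
# short-lived cache; pass 3600 for the extended cache.
cache_likely_warm() {
  local elapsed=$1 ttl=${2:-300}
  if [ "$elapsed" -lt "$ttl" ]; then echo "warm"; else echo "expired"; fi
}

cache_likely_warm 120        # prints warm
cache_likely_warm 600        # prints expired
cache_likely_warm 600 3600   # extended 1-hour cache: prints warm
```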

Optimizing for Cache TTL

Structure your CI pipelines to group Claude calls together rather than spreading them across pipeline stages:

# Good: Claude calls grouped together (cache stays hot)
jobs:
  claude-review:
    steps:
      - run: claude -p "Lint check" --output-format json
      - run: claude -p "Security check" --output-format json
      - run: claude -p "Doc check" --output-format json

# Wasteful: Claude calls spread across stages (cache expires between)
jobs:
  lint:
    steps:
      - run: eslint .
      - run: claude -p "Lint check" --output-format json # cache write
  test:
    steps:
      - run: npm test
      - run: claude -p "Security check" --output-format json # cache expired, write again

Cache Monitoring

Check cache utilization in the JSON response to verify your optimization is working:

RESULT=$(claude -p "Review this file" --output-format json)

# Check cache metrics (default to 0 so missing fields don't break the math)
CACHE_READ=$(echo "$RESULT" | jq '.usage.cache_read_input_tokens // 0')
CACHE_WRITE=$(echo "$RESULT" | jq '.usage.cache_creation_input_tokens // 0')
INPUT=$(echo "$RESULT" | jq '.usage.input_tokens // 0')

echo "Cache read: $CACHE_READ tokens"
echo "Cache write: $CACHE_WRITE tokens"
echo "Uncached input: $INPUT tokens"

# Cache hit ratio
TOTAL=$((CACHE_READ + CACHE_WRITE + INPUT))
if [ "$TOTAL" -gt 0 ]; then
  RATIO=$(echo "scale=2; $CACHE_READ * 100 / $TOTAL" | bc)
  echo "Cache hit ratio: ${RATIO}%"
fi

A healthy production workload should show 80%+ cache hit ratio on input tokens. If you see high cache_creation_input_tokens on most calls, your batching strategy needs adjustment.

Next Steps

With both model routing and caching optimized, you have the key cost levers in place: Budget Controls limit spending, model optimization reduces per-call cost, and cache optimization reduces per-token cost. For the full cost picture including team benchmarks and startup overhead, see Cost at Scale.

Now Do This

Run claude -p "Hello" --output-format json | jq '.usage' twice in a row. Compare cache_read_input_tokens between the two responses — the second call should show significantly more cache reads. That difference is the 90% savings kicking in. Structure your workloads to maximize this.