Optimization Strategies

Model routing strategies — when to use Haiku, Sonnet, or Opus for cost efficiency

11 min read

You’re using Opus for every task — even formatting JSON and generating boilerplate. Sonnet handles those at 5x lower cost with identical quality, and model routing is the single biggest lever for cutting your API bill without sacrificing quality where it matters.

Sonnet costs one-fifth as much as Opus per token: $3/$15 per million input/output tokens versus $15/$75. Routing each task to the right model and effort level can cut your bill by 60-80% at scale.

For budget caps and team spending policies, see Budget Controls. For cache-level optimization, see Cache & Prompt Optimization below.

Model Selection Guide


| Task Type | Model | Why |
| --- | --- | --- |
| Code formatting, linting fixes | claude-sonnet-4-6 | Mechanical transforms; Sonnet handles these equally well |
| Test generation | claude-sonnet-4-6 | Pattern-following work with clear templates |
| Summarization, documentation | claude-sonnet-4-6 | Text synthesis where speed matters more than depth |
| Complex refactoring | claude-opus-4-6 | Cross-file reasoning with architectural implications |
| Bug diagnosis | claude-opus-4-6 | Requires deep reasoning about state and side effects |
| Architecture decisions | claude-opus-4-6 | Trade-off analysis requiring broad context understanding |

Switch models with the --model flag:

# Sonnet for routine tasks — ~$0.005/call
claude -p "Add JSDoc comments to this file" --model claude-sonnet-4-6
# Opus for complex reasoning — ~$0.016/call minimum
claude -p "Find the race condition in this auth flow" --model claude-opus-4-6
Try This

Compare model costs on the same task:

claude -p "Format this JSON: {a:1,b:2}" --output-format json | jq '{model: "opus", cost: .total_cost_usd}'

claude -p "Format this JSON: {a:1,b:2}" --model claude-sonnet-4-6 --output-format json | jq '{model: "sonnet", cost: .total_cost_usd}'

Is the Opus output better for this task? Or did you just pay 5x more for the same result?

Two-Pass Strategy

In CI pipelines, default to Sonnet and escalate to Opus only when Sonnet’s output fails validation or quality checks. A two-pass strategy — Sonnet drafts, Opus reviews only when needed — can cut pipeline costs by 60-80%.

#!/bin/bash
# two-pass.sh — Sonnet first, Opus only if needed

# Pass 1: Sonnet draft
RESULT=$(claude -p "Review this PR for issues" \
  --model claude-sonnet-4-6 \
  --output-format json \
  --max-budget-usd 0.25)
REVIEW=$(echo "$RESULT" | jq -r '.result')

# Check if Sonnet flagged anything complex
if echo "$REVIEW" | grep -qi "complex\|unclear\|needs deeper\|architecture"; then
  # Pass 2: Opus deep review
  RESULT=$(claude -p "Provide a thorough review of this PR, \
focusing on architectural concerns and edge cases" \
    --model claude-opus-4-6 \
    --output-format json \
    --max-budget-usd 1.00)
  REVIEW=$(echo "$RESULT" | jq -r '.result')
fi

echo "$REVIEW"

Cost Comparison

Two-Pass vs Single-Pass Cost

| Strategy | 100 PRs/day | Monthly Cost |
| --- | --- | --- |
| All Opus | 100 x $0.05 = $5/day | $110 |
| All Sonnet | 100 x $0.01 = $1/day | $22 |
| Two-pass (20% escalation) | 100 x $0.01 + 20 x $0.05 = $2/day | $44 |

The two-pass strategy costs 60% less than all-Opus while maintaining Opus-quality reviews on the PRs that need them.

Tip

In a wrapper script, set the default model based on the task category. Route formatting, linting, and boilerplate generation to Sonnet automatically. Reserve Opus for commands that explicitly request it or for tasks that require multi-file reasoning.
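A minimal sketch of such a wrapper, assuming a hypothetical set of category names and a cheap-by-default fallback (adapt the taxonomy to your own commands):

```bash
#!/bin/bash
# Hypothetical task-category router. Category names and the default are
# assumptions, not part of the CLI itself.
route_model() {
  case "$1" in
    format|lint|boilerplate|tests|docs) echo "claude-sonnet-4-6" ;;
    refactor|debug|architecture)        echo "claude-opus-4-6" ;;
    *)                                  echo "claude-sonnet-4-6" ;;  # default cheap
  esac
}

# Usage: MODEL=$(route_model lint)
#        claude -p "Fix the lint errors in src/auth.ts" --model "$MODEL"
```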

Effort Levels

The --effort flag controls how much thinking Claude does before responding. Lower effort means fewer output tokens, which directly reduces cost.

Effort Level Comparison

| Effort | Behavior | Best For | Relative Cost |
| --- | --- | --- | --- |
| low | Minimal thinking, short responses | Lookups, simple transforms, yes/no questions | ~0.5x |
| medium | Balanced thinking and output | Standard code generation, explanations | ~1x (baseline) |
| high | Extended reasoning, thorough output | Complex debugging, multi-step analysis | ~1.5-2x |
| max | Maximum reasoning depth | Hardest problems, architecture reviews | ~2-3x |

At scale, effort tuning adds up. If 70% of your team’s calls are simple lookups or formatting tasks, routing them through --effort low can cut total output token costs nearly in half for those calls:

# Low effort for simple tasks
claude -p "What is the return type of this function?" --effort low
# Default effort for standard work
claude -p "Write unit tests for the auth module"
# Max effort for the hardest problems
claude -p "Diagnose why this distributed lock fails under contention" --effort max
Note

Effort affects output token count, not input tokens. The system prompt overhead is the same regardless of effort level. For calls dominated by input cost (large context, many files), effort tuning has minimal impact. It matters most for calls that generate long responses.

Model + Effort Matrix

Combining model selection with effort levels gives you a fine-grained cost control matrix:

Model + Effort Cost Matrix

| Configuration | Use Case | Relative Cost |
| --- | --- | --- |
| Sonnet + low effort | Formatting, simple lookups | ~0.1x (cheapest) |
| Sonnet + medium effort | Standard code gen, docs | ~0.2x |
| Opus + low effort | Quick expert opinions | ~0.5x |
| Opus + medium effort | Standard complex tasks | ~1x (baseline) |
| Opus + max effort | Hardest reasoning tasks | ~2-3x (most expensive) |

The cheapest configuration (Sonnet + low effort) runs at roughly 1/20th to 1/30th the cost of the most expensive (Opus + max effort). For a team processing 1,000 calls/day, routing 70% to the cheapest tier can save thousands per month.


Cache & Prompt Optimization

Two back-to-back calls with identical system prompts should share a cache. They don’t — because your second call used a slightly different flag combination, invalidating the cache and paying full input token price twice.

Caching is automatic in Claude CLI, but how you structure your workload determines whether you get 90% savings or pay full price on every call. This section covers the caching strategies, batching patterns, and session management techniques that minimize your token costs at scale.

For budget caps and team policies, see Budget Controls.

How Caching Works

Every Claude CLI call includes a system prompt of 30,000-40,000+ tokens (tool definitions, safety rules, CLAUDE.md files). When two calls share the same system prompt, the second call reads from cache at one-tenth the input price instead of paying full price.

Cache pricing for Opus:

  • Full input: $15 per million tokens
  • Cache write: $18.75 per million tokens (1.25x full price — first call pays this)
  • Cache read: $1.50 per million tokens (0.1x full price — subsequent calls)

The savings compound: after the first call writes the cache, every subsequent call with the same system prompt gets a 90% discount on input tokens.

Cache Economics — Same Prompt, Three Calls: Call 1 pays the cache write at full price (~$0.17 in this example); Calls 2 and 3 reuse the cache at ~$0.02 each, a 90% discount on input tokens.
Try This

Measure your cache hit rate. Run the same prompt twice:

claude -p "Explain caching" --output-format json | jq '{cache_read: .usage.cache_read_input_tokens, cache_create: .usage.cache_creation_input_tokens, cost: .total_cost_usd}'

Run it again immediately. How much did cache_read_input_tokens increase on the second call? That increase represents tokens you’re reading at 90% off.

Batch Similar Prompts

Calls that share the same system prompt benefit from cache reads. Run similar tasks together rather than interleaving different workloads:

# Good: batch similar tasks — second call gets cache hits
claude -p "Review src/auth.ts for security issues" --output-format json
claude -p "Review src/auth.ts for performance issues" --output-format json
# Wasteful: interleave different system contexts
claude -p "Review src/auth.ts" --output-format json
claude -p "Format this JSON" --model claude-sonnet-4-6 --output-format json
claude -p "Review src/db.ts" --output-format json

The interleaved pattern forces cache eviction between calls. Grouping similar tasks keeps the cache hot.

Batch Processing Script

For large-scale batch operations, process all files of the same type together:

#!/bin/bash
# batch-review.sh — process all files in a single batch for cache efficiency

BATCH_DIR="src/api"
BUDGET_PER_FILE=0.25
TOTAL_COST=0

echo "Reviewing all TypeScript files in $BATCH_DIR..."

for file in "$BATCH_DIR"/*.ts; do
  RESULT=$(claude -p "Review $file for security issues" \
    --output-format json \
    --max-budget-usd "$BUDGET_PER_FILE" \
    --no-session-persistence \
    --permission-mode bypassPermissions)
  COST=$(echo "$RESULT" | jq -r '.total_cost_usd // 0')
  CACHE_READ=$(echo "$RESULT" | jq -r '.usage.cache_read_input_tokens // 0')
  CACHE_WRITE=$(echo "$RESULT" | jq -r '.usage.cache_creation_input_tokens // 0')
  echo "$file: \$$COST (cache read: $CACHE_READ, cache write: $CACHE_WRITE)"
  TOTAL_COST=$(echo "$TOTAL_COST + $COST" | bc)
done

echo "Total batch cost: \$$TOTAL_COST"

The first file in the batch pays the cache write cost. Subsequent files get cache reads at 10% of the price.

Maintain Sessions

Resuming a session is cheaper than starting fresh. Every resumed turn reads the accumulated context from cache at the 0.1x rate instead of re-ingesting it:

# Start a session
RESULT=$(claude -p "Analyze the codebase structure" --output-format json)
SESSION=$(echo "$RESULT" | jq -r '.session_id')
# Resume — context is cached, subsequent turns are cheaper
claude -p "Now focus on the API layer" --resume "$SESSION" --output-format json
claude -p "Suggest improvements" --resume "$SESSION" --output-format json

When to Resume vs. Start Fresh

Resume vs. Fresh Session

| Scenario | Best Approach | Why |
| --- | --- | --- |
| Follow-up questions on same topic | Resume | Cached context saves 90% on input tokens |
| Same task, different file | New session | Old context is irrelevant noise — wastes tokens |
| Long-running analysis (10+ turns) | Resume with budget | Accumulated context grows — set a budget to cap the total |
| Parallel independent tasks | Separate sessions | No shared context — each session runs independently |

The Cache Math

Understanding the numbers helps you make better decisions about session management:

Cache Impact on a 10-Turn Opus Session

| Scenario | Calculation | Cost |
| --- | --- | --- |
| Fresh call each time | 10 x 15K input tokens at $15/M | $2.25 |
| All cached | 1 cache write + 9 cache reads | $0.65 |
| Savings | | 71% |

For a team of 10 developers each running 50 calls per day, the difference between cached and uncached workloads is roughly $240/day — over $5,000/month in savings from session management alone.

Disable Tools When Not Needed

Tool descriptions inflate the system prompt. For pure text tasks, stripping them out reduces input tokens:

claude -p "Summarize this document" --tools ""

This removes tool definitions from the system prompt, trimming the per-call input token count. The savings are modest per call but compound across thousands of daily invocations.

Cache TTL Awareness

Gotcha

Cache entries have TTLs — 5 minutes for short-lived caches, 1 hour for extended caches. If you batch tasks with long gaps between them, the cache may expire and you pay full write cost again. Keep batch runs tight to maximize cache hits.
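For batch scripts with variable gaps, a small guard can flag when a cache is probably cold. This is a hypothetical helper (the TTL values are the 5-minute / 1-hour figures above; the function name is an assumption):

```bash
#!/bin/bash
# Hypothetical guard: given seconds elapsed since the last call, report
# whether the cache is likely still warm. Default TTL is the 5-minute
# short-lived cache; pass 3600 for the extended cache.
cache_likely_warm() {
  local elapsed=$1 ttl=${2:-300}
  if [ "$elapsed" -lt "$ttl" ]; then echo "warm"; else echo "expired"; fi
}

cache_likely_warm 120        # prints warm
cache_likely_warm 600        # prints expired
cache_likely_warm 600 3600   # extended 1-hour cache: prints warm
```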

Optimizing for Cache TTL

Structure your CI pipelines to group Claude calls together rather than spreading them across pipeline stages:

# Good: Claude calls grouped together (cache stays hot)
jobs:
  claude-review:
    steps:
      - run: claude -p "Lint check" --output-format json
      - run: claude -p "Security check" --output-format json
      - run: claude -p "Doc check" --output-format json

# Wasteful: Claude calls spread across stages (cache expires between)
jobs:
  lint:
    steps:
      - run: eslint .
      - run: claude -p "Lint check" --output-format json # cache write
  test:
    steps:
      - run: npm test
      - run: claude -p "Security check" --output-format json # cache expired, write again

Cache Monitoring

Check cache utilization in the JSON response to verify your optimization is working:

RESULT=$(claude -p "Review this file" --output-format json)

# Check cache metrics (default to 0 so missing fields don't break the math)
CACHE_READ=$(echo "$RESULT" | jq '.usage.cache_read_input_tokens // 0')
CACHE_WRITE=$(echo "$RESULT" | jq '.usage.cache_creation_input_tokens // 0')
INPUT=$(echo "$RESULT" | jq '.usage.input_tokens // 0')

echo "Cache read: $CACHE_READ tokens"
echo "Cache write: $CACHE_WRITE tokens"
echo "Uncached input: $INPUT tokens"

# Cache hit ratio
TOTAL=$((CACHE_READ + CACHE_WRITE + INPUT))
if [ "$TOTAL" -gt 0 ]; then
  RATIO=$(echo "scale=2; $CACHE_READ * 100 / $TOTAL" | bc)
  echo "Cache hit ratio: ${RATIO}%"
fi

A healthy production workload should show 80%+ cache hit ratio on input tokens. If you see high cache_creation_input_tokens on most calls, your batching strategy needs adjustment.

Next Steps

With both model routing and caching optimized, you have the key cost levers in place: Budget Controls limit spending, model optimization reduces per-call cost, and cache optimization reduces per-token cost. For the full cost picture including team benchmarks and startup overhead, see Cost at Scale.

Now Do This

Run claude -p "Hello" --output-format json | jq '.usage' twice in a row. Compare cache_read_input_tokens between the two responses — the second call should show significantly more cache reads. That difference is the 90% savings kicking in. Structure your workloads to maximize this.