Prompt Caching Strategy for Maximum Cache Hits

Output

Prompt Caching Strategy

How Anthropic Prompt Caching Works

The Anthropic API can cache the prefix of a prompt, up to a marked breakpoint. When a later request starts with exactly the same prefix, the cached input tokens are billed at 10% of the base input price (90% less) and the response starts faster.

Cache write	1.25x base input price (paid when the prefix is first written, and again after it expires)
Cache read	0.1x base input price (90% savings)
TTL	5 minutes by default (refreshed on each cache hit)
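
To make the pricing concrete, here is a small sketch of how the write premium is amortized over repeated requests. It assumes every request arrives within the TTL, so there is one write followed by reads; the function names are illustrative, and costs are expressed in units of the base input price for the prefix.

```typescript
// Multipliers taken from the table above, relative to the base input price.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

/** Cost of sending the same prefix `requests` times with caching: one write, then reads. */
function cachedPrefixCost(requests: number): number {
  if (requests <= 0) return 0;
  return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (requests - 1);
}

/** Cost of sending the same prefix `requests` times without caching. */
function uncachedPrefixCost(requests: number): number {
  return Math.max(0, requests);
}

/** Fraction saved on the prefix; a negative value means caching cost more. */
function prefixSavings(requests: number): number {
  const base = uncachedPrefixCost(requests);
  return base === 0 ? 0 : 1 - cachedPrefixCost(requests) / base;
}
```

Under these assumptions a prefix that is sent only once costs 25% more than without caching, and caching pays for itself from the second request onward. The savings approach 90% only as the number of reads grows.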

Key Principle: Stable Prefix

┌──────────────────────────────────────────────────────────┐
│  CACHED (stable across requests)                       │
│  ┌────────────────────────────────────────────────────┐  │
│  │ System prompt (CLAUDE.md core)              ~8k   │  │
│  │ Project structure                          ~2k   │  │
│  │ Tool definitions                           ~4k   │  │
│  │ Output instructions (kontask format)       ~2k   │  │
│  └────────────────────────────────────────────────────┘  │
├──────────────────────────────────────────────────────────┤
│  DYNAMIC (changes per request)                        │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Session history / conversation             var   │  │
│  │ Current request context                    var   │  │
│  │ User prompt                                var   │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Implementation Strategy

1. Freeze CLAUDE.md: Avoid frequent edits. Any change to the cached prefix invalidates the cache for every user who shares it.
2. Version context: Track a hash or version of the stable context. Requests built from the same version produce an identical prefix, which is what a cache hit requires.
3. Order matters: Put stable content first and dynamic content last.
4. Batch similar: Group requests by scope (vibetools vs product) so they share the same cached context.
5. Keep sessions warm: Activity within 5 minutes refreshes the cache TTL.
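
The "version context" idea can be sketched as a fingerprint over the stable parts of the context, logged or compared between deployments to spot accidental prefix drift. This is only an illustration: `simpleHash` and `stablePrefixVersion` are hypothetical names, and the djb2-style hash is used solely to keep the example dependency-free (a real implementation would more likely use SHA-256).

```typescript
// djb2-style string hash (XOR variant) over UTF-16 code units.
// Not cryptographic; chosen only so the sketch needs no imports.
function simpleHash(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    // Stay within unsigned 32-bit range after each step.
    hash = ((hash * 33) ^ text.charCodeAt(i)) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Fingerprint of the stable context, in the order the parts are sent.
// Any edit to any part, or any reordering, changes the joined text.
function stablePrefixVersion(parts: string[]): string {
  return simpleHash(parts.join("\n"));
}
```

The fingerprint is for monitoring only. The cache itself matches on the prompt content, so the version string does not need to be sent in the prompt, and embedding it at the start of the prefix would invalidate everything after it on each bump.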

Proposed Context Structure

// Layer 1: Universal (cached across ALL requests)
[CACHE_CONTROL: ephemeral]
- Base system prompt (persona, safety)
- Tool definitions
- Output format requirements

// Layer 2: Scope-specific (cached within scope)
[CACHE_CONTROL: ephemeral]
- IF vibetools: konui/konsole docs
- IF product: listings/CMS docs
- Relevant CLAUDE.md sections

// Layer 3: Session (not cached, changes each turn)
[NO CACHE]
- Conversation history
- Current working context
- User's prompt
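
As a rough sketch, the three layers could map onto a Messages API request body like this. The `system` array of text blocks with `cache_control: { type: "ephemeral" }` follows the API's request shape; `buildRequest`, its parameters, and the `max_tokens` value are hypothetical. Tool definitions are left out here: they go in the separate `tools` field, which the API places before `system` in the cached prefix.

```typescript
type CacheControl = { type: "ephemeral" };
type TextBlock = { type: "text"; text: string; cache_control?: CacheControl };
type Message = { role: "user" | "assistant"; content: string };

interface LayeredRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: Message[];
}

// Hypothetical builder: Layer 1 and Layer 2 each end with a cache breakpoint,
// Layer 3 (history plus the new prompt) is sent without one.
function buildRequest(
  model: string,
  universal: string, // Layer 1: base system prompt and output format
  scopeDocs: string, // Layer 2: scope-specific docs
  history: Message[], // Layer 3: conversation so far
  userPrompt: string
): LayeredRequest {
  const ephemeral: CacheControl = { type: "ephemeral" };
  return {
    model: model,
    max_tokens: 1024, // assumed value for illustration
    system: [
      { type: "text", text: universal, cache_control: ephemeral },
      { type: "text", text: scopeDocs, cache_control: ephemeral },
    ],
    messages: [...history, { role: "user", content: userPrompt }],
  };
}
```

With two breakpoints, a scope switch still reuses the Layer 1 cache, while requests within the same scope reuse both layers.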

Quick Win: Quick Turn Caching

Quick Turn is stateless - perfect for caching:

// Every Quick Turn request uses SAME system prompt
const QT_SYSTEM = `You are a fast Q&A assistant.
Answer briefly and directly.
No tools, no file access, just knowledge.`;

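// Caveat: at ~50 tokens this prompt is below the API's minimum cacheable
// prefix length (1,024 tokens or more, depending on the model), so it is
// not cached on its own. The 90% saving applies once the shared prefix
// exceeds that minimum.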
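
A guard like the following could decide whether adding a cache marker is worthwhile for a given prefix, since the API only caches prefixes above a model-dependent minimum length. It is a sketch under stated assumptions: the 4-characters-per-token estimate is a crude heuristic rather than a tokenizer, the 1,024-token default is an assumed minimum that varies by model, and `estimateTokens` and `isLikelyCacheable` are hypothetical names.

```typescript
// Rough heuristic only: ~4 characters per token is an assumption, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// minCacheableTokens is model-dependent; 1024 is an assumed default.
function isLikelyCacheable(prefix: string, minCacheableTokens: number = 1024): boolean {
  return estimateTokens(prefix) >= minCacheableTokens;
}
```

For a short fixed prompt this returns false, which suggests padding the shared prefix with genuinely useful stable content (for example format rules or reference docs) before expecting cache hits.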

Expected Savings

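Estimated input-token cost relative to uncached requests (caching does not change output-token cost):

Scenario	Input cost before	Input cost after	Savings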
Quick Turn (stateless)	100%	10%	90%
Full turn (same scope)	100%	40%	60%
Full turn (scope switch)	100%	70%	30%
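
One way to read these estimates is as a blend of cached tokens at 0.1x and uncached tokens at 1x. The sketch below assumes steady state (all cached tokens are reads, ignoring the 1.25x write on the first request) and counts input tokens only; the function names are illustrative.

```typescript
const READ_MULTIPLIER = 0.1; // cache read price relative to base input price

/** Relative input cost when `cachedFraction` of the input tokens are cache reads. */
function blendedInputCost(cachedFraction: number): number {
  return cachedFraction * READ_MULTIPLIER + (1 - cachedFraction);
}

/** Inverse: the cached fraction needed to reach a given relative input cost. */
function impliedCachedFraction(relativeCost: number): number {
  return (1 - relativeCost) / (1 - READ_MULTIPLIER);
}
```

Under this model the table's 10%, 40%, and 70% rows correspond to all, about two thirds, and about one third of the input tokens being served from cache.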

Implementation Path

1. Quick Turn: Add a fixed system prompt with a cache_control marker on its content block
2. Konsole: Layer context with the stable prefix first
3. Monitor: Track cache_read_input_tokens in StatusLine data
4. Optimize: A/B test context orderings for the best cache hit rate
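
For the monitoring step, a cache hit rate can be derived from the `usage` object returned with each Messages API response, where `input_tokens` counts only the tokens that were neither read from nor written to the cache. The sketch below assumes the StatusLine data exposes these same fields, which is not confirmed here; `Usage` and `cacheHitRate` are illustrative names.

```typescript
// Subset of the Messages API usage fields relevant to caching.
interface Usage {
  input_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

/** Share of all input tokens that were served from cache, from 0 to 1. */
function cacheHitRate(usages: Usage[]): number {
  let read = 0;
  let total = 0;
  for (const u of usages) {
    const created = u.cache_creation_input_tokens || 0;
    const hit = u.cache_read_input_tokens || 0;
    read += hit;
    total += u.input_tokens + created + hit;
  }
  return total === 0 ? 0 : read / total;
}
```

Comparing this rate across context orderings gives a simple metric for the A/B test in step 4.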

Details

Type General

Status Completed

Scope vibetools

Tags performance, caching, architecture

Created 5 Jan 2026, 2:16 pm

Updated 5 Jan 2026, 2:16 pm

Created By claude

Raw Data

{
  "id": "927ccf49-29bb-43ee-a2ad-aba03c7a9af3",
  "type": "general",
  "status": "completed",
  "title": "Prompt Caching Strategy for Maximum Cache Hits",
  "description": "Design for consistent context preambles to maximize Anthropic prompt caching",
  "context": {
    "requestedAt": "2026-01-05T04:02:00Z",
    "requestId": "16086a6a-5b7b-4094-b286-bcbe7fd5eee4",
    "choices": [
      {
        "label": "Implement QT caching",
        "value": "Add prompt caching to Quick Turn - fixed system prompt with cache_control header"
      },
      {
        "label": "Add to backlog",
        "value": "Add prompt caching strategy to the VIBE.md backlog"
      },
      {
        "label": "Monitor current usage",
        "value": "Check current cache hit rates from StatusLine data"
      }
    ]
  },
  "createdBy": "claude",
  "createdAt": "2026-01-05T04:16:13.935Z",
  "updatedAt": "2026-01-05T04:16:14.141Z",
  "requestId": "16086a6a-5b7b-4094-b286-bcbe7fd5eee4",
  "scope": "vibetools",
  "tags": [
    "performance",
    "caching",
    "architecture"
  ],
  "targetUser": "claude"
}