Prompt caching: how we reduced LLM spend by 60x (and got 20% faster responses)
TL;DR — Caching the static prompt prefix (tools + system + stable memory) in our E2E testing agent delivered 60x lower cost per test and ~20% lower p95 latency, with the same "accuracy".
Context
Bugster is an agent that drives a real browser to test apps end-to-end. That means multi-step loops with a lot of repeated context: tools, rules, DOM summaries, session state.
Ahead of our Product Hunt launch we ran a one-hour stress test (users + synthetic load). Functionally we were solid (TCR 80.2%), but two things popped:
We were hitting the input-tokens-per-minute (ITPM) cap on our top LLM tier.
Cost/test was too high to scale.
Queues? Bad DX. Heavy memory pruning? Risky this close to launch. We needed a surgical win.
The lever: prompt caching
Prompt caching lets the API reuse the prefix you repeat across calls. In Claude, the cached prefix is built in this order: tools → system → messages (i.e., everything before your cache breakpoint). You mark that breakpoint with cache_control (see Anthropic's docs).
Why it works:
First call: you pay a small write premium on the cached prefix.
Subsequent calls: you pay ~10% of base input for the cached part (cheap) and full price only for the delta you add. The pricing multipliers are documented in Anthropic's docs.
Since Mar 13, 2025, Claude doesn't count cache-read tokens toward ITPM on 3.7 Sonnet, so you get more throughput without changing tiers. OTPM (output tokens per minute) is unchanged.
Results in our agent
Cost: 60x reduction per test (depends on how prefix-heavy the loop is).
Latency: −20% p95 (less work per request + less queuing at ITPM).
Quality: TCR stayed flat (80.1–80.5%).
Rate limits: no longer spiking ITPM, thanks to Anthropic's cache-aware rate limiting.
Reality check: the documented ceiling on input savings is about 10× (cache read ≈ 0.1× base input; write ≈ 1.25× for the 5-min TTL or 2× for the 1-hour TTL). Your overall per-test cost reduction depends on output tokens and on how much of your prompt is cacheable; Anthropic's pricing page has the exact multipliers.
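To make that ceiling concrete, here's a back-of-the-envelope input-cost calculation. The token counts below are invented round numbers for illustration, not our production workload; the multipliers are the documented ones.

// Hypothetical per-test numbers -- swap in your own.
const prefixTokens = 50_000;  // cacheable: tools + system + stable memory
const dynamicTokens = 2_000;  // changes every step
const steps = 30;             // LLM calls per test

// 5-min TTL multipliers: cache write 1.25x, cache read 0.1x base input.
const uncachedInput = steps * (prefixTokens + dynamicTokens);
const cachedInput =
  1.25 * prefixTokens +               // one cache write on the first call
  (steps - 1) * 0.1 * prefixTokens +  // cache reads on the rest
  steps * dynamicTokens;              // dynamic delta, full price every call

console.log(`${(uncachedInput / cachedInput).toFixed(1)}x cheaper on input`);
// ~5.8x with these numbers; the ratio approaches ~10x as the dynamic share shrinks.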
Quick implementation (TypeScript, Claude Messages API)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

// 1) Define tools as usual (these are part of the prefix, so they're included in caching automatically)
const tools = [
  {
    name: "click",
    description: "Click a CSS/XPath selector in the browser",
    input_schema: {
      type: "object",
      properties: { selector: { type: "string" } },
      required: ["selector"]
    }
  },
  {
    name: "type_text",
    description: "Type text into a selector",
    input_schema: {
      type: "object",
      properties: { selector: { type: "string" }, text: { type: "string" } },
      required: ["selector", "text"]
    }
  }
];

// 2) Put your long, stable instructions in `system`
// 3) Mark the END of the static prefix with cache_control
const system = [
  { type: "text", text: "You are Bugster, an E2E testing agent. Use tools only. Be terse." },
  { type: "text", text: "<...lots of stable rules / few-shots / DOM policy...>", cache_control: { type: "ephemeral" } }
];

const res = await client.messages.create({
  model: "claude-3-7-sonnet-20250219",
  max_tokens: 512,
  tools,
  system,
  messages: [
    // dynamic part starts AFTER the breakpoint
    { role: "user", content: "Step 14: open the calendar and pick Aug-13." }
  ]
});

// Inspect caching effectiveness (both counters live on `usage`):
console.log(res.usage.cache_creation_input_tokens, res.usage.cache_read_input_tokens);
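Subsequent steps in the same test hit the cache as long as the tools and system blocks stay byte-identical. Here is a minimal loop sketch reusing the client, tools and system from above (the step prompts are placeholders, not our real test plan):

// Each call repeats the SAME tools + system prefix; only `messages` changes.
const stepPrompts = [
  "Step 1: open /login and sign in as the seeded test user.",
  "Step 2: navigate to the calendar page.",
];

for (const step of stepPrompts) {
  const r = await client.messages.create({
    model: "claude-3-7-sonnet-20250219",
    max_tokens: 512,
    tools,
    system, // unchanged => read from cache after the first call
    messages: [{ role: "user", content: step }]
  });
  // Expect cache_creation_input_tokens > 0 on the first call,
  // then cache_read_input_tokens > 0 on the following ones.
  console.log(r.usage.cache_creation_input_tokens, r.usage.cache_read_input_tokens);
}

In the real agent loop we also append prior turns to messages; everything before the breakpoint still caches, and each read within the TTL refreshes it.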
Notes:
Order matters conceptually (cache builds prefix as tools → system → messages). You only need the breakpoint once at the end of the static block.
Pricing multipliers (5-min TTL): write 1.25×, read 0.1× base input. 1-hour TTL: write 2× (see Anthropic's pricing docs).
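If you opt into the 1-hour TTL, it is set per breakpoint via a ttl field on cache_control (gated behind a beta header at the time of writing); treat the exact syntax below as a sketch to verify against Anthropic's current docs:

// Sketch only -- check the ttl field / beta requirements in Anthropic's docs.
const systemLongTtl = [
  { type: "text", text: "You are Bugster, an E2E testing agent. Use tools only. Be terse." },
  {
    type: "text",
    text: "<...lots of stable rules / few-shots / DOM policy...>",
    cache_control: { type: "ephemeral", ttl: "1h" }  // omit ttl for the default 5-min cache
  }
];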
10-minute rollout checklist
Identify the static prefix: tool defs, system prompt, stable memory, canonical examples.
Place it first, then add a single cache_control block at the end. Keep dynamic content after the breakpoint.
Measure cache_creation_input_tokens / cache_read_input_tokens and hit-rate (a minimal tracking sketch follows this list).
Tune TTL: 5-min (default) or 1-hour (beta/GA depending on model).
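For the hit-rate measurement, a tiny accumulator over the usage fields is enough; this is a sketch, not an SDK feature:

// Rough cache hit-rate: cache reads as a share of all prompt tokens sent.
let readTokens = 0;
let totalPromptTokens = 0;

function trackUsage(usage: {
  input_tokens: number;                          // non-cached portion of the prompt
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}) {
  const created = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  readTokens += read;
  totalPromptTokens += usage.input_tokens + created + read;
}

// After each call: trackUsage(res.usage);
// Periodically: console.log(`cache hit-rate: ${(100 * readTokens / totalPromptTokens).toFixed(1)}%`);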
Wrap-up
This wasn’t “optimization theater.” We simply stopped paying over and over for the same prefix. If your agent repeats tool specs and instructions each step, prompt caching is likely the highest-ROI switch you can flip today.