Anthropic just rolled out prompt caching in the Anthropic API, which cuts API input costs by up to 90% and reduces latency by up to 80%. Claude 3.5 Sonnet is one of the most advanced LLMs available, but it is also one of the most expensive: the current price is $3 per million input tokens and $15 per million output tokens, which is substantial at scale.
Who is Prompt Caching useful for?
Prompt caching is useful for applications where the same prompt, or a large shared portion of it, is sent repeatedly.
- For AI assistants such as Perplexity, Bind AI, and Notion AI, which use Claude models and expect multiple users to send the same prompt.
- For code generation use cases where you reuse the same prompt or have multiple users work from the same template.
- For web search use cases, especially tools such as Perplexity, where the same information and context are queried multiple times.
How does Claude Prompt Caching Work?
To use Claude prompt caching, you call the Anthropic API and add a cache_control attribute to the content blocks you want to cache. This is what the attribute and the required beta header look like:
"cache_control": {"type": "ephemeral"}
"anthropic-beta": "prompt-caching-2024-07-31"
When you make an API call with these additions, the API checks whether the designated parts of your prompt are already cached from a recent query. If so, it reuses the cached version, which speeds up processing and reduces costs.
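As a rough illustration, here is a minimal sketch of such a request in Python, calling the Messages API directly with the requests library. The model name, system prompt, and ANTHROPIC_API_KEY environment variable are placeholder assumptions; the cache_control block and beta header match the snippets above.

```python
import os
import requests

# Stand-in for the large, reusable context you actually want cached.
LONG_SYSTEM_PROMPT = "You are a coding assistant. <several thousand tokens of shared instructions>"

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],  # assumed to be set
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "prompt-caching-2024-07-31",  # opt in to the caching beta
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Mark this block as cacheable; repeat calls can reuse it.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": "Explain this stack trace: ..."}],
    },
)
print(response.json())
```

On the first call, the marked prefix is written to the cache; repeat calls within the cache lifetime that share the same prefix are served from it, and the response's usage block reports whether tokens were written to or read from the cache.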
Costs for Prompt Caching
The initial API call, which writes the prompt to the cache, costs $3.75 per million tokens, a 25% premium over the normal $3 input price. Once the prompt is cached, every subsequent read costs one-tenth the normal input price, $0.30 per million tokens.
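To see how quickly that pays off, here is a small back-of-the-envelope calculation using the prices above; the 100k-token prompt and 50-call volume are made-up illustration values.

```python
# Prices from the section above, in dollars per million input tokens.
BASE_INPUT = 3.00    # normal input price
CACHE_WRITE = 3.75   # first call: writes the prompt to the cache (25% premium)
CACHE_READ = 0.30    # later calls: read the cached prompt (one-tenth of base)

def input_cost(cached_tokens: int, calls: int, use_cache: bool) -> float:
    """Input-token cost of sending the same cached_tokens-long prefix `calls` times."""
    millions = cached_tokens / 1_000_000
    if not use_cache:
        return calls * millions * BASE_INPUT
    # One cache write, then cache reads for the remaining calls.
    return millions * (CACHE_WRITE + (calls - 1) * CACHE_READ)

# Hypothetical example: a 100k-token shared prompt reused across 50 calls.
without = input_cost(100_000, 50, use_cache=False)    # $15.00
with_cache = input_cost(100_000, 50, use_cache=True)  # $0.375 + $1.47 = ~$1.85
print(f"without caching: ${without:.2f}, with caching: ${with_cache:.2f}")
print(f"savings: {100 * (1 - with_cache / without):.0f}%")  # ~88%
```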
Prompt caching works in multi-turn conversations too. You can progressively move the cache_control breakpoints forward to cache previous turns as the conversation advances. This is especially useful in combination with features like tool use, which can add many tokens to the context window each turn.
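One pattern, sketched below under the assumption that earlier turns stay byte-for-byte identical, is to tag the last message of the conversation so far as the cache breakpoint; on the next turn the tag moves forward and everything before it can be served from the cache. The mark_cache_breakpoint helper is hypothetical, not part of the Anthropic SDK.

```python
import copy

def mark_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Hypothetical helper: tag the final content block with cache_control so the
    whole conversation up to this point becomes a cacheable prefix."""
    tagged = copy.deepcopy(messages)
    last = tagged[-1]
    # Normalize plain-string content into the block form that accepts cache_control.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return tagged

# Turn 1: the breakpoint sits on the first (long) user message.
history = [{"role": "user", "content": "Here is a long document: ..."}]
turn_1_messages = mark_cache_breakpoint(history)

# Turn 2: append the assistant reply and the new question, then move the breakpoint;
# everything before it can now be read from the cache (within the cache lifetime).
history += [
    {"role": "assistant", "content": "Summary of the document..."},
    {"role": "user", "content": "Now list the key risks."},
]
turn_2_messages = mark_cache_breakpoint(history)
```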
Faster Response Latencies
You can get up to 79% faster responses for cached prompts, with a potential 90% cost reduction. Anthropic expects further latency improvements over the coming weeks, particularly for shorter prompts of a few thousand tokens.
- Cache lifetime (TTL) is 5 minutes, resetting with each use
- Prompts are cached at 1024-token boundaries
- You can define up to 4 cache breakpoints in a prompt (see the sketch after this list)
- Support for caching prompts shorter than 1024 tokens is coming soon
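For completeness, here is a hedged sketch of what multiple breakpoints in a single request might look like, caching tool definitions, the system prompt, and a long document as separate prefixes. All names and contents below are placeholders, not taken from the article.

```python
# Sketch of a single request body with several cache breakpoints (up to 4 allowed).
# Each cache_control marks the end of a cacheable prefix: tools, then system,
# then the long document inside the first user message.
payload = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "search_docs",  # placeholder tool definition
            "description": "Search the product documentation.",
            "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
            "cache_control": {"type": "ephemeral"},  # breakpoint 1: tool definitions
        }
    ],
    "system": [
        {
            "type": "text",
            "text": "<shared assistant instructions>",
            "cache_control": {"type": "ephemeral"},  # breakpoint 2: system prompt
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<a very long reference document>",
                    "cache_control": {"type": "ephemeral"},  # breakpoint 3: document
                },
                {"type": "text", "text": "Answer questions using the document above."},
            ],
        }
    ],
}
```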
Here are some helpful links to read more about it:
- An interactive artifact that calculates the cost savings
- A code example in the Anthropic Cookbook
- Claude documentation for Prompt Caching
- A tweet from Anthropic's developer relations team