
What is Claude Prompt Caching? How does it work?

Anthropic just rolled out prompt caching in the Anthropic API, which cuts API input costs by up to 90% and reduces latency by up to 80%. Claude 3.5 Sonnet is one of the most advanced LLMs available, and it is also one of the most expensive: the current price for the Sonnet model is $3 per million input tokens and $15 per million output tokens, which is substantial.

Who is Prompt Caching useful for?

Prompt caching is useful for applications that reuse the same prompt content across many requests, such as a long system prompt, large reference documents, or a set of few-shot examples sent with every call.

How does Claude Prompt Caching Work?

To use Claude prompt caching, you call the Anthropic API and add the cache_control attribute to the content blocks you want to cache, together with the beta header. This is what they look like:

"cache_control": {"type": "ephemeral"}

"anthropic-beta": "prompt-caching-2024-07-31"

 

When you make an API call with these additions, the API checks whether the designated prefix of your prompt has already been cached by a recent request. If so, it reuses the cached prefix, reducing processing time and input costs.
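
Here is a minimal sketch of what such a request can look like when sent straight to the Messages API using Python's requests library; the model name, document text, and environment variable below are placeholders, not part of the announcement:

import os
import requests

# Illustrative placeholder for several thousand tokens of reference material.
LONG_DOCUMENT = "<long reference document goes here>"

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "prompt-caching-2024-07-31",  # opt into the caching beta
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        # Everything up to and including the block marked with cache_control
        # becomes the cached prefix that later calls can reuse.
        "system": [
            {"type": "text", "text": "You answer questions about the document below."},
            {
                "type": "text",
                "text": LONG_DOCUMENT,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": "Summarize the key points."}],
    },
    timeout=60,
)
print(response.json())

Repeating a request like this within the cache lifetime reuses the cached document instead of reprocessing it from scratch.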

Costs for Prompt Caching

Writing a prompt to the cache costs $3.75 per million tokens, a 25% premium over the standard $3 input price, which accounts for storing the prompt in the cache. Once cached, every subsequent read is one-tenth the normal input price: $0.30 per million tokens.
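
A quick back-of-the-envelope calculation makes the savings concrete; the 10,000-token prompt and 50 calls below are made-up numbers for illustration:

# Illustrative cost comparison for a 10,000-token prompt reused across 50 calls.
# Prices: $3/MTok normal input, $3.75/MTok cache write, $0.30/MTok cache read.
PROMPT_TOKENS = 10_000
CALLS = 50

without_cache = CALLS * PROMPT_TOKENS * 3.00 / 1_000_000
with_cache = (PROMPT_TOKENS * 3.75 / 1_000_000                    # first call writes the cache
              + (CALLS - 1) * PROMPT_TOKENS * 0.30 / 1_000_000)   # later calls read from it

print(f"without caching: ${without_cache:.4f}")                   # $1.5000
print(f"with caching:    ${with_cache:.4f}")                      # $0.1845
print(f"savings:         {1 - with_cache / without_cache:.0%}")   # 88%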

Prompt caching works in multi-turn conversations too. You can progressively move the cache control breakpoints forward to cache previous turns as the conversation advances. This is useful in combination with features like Tool Use, which can add many tokens to the context window each turn.
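
One way to structure that, sketched below with a hypothetical build_messages helper (the payload shape beyond cache_control is illustrative), is to place the breakpoint on the newest user turn so everything before it becomes the cached prefix:

def build_messages(history, new_user_text):
    """Return a messages list with the cache breakpoint on the newest user turn.

    `history` is assumed to be a list of prior {"role", "content"} turns.
    """
    messages = [
        # Earlier turns are plain blocks; they fall inside the cached prefix.
        {"role": turn["role"], "content": turn["content"]}
        for turn in history
    ]
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": new_user_text,
                # Marking the newest turn caches everything up to this point,
                # so the next request only pays full input price for what follows.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    })
    return messages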

Faster Response Latencies

You can get up to 79% faster responses for cached prompts, along with a potential 90% cost reduction. Anthropic expects further latency improvements over the coming weeks, particularly for shorter prompts of a few thousand tokens in length.

– Cache lifetime (TTL) is 5 minutes, resetting with each use (the sketch after this list shows how to confirm cache reads and writes)

– Prompts are cached at 1024-token boundaries

– You can define up to 4 cache breakpoints in a prompt

– Support for caching prompts shorter than 1024 tokens is coming soon
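
To confirm whether a given call wrote to the cache or read from it, and that the five-minute TTL above is actually being refreshed, you can inspect the usage block of the API response. A small sketch, reusing the response object from the earlier requests example:

# The usage block reports cache activity alongside the normal token counts.
usage = response.json()["usage"]

print("tokens written to cache:", usage.get("cache_creation_input_tokens", 0))
print("tokens read from cache: ", usage.get("cache_read_input_tokens", 0))
print("uncached input tokens:  ", usage["input_tokens"])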

Here are some helpful links to read more about it: