AI Tools & Technology · 6 min · 4 May 2025

What is prompt caching and how does it save API costs?

If you build an AI application where the system prompt is identical on every request, you pay each time to process that same text. Prompt caching solves this: repeated input is billed at a reduced rate, and response times drop.

Prompt caching is a feature that several major AI providers now offer, including Anthropic and OpenAI. It reduces costs and latency for applications where a large part of the input is identical across requests. For business AI applications at scale, this can deliver significant savings.

How does prompt caching work?

When you send a request to the API, the model processes the entire input from scratch: the system prompt, the knowledge base context and the user message. If the system prompt is long and identical on every request, that is wasted work: you pay again and again to process the same text.

With prompt caching, a portion of the input is stored in the provider's server memory. On the next request, that cached part does not have to be processed again. You pay a reduced rate for the cached tokens: with Anthropic this is typically 90% less than the standard input rate. Response time also drops, because processing the cached part is skipped.

When is prompt caching useful?

Prompt caching is most valuable when:

  • You have a long, fixed system prompt that is identical on every request
  • You include a large knowledge base or document collection in the context of every request
  • The same RAG passages recur frequently across requests
  • Your chatbot resends a long conversation history with every turn

If your system prompt is only a few hundred tokens, the savings are limited. Caching pays off only when the cached section is substantial: several thousand tokens or more.

How do you implement it with Anthropic?

With Anthropic, you enable caching by adding a cache_control parameter to the content blocks in your API request that you want to cache. You explicitly mark which parts of the input may be cached. That gives you control: you decide which sections are stable enough to be worth caching.
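
A minimal sketch with the Anthropic Python SDK, assuming a system prompt large enough to qualify for caching (the model name and prompt contents are placeholders; check the current documentation for minimum cacheable sizes):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # your stable, multi-thousand-token instructions

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # example model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark this block as cacheable; later requests that start with
            # the identical block read it from the cache at the reduced rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What is our refund policy?"}],
)

# The usage object reports cache activity, so you can verify hits:
print(response.usage.cache_creation_input_tokens)  # tokens written to the cache
print(response.usage.cache_read_input_tokens)      # tokens served from the cache
```

The first request writes the cache; identical requests within the cache lifespan report their savings under cache_read_input_tokens.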

The cache has a limited lifespan (typically five minutes with Anthropic, refreshed each time the cache is read). If the interval between requests exceeds that lifespan, the cache expires and is rebuilt on the next request. Plan your request frequency with that lifespan in mind.
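
For bursty traffic, one option is a keep-warm ping that re-sends the cached prefix shortly before the lifespan lapses. A rough sketch, assuming a five-minute lifespan; send_cached_request is a hypothetical callable that issues any request containing the cached prefix, and whether the ping is cheaper than rebuilding the cache depends on your rates and volumes:

```python
import threading
import time

CACHE_TTL_SECONDS = 5 * 60  # assumed lifespan; check your provider's documentation

def keep_cache_warm(send_cached_request, interval: float = CACHE_TTL_SECONDS * 0.8):
    """Re-send a request containing the cached prefix before the lifespan lapses.

    Because the lifespan resets on every cache read, a periodic ping keeps the
    cache alive through quiet periods.
    """
    def _loop():
        while True:
            send_cached_request()  # hypothetical: any request hitting the cached prefix
            time.sleep(interval)

    threading.Thread(target=_loop, daemon=True).start()
```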

How do you implement it with OpenAI?

OpenAI's prompt caching works automatically for long, repeated prompt prefixes. You do not need to configure anything: the system recognises repeated parts and bills them at a reduced rate. The saving is visible in your usage overview.
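
A minimal sketch with the OpenAI Python SDK; nothing caching-specific is configured, and the cached-token count appears in the usage breakdown (the model name is an example, and OpenAI applies caching only above a minimum prefix length, so keep the stable part first and substantial):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # stable instructions, sent as an identical prefix

response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # stable prefix first
        {"role": "user", "content": "What is our refund policy?"},
    ],
)

# No cache parameter needed: sufficiently long, repeated prefixes are cached
# automatically, and the usage breakdown shows how much was served from cache.
print(response.usage.prompt_tokens_details.cached_tokens)
```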

The downside of the automatic approach is that you have less control over what is and is not cached; Anthropic's explicit approach gives you more.

Concrete cost savings

Suppose you have a chatbot with a system prompt of 5,000 tokens and a knowledge base context of 10,000 tokens per request: 15,000 input tokens processed on every request. At 10,000 requests per day and an input rate of $3 per million tokens, that costs $450 per day.

With prompt caching at a 90% discount on cached tokens, and assuming 13,500 of those 15,000 tokens are served from the cache on each request, you pay for 1,500 tokens per request at the normal rate (about $45 per day) and 13,500 tokens at 10% of the rate (about $40.50 per day). Daily costs drop from $450 to roughly $85, a saving of more than 80%.
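
The same back-of-the-envelope model as a script, so you can plug in your own figures (all constants below are the assumptions from the example, not quoted prices):

```python
# Assumed figures from the example above; substitute your own.
REQUESTS_PER_DAY = 10_000
TOKENS_PER_REQUEST = 15_000
RATE_PER_M_TOKENS = 3.00   # $ per million input tokens
CACHED_DISCOUNT = 0.10     # cached tokens billed at 10% of the rate
CACHED_FRACTION = 0.90     # fraction of the input served from the cache

daily_tokens = REQUESTS_PER_DAY * TOKENS_PER_REQUEST

baseline = daily_tokens / 1e6 * RATE_PER_M_TOKENS
cached = daily_tokens * CACHED_FRACTION / 1e6 * RATE_PER_M_TOKENS * CACHED_DISCOUNT
uncached = daily_tokens * (1 - CACHED_FRACTION) / 1e6 * RATE_PER_M_TOKENS

print(f"Without caching: ${baseline:,.2f}/day")           # $450.00
print(f"With caching:    ${cached + uncached:,.2f}/day")  # $85.50
```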

What are the limitations?

Caching only works for identical input. If the system prompt or knowledge base context varies even slightly between requests, the cache is missed. Keep variable parts of your prompt out of the cached section.
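
In practice, this means structuring requests so that everything stable forms an identical prefix and everything variable comes after it. A sketch in the style of the Anthropic example above (build_request and its parameters are hypothetical):

```python
STABLE_INSTRUCTIONS = "..."  # long and identical on every request

def build_request(customer_tier: str, question: str) -> dict:
    """Keep the cacheable prefix identical; put anything variable after it."""
    return {
        "system": [
            {
                "type": "text",
                "text": STABLE_INSTRUCTIONS,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            # Per-request data (user details, timestamps) lives here,
            # outside the cached section, so it never breaks the cache.
            {
                "role": "user",
                "content": f"Customer tier: {customer_tier}\n\n{question}",
            },
        ],
    }
```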

Caching is also less useful at low volumes. The savings only become significant at hundreds or thousands of requests per day.

Conclusion

Prompt caching is a practical way to lower API costs for AI applications at scale. It requires little implementation effort but can reduce costs considerably at high usage volumes. Mach8 applies prompt caching as standard for production clients where it is applicable.

Want to make your AI application more cost-efficient? Get in touch with Mach8.
