Almost all API-based AI services charge by the token. But what exactly is a token? And how do you prevent costs from escalating unexpectedly as your AI usage scales? This article explains the basics and offers concrete tips for cost management.
Tokens are the unit of measurement for AI language models. Every word, every punctuation mark and every space a model processes or produces costs tokens. Understanding how tokens work also helps you understand why some AI applications are cheaper than others, and how you can manage costs.
A token is not a word and not a letter. It is a piece of text that the model processes as a unit. In English, most short words are one token. Longer words are split into multiple tokens. Punctuation marks and spaces are also tokens.
As a rule of thumb: 100 words of English text are approximately 130-150 tokens. Dutch or other European texts are generally slightly more expensive per word than English, because the tokeniser most models use is optimised for English.
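The rule of thumb above can be turned into a quick back-of-the-envelope estimator. This is a heuristic only (the 1.4 multiplier is an assumption in the middle of the 1.3-1.5 range); for exact counts you should use the provider's own tokeniser.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.4) -> int:
    """Rough token estimate using the ~1.3-1.5 tokens-per-word rule of thumb.

    For exact counts, use the provider's tokeniser tool; this heuristic is
    only for quick cost estimates.
    """
    words = len(text.split())
    return round(words * tokens_per_word)

print(estimate_tokens("Tokens are the unit of measurement for AI language models."))  # → 14
```

Note that this over- or undershoots for code, non-English text, or text with many rare words, where real tokenisers split words into more pieces.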
AI models charge for two streams: input and output. Input is the tokens you send to the model: the system prompt, the conversation history and the user's current message. Output is the tokens the model sends back as a response.
Output tokens typically cost two to five times more than input tokens. That makes longer answers relatively expensive. If you have a chatbot that gives extensive answers, you pay significantly more than a chatbot that responds concisely.
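To see what that price asymmetry means in practice, here is a minimal cost calculation. The prices are illustrative (not current list prices); the only assumption is that output costs four times as much as input, in the middle of the two-to-five-times range.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of a single request, with prices given per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Illustrative prices: $3 per million input tokens, $12 per million output tokens.
concise = request_cost(1_000, 150, input_price=3.0, output_price=12.0)
verbose = request_cost(1_000, 900, input_price=3.0, output_price=12.0)
print(concise, verbose)  # the verbose answer costs nearly 3x as much per request
```

With identical input, a six-times-longer answer nearly triples the cost of the request — which is why instructing the model to answer concisely is one of the cheapest optimisations available.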
The most costly situations are long conversations, verbose answers and bloated system prompts. Fortunately, there are concrete ways to keep those costs under control:
Use smaller models for simple tasks: Claude Haiku, GPT-4o mini and similar compact models are a fraction of the price of the most powerful variants. For FAQ chatbots and simple tasks, that is more than sufficient.
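One way to apply this is a simple routing table that defaults to the cheapest model and escalates only for tasks that need it. The model names and task categories below are illustrative placeholders, not exact API identifiers.

```python
# Hypothetical routing table: map task complexity to a model tier.
# Model names are illustrative; use your provider's actual model identifiers.
MODEL_BY_TASK = {
    "faq": "claude-haiku",            # compact, cheap
    "summary": "claude-haiku",
    "analysis": "claude-sonnet",      # mid tier
    "complex_reasoning": "claude-opus",
}

def pick_model(task_type: str) -> str:
    # Default to the cheapest model; escalate only for known heavy tasks.
    return MODEL_BY_TASK.get(task_type, "claude-haiku")
```

The key design choice is the default: unknown tasks fall back to the cheap model, so costs only rise when you explicitly opt in to a heavier one.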
Limit conversation history: Do not send the full conversation history when it is not needed. A summary of earlier messages instead of the literal text saves tokens.
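A minimal sketch of this idea: keep only the most recent messages and replace the rest with a summary. The summary below is a placeholder string; in practice you would generate it with a cheap model.

```python
def trim_history(messages: list[dict], keep_last: int = 4) -> list[dict]:
    """Keep the most recent messages; replace older ones with a summary stub.

    The summary is a placeholder here -- in a real application you would
    generate it with a small, cheap model.
    """
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "user",
               "content": f"[Summary of {len(older)} earlier messages]"}
    return [summary] + recent
```

With `keep_last=4`, a twenty-turn conversation shrinks to five messages per request instead of twenty, and the saving grows with every turn.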
Compress your system prompt: Test whether a shorter, less extensive system prompt works just as well. Every token in the system prompt counts with every request.
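Because the system prompt is billed on every request, savings compound quickly. A small worked example, with illustrative figures (prompt sizes, request volume and price are all assumptions):

```python
# Trimming a 2,000-token system prompt to 800 tokens, at an assumed
# 100,000 requests per month and $3 per million input tokens.
requests_per_month = 100_000
saved_tokens = (2_000 - 800) * requests_per_month  # 120 million tokens/month
saved_cost = saved_tokens * 3.0 / 1_000_000        # price per million tokens
print(saved_cost)  # → 360.0
```

A one-off editing session on the system prompt can thus save hundreds of dollars per month at moderate volume.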
Use prompt caching: Anthropic and OpenAI both offer forms of caching where repeated input is processed more cheaply. This is relevant if you have a long system prompt that is the same with every request.
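As a sketch of what this looks like in practice, here is a request body following Anthropic's documented prompt-caching format, where the long, unchanging system prompt is marked with `cache_control` so repeated requests can reuse the cached prefix at a reduced input rate. The model name and prompt text are placeholders; check your provider's documentation for the exact fields and minimum cacheable prompt size.

```python
# Anthropic-style prompt caching: mark the stable system prompt as cacheable.
LONG_SYSTEM_PROMPT = "You are a support assistant for ExampleCorp. (imagine several thousand tokens of instructions here)"

request_body = {
    "model": "claude-haiku",  # illustrative model name, not an exact API id
    "max_tokens": 500,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this block as a cache breakpoint: identical prefixes in
            # later requests are billed at the cheaper cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "What are your opening hours?"}],
}
```

The caching only pays off when the cached prefix is byte-for-byte identical across requests, so keep dynamic content (user data, timestamps) out of the cached portion.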
Before building an AI application, it is wise to make a cost estimate. How many conversations do you expect per day? What is the average length of a conversation? How long are your system prompt and context?
With those inputs you can calculate how many tokens you consume per day and what that costs. Most providers have a tokeniser tool where you can enter text and see the token count.
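The estimate described above can be written out as a short calculation. All figures below are assumptions for illustration: 500 conversations per day, 6 turns each, 1,200 input tokens and 300 output tokens per turn, and made-up prices per million tokens.

```python
def daily_cost(conversations_per_day: int, turns_per_conversation: int,
               input_tokens_per_turn: int, output_tokens_per_turn: int,
               input_price: float, output_price: float) -> float:
    """Estimated daily cost; prices are given per million tokens."""
    turns = conversations_per_day * turns_per_conversation
    cost_in = turns * input_tokens_per_turn * input_price / 1_000_000
    cost_out = turns * output_tokens_per_turn * output_price / 1_000_000
    return cost_in + cost_out

# Example: 500 conversations/day, 6 turns each, illustrative prices.
print(round(daily_cost(500, 6, 1_200, 300, 3.0, 12.0), 2))  # → 21.6
```

Note how input and output contribute equally here despite output being only a quarter of the volume: the 4x output price cancels out the smaller token count. Multiply by 30 for a monthly figure, and rerun the calculation with a smaller model's prices to see how much task routing would save.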
Understanding token usage is essential for anyone building AI applications at scale. Costs are manageable with the right architectural choices. Mach8 helps organisations design AI systems that not only work well, but are also cost-efficient.
Want to build a cost-efficient AI application? Get in touch with Mach8.
We help you go from strategy to implementation. Schedule a no-obligation call.
Schedule a call