Every AI API has limits: maximum calls per minute, maximum tokens per day, maximum concurrent connections. Ignoring those limits means your application will fail at the most inconvenient moment.
Rate limiting is one of the most underestimated challenges when building AI applications. During development it barely matters. In production, when dozens or hundreds of users are making calls simultaneously, it becomes a serious architectural question. This article explains how to handle it well.
Rate limiting means that an API provider restricts the number of calls you can make within a given time period. AI APIs typically enforce limits at several levels at once: requests per minute (RPM), tokens per minute or per day (TPM/TPD), and the number of concurrent connections.
OpenAI, Anthropic, and other providers apply all of these limits, and they vary by subscription tier. If you exceed a limit, you receive a 429 Too Many Requests error.
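When a 429 does occur, the standard reaction is to back off and retry with increasing delays. A minimal sketch, assuming a hypothetical `call_model()` function that raises a `RateLimitError` (your client library's actual exception name will differ):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your client library raises."""

def call_with_backoff(call_model, max_retries=5, base_delay=1.0):
    """Retry a rate-limited call with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call_model()
        except RateLimitError:
            # Wait 1s, 2s, 4s, ... plus random jitter so that many
            # clients do not all retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
    raise RuntimeError(f"Still rate-limited after {max_retries} retries")
```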
Throttling means proactively slowing down your calls so you never reach the provider's limits. This is better than waiting for an error and then retrying.
A simple approach is to maintain a counter that tracks how many calls you have made in the past minute and introduce a small delay as you approach the threshold. Libraries like bottleneck (Node.js) or ratelimiter (Python) do this for you.
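A minimal sketch of that counter approach, tracking call timestamps over a sliding 60-second window (the 60-call limit is an illustrative number, not any provider's actual quota):

```python
import time
from collections import deque

class CallThrottle:
    """Sleep just long enough to stay under max_calls per window."""

    def __init__(self, max_calls=60, window=60.0):
        self.max_calls = max_calls
        self.window = window
        self.timestamps = deque()

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_calls:
            # Sleep until the oldest call leaves the window.
            time.sleep(self.window - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())
```

Call `throttle.wait()` immediately before each API call; it returns instantly while you are under the limit and blocks only when you would exceed it.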
Token throttling is more complex: you need to track how many tokens each prompt and completion costs, which depends on the model's tokenizer. Most API responses include usage information you can use to track your consumption.
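A sketch of that bookkeeping. The per-minute budget is illustrative, and the comment assumes an OpenAI-style response with a `usage.total_tokens` field; field names vary per provider:

```python
class TokenBudget:
    """Track token spend against a per-minute budget."""

    def __init__(self, tokens_per_minute=90_000):
        self.budget = tokens_per_minute
        self.spent = 0

    def record(self, total_tokens):
        # Feed in the usage reported on each response,
        # e.g. response.usage.total_tokens on OpenAI-style APIs.
        self.spent += total_tokens

    def remaining(self):
        return self.budget - self.spent

    def reset(self):
        """Call once per minute, e.g. from a timer or scheduler."""
        self.spent = 0
```

Before each call, check `remaining()` against an estimate of the prompt size and delay the call if the budget is nearly exhausted.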
When you make many calls in parallel — when batch-processing content or serving many users simultaneously — a queue is the more robust solution. You add tasks to a queue, and a worker processes them at a controlled pace that stays within the API limits.
Popular options are Redis with BullMQ (Node.js) or Celery (Python). This also gives you retry capabilities when a task fails, and you can assign priorities to urgent tasks.
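The same pattern in miniature, using Python's asyncio instead of a full Celery or BullMQ setup: a queue feeds a fixed pool of workers, so concurrency never exceeds the pool size. The `call_api` coroutine is a placeholder for your real API call:

```python
import asyncio

async def call_api(task):
    """Placeholder for the real API call."""
    await asyncio.sleep(0.1)
    return f"result for {task}"

async def worker(queue, results):
    while True:
        task = await queue.get()
        try:
            results.append(await call_api(task))
        finally:
            queue.task_done()

async def main(tasks, num_workers=3):
    queue, results = asyncio.Queue(), []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(num_workers)]
    for t in tasks:
        queue.put_nowait(t)
    await queue.join()   # Wait until every task has been processed.
    for w in workers:
        w.cancel()       # Workers loop forever; stop them explicitly.
    return results

# asyncio.run(main(["a", "b", "c"]))
```

A dedicated queue system adds what this sketch lacks: persistence across restarts, automatic retries, and priorities.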
Many AI calls are actually duplicates: the same question is asked by multiple users, or the same batch job processes overlapping data. With a cache you store the response of a call and return it for the same input without calling the API again.
Note: caching is only useful for deterministic or near-deterministic calls. With temperature set to 0 and an unchanged prompt, the output is effectively identical; at higher temperatures the output varies and caching is less effective.
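A sketch of such a cache, keyed on a hash of the model, prompt, and temperature so any change in the input triggers a fresh call. The in-memory dict and the `call_model` function are placeholders; in production you would typically back this with Redis or similar:

```python
import hashlib
import json

_cache = {}

def cached_call(call_model, model, prompt, temperature=0.0):
    """Return a cached response for identical (model, prompt, temperature)."""
    key = hashlib.sha256(
        json.dumps([model, prompt, temperature]).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model=model, prompt=prompt,
                                 temperature=temperature)
    return _cache[key]
```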
If your application serves multiple users, it is wise to set individual limits per user. This prevents one heavy user from consuming all quota at the expense of others. You can set limits based on subscription tier, usage history, or a fair-use policy.
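A per-user limit is a small extension of the counter idea: keep one token bucket per user ID. A sketch with illustrative limits:

```python
import time
from collections import defaultdict

class PerUserLimiter:
    """Simple token-bucket limiter, one bucket per user."""

    def __init__(self, calls_per_minute=10):
        self.rate = calls_per_minute / 60.0   # refill per second
        self.capacity = calls_per_minute
        self.buckets = defaultdict(
            lambda: {"tokens": self.capacity, "last": time.monotonic()}
        )

    def allow(self, user_id):
        bucket = self.buckets[user_id]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        bucket["tokens"] = min(
            self.capacity,
            bucket["tokens"] + (now - bucket["last"]) * self.rate,
        )
        bucket["last"] = now
        if bucket["tokens"] >= 1:
            bucket["tokens"] -= 1
            return True
        return False
```

Check `allow(user_id)` at the start of each request and return a friendly "slow down" message instead of passing the 429 through to the user.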
Track how many tokens and calls you consume per day. Most providers offer a dashboard, but it is better to also track this yourself in your own logging. This lets you see trends, flag anomalies, and know when to increase your limits.
Set alerts when you reach 80% of your daily quota. That gives you time to adjust before hitting a hard limit.
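The alerting logic itself is a few lines in your own monitoring code. The quota below is illustrative, and `send_alert` is a placeholder for whatever alerting channel you use (Slack, PagerDuty, email):

```python
import logging

DAILY_TOKEN_QUOTA = 1_000_000   # Illustrative; use your actual tier limit.
ALERT_THRESHOLD = 0.8

def check_quota(tokens_used_today, send_alert):
    usage = tokens_used_today / DAILY_TOKEN_QUOTA
    logging.info("Daily token usage: %.0f%%", usage * 100)
    if usage >= ALERT_THRESHOLD:
        send_alert(f"Token usage at {usage:.0%} of daily quota")
```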
Rate limiting and throttling are not secondary concerns but fundamental aspects of a stable AI application. At Mach8, we account for API limits in every project and build queues and monitoring so our clients are never caught off guard.
Want to know how Mach8 builds scalable AI systems? View our AI agents service or get in touch.
We help you go from strategy to implementation. Schedule a no-obligation call.