Waiting for an AI answer that takes seconds or longer feels slow. With streaming you see the output appearing as the model writes. That changes the user experience significantly, but also introduces technical challenges.
Most LLMs generate text token by token. Without streaming you wait until the model has finished generating before you see a single letter. With streaming every token is forwarded as soon as it is available. For the user the difference is noticeable: instead of a blank page followed by a block of text, they see the words appearing as the model writes.
Streaming uses Server-Sent Events (SSE) or WebSockets to send data in small chunks from server to client. Most AI APIs support streaming via a simple parameter: with both OpenAI and Anthropic you set stream: true in the API call.
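A minimal sketch with OpenAI's Node SDK (the model name and prompt are just examples):

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// With stream: true the API returns an async iterable of chunks
// instead of a single completed response.
const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Explain streaming in one paragraph." }],
  stream: true,
});

for await (const chunk of stream) {
  // Each chunk carries a small delta of the generated text.
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```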
The server sends tokens back as a stream of data packets. The client listens to that stream and appends each packet to the display. When the model is done, the server sends an end signal.
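A sketch of that server side, assuming a Node.js backend with Express that forwards the model's deltas as SSE messages and closes with a [DONE] marker (the marker is a convention, not a requirement; OpenAI's own stream happens to use the same one):

```typescript
import express from "express";
import OpenAI from "openai";

const app = express();
const openai = new OpenAI();

app.post("/chat", express.json(), async (req, res) => {
  // SSE headers: keep the connection open and discourage buffering.
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: req.body.messages,
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) res.write(`data: ${JSON.stringify({ delta })}\n\n`);
  }

  // End signal so the client knows the model is done.
  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```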
In the browser you use the Fetch API with ReadableStream, or a library that abstracts this away. In a Node.js backend you take a similar approach, but with a streaming server response.
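A sketch of the browser side, consuming the endpoint above with the Fetch API; the appendToDisplay callback is a hypothetical UI hook:

```typescript
async function streamChat(prompt: string, appendToDisplay: (text: string) => void) {
  const response = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: [{ role: "user", content: prompt }] }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE messages are separated by a blank line; a partial
    // message stays in the buffer until the rest arrives.
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? "";

    for (const frame of frames) {
      const data = frame.replace(/^data: /, "");
      if (data === "[DONE]") return; // end signal from the server
      appendToDisplay(JSON.parse(data).delta);
    }
  }
}
```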
The biggest advantage of streaming is perceived speed. Even if the total generation time stays the same, streaming feels faster because the user gets immediate feedback. Research on perceived performance consistently finds that an interface showing progressive output is rated as faster and more pleasant than a long wait followed by the result appearing all at once.
A second advantage is that users can intervene earlier. If the model is heading in the wrong direction, the user can stop the generation without waiting for the model to finish.
Streaming is not always the best approach. It is better to wait for the full response when the output has to be processed as a whole before it can be shown: think of structured output such as JSON that must first be parsed and validated, text that has to pass a moderation or quality check, or results that go to another system instead of straight to a human reader.
Streaming introduces a few technical challenges: connections can drop halfway through a generation and need to be handled gracefully, proxies and load balancers may buffer the response and undo the effect of streaming, chunks can arrive split across network packets and must be reassembled correctly, and an error mid-stream leaves you with partial output to deal with.
In a React application you typically use a state variable that you update with each incoming token. Libraries such as the ai package from Vercel, or the streaming helpers in the official AI SDKs, simplify this considerably.
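A minimal sketch of that pattern, reusing the streamChat helper from earlier; the functional setAnswer update appends each delta to the latest state, so rapid-fire tokens are not lost between renders:

```tsx
import { useState } from "react";

// streamChat is the client-side helper sketched above.
function ChatAnswer() {
  const [answer, setAnswer] = useState("");

  async function ask(prompt: string) {
    setAnswer("");
    await streamChat(prompt, (delta) => setAnswer((prev) => prev + delta));
  }

  return (
    <div>
      <button onClick={() => ask("Explain streaming")}>Ask</button>
      <p>{answer}</p>
    </div>
  );
}
```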
Provide an indication that the model is still generating — a blinking cursor or a loading indicator — so the user knows more is coming. Also offer the ability to stop the generation.
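Stopping can be wired up with an AbortController (a sketch; the #stop button is assumed to exist in the page):

```typescript
const controller = new AbortController();

// Clicking the stop button aborts the fetch, which closes the
// connection and ends the stream on the client side.
document.querySelector("#stop")?.addEventListener("click", () => controller.abort());

const response = await fetch("/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: [{ role: "user", content: "Hello" }] }),
  signal: controller.signal,
});
```

Note that abort() only stops the stream on the client. Let the server listen for the closed connection (in Express: req.on("close", ...)) and cancel the upstream API call, so you stop paying for tokens nobody will see.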
Streaming does not change the cost: you still pay per token. Streaming can slightly increase server load because connections are kept open longer. At high concurrency this is something to consider for your infrastructure capacity.
Streaming is a relatively simple improvement that significantly enhances the user experience of AI applications. At Mach8 we implement streaming as standard in user-facing AI interfaces, because it makes the interaction with AI considerably more pleasant.
Want to build an AI application with a great user experience? View our AI agents service or get in touch.
We help you go from strategy to implementation. Schedule a no-obligation call.