In this blog, we will explore:
- What the Realtime API Does
- Why the Realtime API Matters for Voice AI
- Key Features of the Realtime API
- How Do Realtime Voice Agents Connect to Applications and Phone Systems?
- How Realtime AI Voice Agents Are Deployed
- Conversation Flow Design for AI Voice Agents
- Limitations of the Realtime API
The OpenAI Realtime API is now production-ready, with a number of new features compared to the previous preview model.
This production-ready model adds:
- MCP support
- Image and video input
- SIP integration, which means it can plug into internet-based phone calling systems
- A 20% lower price than the previous model
- Better instruction following
- New, more expressive voices
The Realtime API is a powerful development that can change voice AI in a major way. It enables low-latency, multimodal conversational experiences. In simple terms, it makes it possible to hold conversations in both text and audio through a single API and to build applications that switch between them seamlessly and extremely fast.
What the Realtime API Does
What makes the Realtime API so powerful is that it removes much of what has traditionally been called the voice orchestration layer.
In current AI-based call center setups, the voice orchestration layer consists of a speech-to-text system that transcribes the caller's audio, an LLM that generates a textual response, and a text-to-speech system that converts that response back into audio. All of this has to happen in a very short amount of time, and latency has to be carefully optimized for the conversation to feel like talking to an actual human.
With the Realtime API, a more powerful system is in place that goes directly from speech to speech. It cuts out the steps of translating audio to text and back, which saves quite a bit of time.
Not just that, it can also provide better throughput. Another important part is better emotional understanding: when interactions do not have to pass through text first, cues like emotion are less likely to get lost, so the AI can interpret them better than a traditional voice orchestration layer can.
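The pipeline being replaced can be sketched in a few lines. Everything here is a placeholder (a hypothetical transcriber, model, and synthesizer), shown only to make the three hops and the text bottleneck concrete:

```python
def speech_to_text(audio: bytes) -> str:
    # Stage 1: transcribe the caller's audio (placeholder service).
    return "what are your opening hours"

def llm_response(text: str) -> str:
    # Stage 2: generate a textual reply (placeholder service).
    return "We are open 9am to 5pm, Monday to Friday."

def text_to_speech(text: str) -> bytes:
    # Stage 3: synthesize the reply into audio (placeholder service).
    return text.encode("utf-8")

def classic_pipeline(audio: bytes) -> bytes:
    # Three hops, each adding latency, and the text bottleneck in the
    # middle strips vocal cues like tone and emotion.
    return text_to_speech(llm_response(speech_to_text(audio)))
```

The Realtime API collapses all three stages into a single speech-to-speech model call, which is where both the latency and the emotional-nuance gains come from.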
Why the Realtime API Matters for Voice AI
This is obviously incredibly powerful for the future of voice AI.
A system like this can lead to even better experiences and software being built in the voice AI space. It can make interactions more natural, more empathetic, and a lot better for people who have never interacted with voice systems before. Because of the speed and flow of conversation, it can sound and feel even more natural and even more human-like.
The Realtime API is especially important because it supports low latency conversational experiences, which is one of the biggest things that affects how natural a voice interaction feels.
Key Features of the Realtime API
The key features include:
- Multimodal capabilities
- MCP support
- Expressive voices
- Speed and performance
- Integration possibilities
- Tool calling
Real-Time Multimodal Capabilities
A basic browser agent can capture a photo from a webcam or take a screenshot of a browser tab, ingest it into the conversation, have the agent process that image in real time, and then respond about it.
This kind of application can be deployed in the browser using WebRTC as the structure and facilitator for the agent interaction. Just as it can be deployed in a desktop browser, it can also be deployed on a mobile phone: the same voice agent can be plugged into a phone using the same WebRTC framework, with microphone and camera permissions.
The agent can also work in a browser assistant mode, where a browser screen is shared and the agent describes what it sees. Images can even be sent into the conversation without clicking a button: with a capture button, an image is sent automatically and a tag indicates that it was sent, while in a browser assistant setup a trigger phrase can make the agent listen for the event, dynamically take a screenshot, and send it into the conversation.
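As a rough sketch, injecting a captured screenshot into a live session might look like the following. The event shape follows the Realtime API's `conversation.item.create` pattern, but the exact field names should be treated as assumptions and checked against the current docs:

```python
import base64
import json

def build_image_event(image_bytes: bytes) -> str:
    # Encode the capture as a data URL so it can travel inside the event.
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    event = {
        "type": "conversation.item.create",  # assumed event name
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_image", "image_url": data_url}],
        },
    }
    return json.dumps(event)  # string sent over the WebRTC data channel
```

In a browser assistant setup, the trigger phrase would simply cause this event to be built from a fresh screenshot and pushed into the session.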
Asynchronous MCP Calling
One of the features of this model is asynchronous MCP calling.
Previously, there was a need to wait for the API call to complete before the agent would respond and speak. Now the API call or MCP call can be triggered in the background, and the conversation can continue while the result is processed. A few seconds later, the result can be requested. This makes the conversation very fluid.
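The pattern can be sketched with plain `asyncio`: the slow call runs in the background while the agent keeps talking, and its result is collected a moment later. The function names here are illustrative, not part of any SDK:

```python
import asyncio

async def slow_mcp_call() -> str:
    # Stands in for a multi-second MCP or API request.
    await asyncio.sleep(0.1)
    return "order #1234 shipped yesterday"

async def conversation_turn() -> str:
    task = asyncio.create_task(slow_mcp_call())  # fire in the background
    filler = "Let me check that for you..."      # agent keeps speaking
    result = await task                          # collect the result later
    return f"{filler} {result}"
```

Because the call is non-blocking, the agent never has to sit in awkward silence while a backend system responds.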
Expressiveness, Tone, and Pronunciation
The production-ready model is meant to be more expressive, bringing it closer to dedicated highly expressive voice systems.
It also includes personality and tone settings. Comparing the previous and new models on the same prompting suggestions shows that the new model performs better, and is cheaper as well.
Reference pronunciations are another useful feature. Certain words can be phonetically described so the model knows how they should sound. This improves pronunciation quality.
Retail also has this natively built into its AI caller settings.
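One plausible way to supply reference pronunciations is simply to spell them out in the session instructions. The phonetic hints below are an assumed prompt style, not an official syntax, and the example terms are illustrative:

```python
# Assumed prompt-style pronunciation guide appended to the instructions.
PRONUNCIATION_GUIDE = """\
When speaking, pronounce these terms as follows:
- "Nginx" as "engine-ex"
- "SQL" as "sequel"
- "Vapi" as "vah-pee"
"""

session_update = {
    "type": "session.update",  # event that reconfigures a live session
    "session": {
        "instructions": "You are a friendly support agent.\n" + PRONUNCIATION_GUIDE,
    },
}
```

Keeping the guide in one constant makes it easy to reuse across agents that handle the same product vocabulary.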
Speed and Performance
A major benefit of the Realtime API is speed.
When used in a direct web-based setup, it can be incredibly fast. That is one of the reasons it stands out so much compared to older approaches. The reduction in processing steps allows much faster back-and-forth communication.
However, speed in production also depends on the setup. Once phone systems, cloud communication platforms, or other providers are involved, there can be extra delay because more APIs and systems are interacting with each other. So while it is still very fast, there may still be more delay compared to a simple direct web interaction.
Even then, it can still be faster than many traditional voice orchestration approaches.
Integration Possibilities
The Realtime API can also be integrated with cloud communication platforms and custom servers.
That means it can be connected into broader voice systems and telephony environments. However, doing that directly can require a lot of technical work. There are many things that have to be accounted for when building those systems from scratch, which is why existing platforms that already handle those details still remain valuable.
So while direct integration is possible, ease of implementation depends heavily on the technical setup and the tools being used.
Tool Calling
Another powerful capability is tool calling.
Tool calling is basically the possibility of interacting with the real world during a voice conversation. It allows systems to include tools that perform actions, retrieve information, or support specific use cases during the conversation.
Not just that, extra custom or transient-based tools can also be added into responses for a specific use case. That means capabilities can be tailored depending on the situation, the user type, or the business logic behind the interaction.
This is an incredibly powerful feature because it moves voice AI beyond simple conversation and into actually doing useful work in real time.
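As an illustration, a tool might be declared for the session like this. The JSON-schema shape mirrors OpenAI function calling, and the `look_up_order` tool itself is hypothetical:

```python
# Hypothetical tool: look up an order's status mid-call.
look_up_order_tool = {
    "type": "function",
    "name": "look_up_order",
    "description": "Fetch the status of a customer's order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "e.g. ORD-1042"},
        },
        "required": ["order_id"],
    },
}

# Registering the tool on the session makes it callable during the call.
session_update = {
    "type": "session.update",
    "session": {"tools": [look_up_order_tool]},
}
```

Transient tools follow the same shape; they would simply be attached to an individual response rather than the whole session.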
How Realtime AI Voice Agents Are Deployed
One of the best use cases for this model is an AI voice agent or a voice caller. There are three different ways to deploy a voice agent.
SIP Integration
The first way is SIP integration. This route is fully custom coded: features like tool calling, MCP, web search, file search, or attaching a RAG database to the agent all have to be built by hand.
The downside is that coding knowledge is required, or at least coding with AI while reading the documentation. The upside is ultimate control, the lowest latency, and the best performance.
Cloud Communication Platform Integration
The second method is for landline calls. If there is a business phone number and customers need to call that phone number and plug into a realtime agent, something like a cloud communication platform can be used.
The cloud communication platform facilitates the conversation and the connection between the realtime API and the actual phone number. This also requires custom code, so integrations must be custom coded.
Voice Agent Platform Deployment
The third and simplest way to deploy the voice agent is by using voice agent platforms.
Voice agent platforms like Vapi AI, Biglysales, and Bland AI provide a configuration panel that makes setup very easy. The prompt can be entered directly, and there are options to choose the model, i.e. the brain of the system.
The production ready realtime API model can be selected, and other AI models are also available. Different voices can be chosen, including built-in voices and other third-party provider voices.
This is a really easy route for anyone who does not want to write code to deploy a realtime voice agent. The latency is superb, the system is very quick, and similar latency can be achieved when using these platforms.
These platforms also have a bunch of integrations such as function calling, RAG knowledge base, realtime transcription, webhooks, and MCPs. These integrations are platform-based and are created as infrastructure for voice callers.
That means any other model can still have MCP support because that is part of the platform infrastructure.
The upside is that it is super easy to set up. The downside is that latency might increase and performance might be a little bit lower because the platform infrastructure is being used. However, it is much easier and much simpler to set up and maintain for people who do not know how to code.
How Do Realtime Voice Agents Connect to Applications and Phone Systems?
The main thing to know is the three ways these systems connect.
WebRTC
WebRTC is one of the easiest setups. It is used for browser-based speech-to-speech voice applications: a mobile phone application and a web browser would both use WebRTC. It is essentially the protocol for communication between the browser and the agent, and a very simple way to set agents up.
One thing needed in this framework is an ephemeral token. The ephemeral token is essentially a way to authenticate a conversation with the agent, since the API key should not be exposed in the browser. It acts as a security system.
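A server-side sketch of that step might look like this. The endpoint path and field names are assumptions based on the Realtime session docs and should be verified before use; the point is that the real API key never leaves the server:

```python
def build_token_request(api_key: str, model: str, voice: str) -> dict:
    # Assembles (but does not send) the server-side HTTP call that mints
    # a short-lived client secret for the browser.
    return {
        "url": "https://api.openai.com/v1/realtime/sessions",  # assumed path
        "headers": {
            "Authorization": f"Bearer {api_key}",  # key stays on the server
            "Content-Type": "application/json",
        },
        "json": {"model": model, "voice": voice},
    }
```

The response's short-lived client secret is what the browser then uses to open the WebRTC connection, so even if a user inspects the page, the long-lived API key is never exposed.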
WebSockets
WebSockets are good for server-to-server applications. If there is a third-party app that needs to be integrated into some realtime experience, this is where WebSockets would be used.
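A minimal sketch of the server-to-server handshake, assuming the documented `wss://api.openai.com/v1/realtime` endpoint (verify the URL and headers against the current docs):

```python
def websocket_connect_params(api_key: str, model: str) -> dict:
    # Server-to-server: the long-lived API key is safe to use here
    # because this code never runs in an end user's browser.
    return {
        "url": f"wss://api.openai.com/v1/realtime?model={model}",
        "headers": {"Authorization": f"Bearer {api_key}"},
    }
```

A backend would pass these parameters to any WebSocket client library and then exchange session events over the open connection.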
SIP integration
SIP integration allows realtime AI systems to connect with internet-based phone calling systems. This is especially important for voice agents that operate inside telephony environments such as call centers.
Conversation Flow Design for AI Voice Agents
Conversation Flow Design refers to structuring the steps and decision paths of a conversation so the AI voice agent knows how to guide and respond to callers during a call. It can be done via:
- State machine approach
- Flow-based AI agents
Conversation Flow as a State Machine
Another useful concept when working with the realtime API is conversation flow as a state machine. While the realtime API handles the real-time voice conversation, the state machine helps control how the conversation is structured and how it moves from one stage to another.
This approach uses a JSON structure to manage the states and transitions between different parts of the conversation.
At a high level, there are different conversation states. The first state might be the greeting. In this stage, the AI voice agent may introduce the company and let the caller know that help is available. The realtime API manages the real-time speech interaction, while the state machine defines the goal and instructions for that stage.
Each state has a specific goal. For example, the greeting state may focus on welcoming the caller and starting the conversation clearly.
Then there are transitions that determine how the system moves from one stage to the next. For example:
- once the greeting is complete, the system moves to step two
- if the customer’s first name is obtained, it moves directly to step three
The realtime API continues handling the live voice interaction, while the state machine decides which step the conversation should move to next.
This same JSON structure continues throughout the conversation, with each step having a different ID, a different goal, and a different set of instructions and examples.
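A small runnable version of that idea, with illustrative state names and events:

```python
# Each state has a goal and transitions keyed on what happened in the call.
CALL_FLOW = {
    "greeting": {
        "goal": "Welcome the caller and introduce the company.",
        "transitions": {
            "greeting_done": "collect_name",
            "name_already_given": "collect_issue",
        },
    },
    "collect_name": {
        "goal": "Ask for the caller's first name.",
        "transitions": {"name_given": "collect_issue"},
    },
    "collect_issue": {
        "goal": "Find out why the caller is calling.",
        "transitions": {},
    },
}

def next_state(current: str, event: str) -> str:
    # The Realtime API handles the live audio; this function only decides
    # which stage's goal and instructions to load next.
    return CALL_FLOW[current]["transitions"].get(event, current)
```

Each time a transition fires, the new state's goal and instructions would be pushed into the session, while the audio conversation itself keeps flowing uninterrupted.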
Flow-Based Agent Design
Flow-based agent design follows a similar idea. These flows work like flowcharts or conditional pathways where specific blocks are responsible for tasks such as asking for the customer’s name or collecting information.
Depending on the caller’s response, the system can follow different routes. For example, if the customer’s name is available the conversation may move to the next step, while a missing response may trigger another question.
The realtime API keeps the conversation running smoothly in real time, while the flow logic guides the agent through the correct steps of the call.
This conditional logic approach helps move sequentially through the conversation or process for the phone call and leads to higher accuracy in AI voice interactions.
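A tiny runnable sketch of the flow idea, with illustrative block names: each block has a prompt, and a condition on the caller's answer picks the next route:

```python
# Flowchart blocks: a prompt to speak, and a routing rule on the answer.
FLOW = {
    "ask_name": {
        "prompt": "May I have your first name?",
        "route": lambda answer: "ask_issue" if answer else "ask_name",
    },
    "ask_issue": {
        "prompt": "How can I help you today?",
        "route": lambda answer: "wrap_up",
    },
}

def step(block_id: str, caller_answer: str) -> str:
    # The Realtime API speaks the prompt; this logic picks the next block.
    return FLOW[block_id]["route"](caller_answer)
```

A missing answer simply routes back to the same block, which is exactly the "ask again" behavior described above.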
Limitations of the Realtime API
The limitations of the Realtime API include a higher initial cost and the need for technical knowledge to set it up properly.
Requires a Technical Background to Set Up
The Realtime API is powerful, but it is not necessarily easy to set up.
Right now, it is still very technical. Documentation may still be evolving, and there can be missing details that make setup more difficult. For people who do not fully know how to code, going deep into implementation can be very hard.
So while it is great to get started with, it still requires technical understanding, especially for more advanced use cases like tool calling, integrations, or custom production workflows.
Cost Considerations
As of now, the Realtime API can be more expensive than older voice orchestration setups. That means cost can become an important factor depending on the use case. For businesses where price is the main concern, it may make sense to start with more affordable existing options.
But for businesses that value quality, speed, emotional understanding, and a better overall conversational experience, paying a few cents more may still be worth it.
A better solution that responds faster and handles emotions better can lead to higher throughput, better conversion, and stronger long-term outcomes. That means the higher cost may be offset by better business performance or greater efficiency over time.
And like many other AI technologies, prices will most likely come down over time as the technology improves and becomes more established.
FAQs about Realtime API
1. Why Is the Realtime API Still Technical to Implement?
Right now, it is still very technical. Documentation may still be evolving, and there can be missing details that make setup more difficult. For people who do not fully know how to code, going deep into implementation can be very hard.
2. How Does Asynchronous MCP Calling Improve AI Voice Conversations?
Previously, there was a need to wait for the API call to complete before the agent would respond and speak. Now the API call or MCP call can be triggered in the background, and the conversation can continue while the result is processed. A few seconds later, the result can be requested. This makes the conversation very fluid.
3. What are the key features of Realtime API?
The key features include multimodal capabilities, MCP support, expressive voices, speed and performance, integration possibilities, and tool calling.