Summary:
- OpenAI launched gpt-realtime, its most advanced speech-to-speech model, and made the Realtime API generally available with new features such as SIP telephony support.
- The update threatens many voice AI startups by commoditizing voice interfaces, especially startups that rely on basic telephony integrations without deep telco expertise.
- The model offers faster responses, better emotion recognition, and handles complex conversations, but comes with high costs and less control than chained models.
- T-Mobile is already using gpt-realtime for customer support and is seeing improvements in handling device upgrades and multimodal inputs.
- Experts warn that startups need to differentiate or specialize in advanced integrations to survive in the evolving AI landscape.
OpenAI has released its most advanced speech-to-speech model yet, gpt-realtime, alongside the general availability of its Realtime API with new capabilities. This move aims to empower enterprises and developers to build production-ready voice agents, particularly for scenarios like customer support.
Key Features and Implications
The Realtime API now supports image inputs and remote MCP (Model Context Protocol) servers, extending what voice agents can do. A standout addition is Session Initiation Protocol (SIP) telephony support. As Peter Bakkum, Member of Technical Staff at OpenAI, stated in the announcement video:
We've added support for SIP telephony, which makes it much easier to build applications for voice-over-phone situations like customer support.
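For developers evaluating the API, the following is a minimal sketch of opening a Realtime session over WebSocket and configuring it with a session.update event. It shows the generally documented connection pattern rather than the new SIP flow, and the voice and session fields here are assumptions; exact field names may differ between the beta and GA versions of the API.

```python
# Minimal sketch: opening a Realtime API session over WebSocket and steering
# the agent via session.update. The endpoint and event names follow OpenAI's
# published WebSocket interface; exact session fields may differ between the
# beta and GA versions of the API, so treat this as illustrative.
import asyncio
import json
import os

import websockets  # pip install websockets


async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

    # websockets>=13 uses additional_headers; older releases use extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Instructions steer pace, tone, and style for the whole session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a friendly support agent. Speak "
                                "briskly and read digits one at a time.",
                "voice": "alloy",
            },
        }))
        # Print incoming server event types (session.updated, audio deltas, ...).
        async for message in ws:
            print(json.loads(message)["type"])


asyncio.run(main())
```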
This development poses a significant threat to many conversational AI startups. Andreas Granig, CEO at Sipfront, highlighted in a LinkedIn post that startups relying on basic telephony interfaces without deep telco expertise are now at risk, as the voice interface for AI assistants has become a commodity.
Advantages of the gpt-realtime Model
OpenAI designed gpt-realtime for real-world scenarios like customer support and academic tutoring. It enables AI agents to understand and produce audio directly, without chaining separate transcription, language, and voice models, which yields faster responses and better capture of subtleties such as laughter or sighs (the chained alternative is sketched below for contrast). The model delivers more natural, high-quality audio and handles complex, multi-turn conversations effectively. Developers can adjust the agent's pace, tone, and style, and even have it roleplay characters, and the model copes well with unclear audio and long alphanumeric strings, such as phone numbers.
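To see what that chaining looks like in practice, here is a minimal sketch of the traditional pipeline, using calls from the openai Python SDK with illustrative model names. Note that emotional cues such as laughter are flattened to plain text at the transcription step, which is precisely what a single speech-to-speech model avoids.

```python
# Minimal sketch of a chained voice pipeline (speech-to-text -> LLM ->
# text-to-speech), the approach gpt-realtime collapses into a single model.
# Model names are illustrative; swap in whichever models you deploy.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chained_turn(audio_path: str) -> bytes:
    # 1. Transcribe the caller's audio; tone and emotion are lost here.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Generate a text reply from the transcript alone.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a customer-support agent."},
            {"role": "user", "content": transcript.text},
        ],
    )

    # 3. Synthesize speech from the reply text.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.read()
```

Each hop adds latency and a lossy conversion, which is why the single-model approach responds faster and preserves nuance.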
Cost and Control Considerations
Despite its benefits, the model is expensive: $32 per 1M audio input tokens ($0.40 per 1M cached input tokens) and $64 per 1M audio output tokens. Alex Levin, CEO at Regal, noted that this is roughly four times the cost of a chained pipeline (speech-to-text, LLM, text-to-speech). There are also concerns about limited control and observability compared to chained setups, which let developers swap models, voices, and guardrails mid-conversation.
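As a back-of-the-envelope check on those rates, the snippet below prices a hypothetical conversation; the token counts are invented for illustration, and real costs depend on audio length and cache hit rates.

```python
# Pricing a hypothetical gpt-realtime conversation at the published rates.
# Token counts below are made up for illustration only.
INPUT_RATE = 32.00 / 1_000_000   # $ per audio input token
CACHED_RATE = 0.40 / 1_000_000   # $ per cached audio input token
OUTPUT_RATE = 64.00 / 1_000_000  # $ per audio output token

input_tokens, cached_tokens, output_tokens = 50_000, 20_000, 40_000

cost = (input_tokens * INPUT_RATE
        + cached_tokens * CACHED_RATE
        + output_tokens * OUTPUT_RATE)
print(f"${cost:.2f}")  # 1.60 + 0.01 + 2.56 -> $4.17
```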
Real-World Application: T-Mobile's Use Case
T-Mobile has been testing OpenAI's models for six months and recently started using gpt-realtime with the Realtime API, reporting huge improvements. Julianne Roberson, Director of AI at T-Mobile, demonstrated how the AI assistant guides customers through processes like device upgrades, handling unpredictable conversations, recognizing emotions, and managing multimodal inputs. This aligns with T-Mobile's goal to provide expert-level service everywhere with AI, potentially accelerating trends toward automated customer service.