OpenAI’s new voice AI model gpt-4o-transcribe lets you add speech to existing text apps within seconds



OpenAI’s voice AI models may have gotten it into trouble with Scarlett Johansson in the past, but that isn’t stopping the company from continuing to advance its offerings in this area.

The ChatGPT maker has unveiled three new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. The models are initially available through its application programming interface (API) for third-party developers to build into their own apps, as well as on a dedicated demo site, OpenAI.fm, where individual users can test and play with them.
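For developers, calling the new transcription models should feel much like calling Whisper did. The snippet below is a minimal sketch, assuming gpt-4o-transcribe is served through the OpenAI Python SDK's existing audio transcription endpoint; the file name is a placeholder.

```python
# Minimal sketch: transcribing an audio file with gpt-4o-transcribe.
# Assumes the new model is exposed through the same audio.transcriptions
# endpoint the SDK already uses for Whisper; the file name is a placeholder.
# Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)
```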

The gpt-4o-mini-tts voice model can be customized via text prompts to change its accent, pitch, tone and other vocal qualities, and it can convey whatever emotions the user requests. This should help dispel concerns that OpenAI is deliberately imitating any particular person’s voice (the company denied this was the case with Johansson but pulled the ostensibly imitative voice option anyway). Now it’s up to the user to decide how they want their AI to sound when it speaks back.
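In practice, that steering happens through a plain-text description of the delivery passed alongside the script to be spoken. A minimal sketch follows, assuming gpt-4o-mini-tts accepts an `instructions` field through the SDK's speech endpoint; the voice name and wording are illustrative, not taken from OpenAI's docs.

```python
# Sketch: steering gpt-4o-mini-tts with a text prompt describing the delivery.
# The `instructions` field and the voice name are assumptions based on the
# announcement; check the API reference for the exact parameters.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the preset voices
    input="Your order has shipped and should arrive on Thursday.",
    instructions="Speak like a calm, reassuring yoga teacher, slowly and warmly.",
) as response:
    response.stream_to_file("order_update.mp3")
```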

Jeff Harris, a member of OpenAI’s technical staff, showed VentureBeat over a video call how, using only text on the demo site, a user can get the same voice to sound like a cackling mad scientist or a calm, zen yoga teacher.

Discovering and refining capabilities within GPT-4o

The models are variants of OpenAI’s existing GPT-4o model, launched in May 2024, which powers the ChatGPT text and voice experience for many users. The company took that base model and trained it with additional data so it excels at transcription and speech. The company did not specify when the models will come to ChatGPT itself. Harris said ChatGPT has slightly different requirements in terms of cost and performance trade-offs, and that while he expects ChatGPT to move to these models over time, this launch is aimed at API users. The new models are intended to supersede OpenAI’s two-year-old open-source Whisper speech-to-text model, offering lower word error rates on industry benchmarks along with improved performance with accents, varied speech speeds and noisy environments.

On its website, the company published a chart showing how much lower the gpt-4o-transcribe models’ word error rates are compared to Whisper’s across languages. The English error rate is impressively low, at just 2.46%.

“These models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy,” said Harris.

Harris told VentureBeat that the new gpt-4o-transcribe model family is not designed to offer “diarization,” the capability to label and differentiate between different speakers. Instead, it is designed primarily to receive one (or possibly multiple) voices as a single input channel and respond to all the input with a single output voice in that interaction, however long it takes.

The company is also hosting a competition for the general public to find the most creative examples of using its demo site OpenAI.fm and share them online by tagging the @OpenAI account on X. The winner will receive a custom Teenage Engineering radio bearing the OpenAI logo, which OpenAI head of product, platform Olivier Godement says is one of only three in the world.

A goldmine of audio applications

These enhancements make the models particularly well suited to applications such as customer call centers, meeting note transcription, and AI-powered personal assistants.

The company’s Agents SDK, released last week, allows developers who have already built apps on top of its text-based large language models, like the regular GPT-4o, to add fluid voice interactions with only “nine lines” of code, according to a presenter during an OpenAI YouTube livestream announcing the new models. With the new audio models, an e-commerce app built on GPT-4o can now respond to spoken user questions like “tell me about my last orders” out loud, with only a few seconds of code tweaking. “For the first time, we’re introducing streaming speech-to-text, allowing developers to continuously input audio and receive a real-time text stream, making conversations feel more natural,” Harris explained.
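As a rough illustration of what that looks like in practice, the sketch below wires an existing text agent into a voice pipeline. The class names (`VoicePipeline`, `SingleAgentVoiceWorkflow`, `AudioInput`) and event type follow the Agents SDK’s voice extension as described around the launch, but should be treated as assumptions to verify against the openai-agents documentation.

```python
# Rough sketch of adding voice to an existing text agent with the Agents SDK.
# Requires: pip install "openai-agents[voice]" (plus numpy for the dummy buffer).
# Class names and the event type below are assumptions based on the launch
# description; confirm them against the SDK docs before relying on this.
import asyncio
import numpy as np

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

agent = Agent(
    name="Storefront assistant",
    instructions="Answer questions about the customer's recent orders.",
)

async def main() -> None:
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
    # Three seconds of silence stands in for real microphone audio here.
    audio = AudioInput(buffer=np.zeros(24000 * 3, dtype=np.int16))
    result = await pipeline.run(audio)
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # play event.data through the speaker in a real app

asyncio.run(main())
```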

For developers looking to build real-time AI voice experiences with low latency, OpenAI recommends its speech-to-speech models in the Realtime API.

Pricing and availability

The new models are available immediately via OpenAI’s API, with pricing as follows:

* gpt-4o-transcribe: $6.00 per 1M audio input tokens (~$0.006 per minute)

* gpt-4o-mini-transcribe: $3.00 per 1M audio input tokens (~$0.003 per minute)

* gpt-4o-mini-tts: $0.60 per 1M text input tokens, $12.00 per 1M audio output tokens (~$0.015 per minute)
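For budgeting purposes, the token rates above translate to roughly the per-minute figures in parentheses. The quick estimate below works that out for a transcription job; the roughly 1,000 audio tokens per minute it assumes is inferred from OpenAI’s own per-minute approximation, not an official constant.

```python
# Back-of-the-envelope cost estimate for gpt-4o-transcribe, using the listed
# rate of $6.00 per 1M audio input tokens (~$0.006/minute). The implied
# ~1,000 audio tokens per minute is inferred from that approximation.
PRICE_PER_MILLION_AUDIO_TOKENS = 6.00    # USD, gpt-4o-transcribe
APPROX_AUDIO_TOKENS_PER_MINUTE = 1_000   # implied by ~$0.006 per minute

def estimated_cost_usd(minutes_of_audio: float) -> float:
    tokens = minutes_of_audio * APPROX_AUDIO_TOKENS_PER_MINUTE
    return tokens / 1_000_000 * PRICE_PER_MILLION_AUDIO_TOKENS

# A one-hour meeting recording would cost roughly $0.36 to transcribe.
print(f"${estimated_cost_usd(60):.2f}")
```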

However, the new models arrive at a time of fiercer-than-ever competition in the AI transcription and speech space, with dedicated speech AI firms such as ElevenLabs offering its new Scribe model, which supports diarization and boasts a similarly low (though not as low) English error rate of 3.3%. It costs $0.006 per minute (or $0.40 per hour) of input audio.

Hume AI offers Octave TTS, with customizable pronunciation and emotional inflection based on the user’s own instructions rather than pre-set voices. Octave’s pricing isn’t directly comparable, but there is a free tier offering 10 minutes of audio, with costs increasing from there.

Meanwhile, more advanced audio models are coming to the open-source community, including Orpheus 3B, which is available under a permissive Apache 2.0 license, meaning developers don’t have to pay to run it, provided they have the right hardware or cloud servers.

Early results and industry adoption

According to testimonials OpenAI shared with VentureBeat, a number of companies have already integrated the new audio models into their platforms and report significant improvements in voice AI performance.

EliseAI, a company specializing in property management automation, found that OpenAI’s text-to-speech model enabled more natural and emotionally rich interactions with tenants.

The enhanced voice made AI-powered leasing and maintenance conversations more engaging, leading to higher tenant satisfaction and improved call resolution rates. Decagon, a company that creates AI-powered voice experiences for customers, saw a 30% increase in transcription accuracy when using OpenAI’s speech recognition model.

That increase in accuracy allows Decagon’s AI agents to perform more reliably in real-world scenarios, even in noisy environments. Decagon integrated the new model within a single day.

Not all reactions to OpenAI’s newest release have been positive, however. Ben Hylak, co-founder of app analytics software maker Dawn AI and a former Apple human interfaces designer, posted on X that while the models look promising, the announcement feels like a retreat from real-time voice, suggesting a shift away from OpenAI’s prior focus on low-latency conversational AI through ChatGPT.

The launch was also preceded by an early leak on X (formerly Twitter). TestingCatalog News (@testingcatalog) posted details of the new models several minutes before the official announcement, listing the names gpt-4o-mini-tts, gpt-4o-transcribe and gpt-4o-mini-transcribe. The leak was credited to @StivenTheDev, and the post quickly gained traction.

Looking ahead, OpenAI plans to continue refining its audio models and exploring custom voice capabilities while ensuring safety and responsible AI use. Beyond audio, the company is also investing in multimodal AI, including video, to enable more dynamic and interactive experiences.
