OpenAI’s GPT-4o: Advancing Multimodal AI with Enhanced Safety and Performance
OpenAI has unveiled its latest flagship model, GPT-4o, which integrates text, audio, and visual inputs and outputs, aiming to enhance the naturalness of machine interactions.
GPT-4o, with “o” representing “omni,” is engineered to accommodate a wide range of input and output modalities. “It accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs,” OpenAI stated.
Users can look forward to a response time as fast as 232 milliseconds, comparable to human conversational speed, with an average response time of 320 milliseconds.
Pioneering Capabilities
The introduction of GPT-4o signifies a significant advancement from its predecessors by handling all inputs and outputs through a single neural network. This design allows the model to maintain crucial information and context that were previously lost in the separate model pipelines used in earlier versions.
Previously, ‘Voice Mode’ dealt with audio interactions with latencies of 2.8 seconds for GPT-3.5 and 5.4 seconds for GPT-4. The older setup utilized three distinct models: one for transcribing audio to text, another for generating textual responses, and a third for converting text back to audio. This division led to the loss of nuances such as tone, multiple speakers, and background noise.
As a unified solution, GPT-4o shows significant improvements in understanding vision and audio. It can undertake more complex tasks like harmonizing songs, providing real-time translations, and even generating outputs with expressive elements such as laughter and singing. Examples of its extensive capabilities include interview preparation, on-the-fly language translation, and customer service response generation.
Nathaniel Whittemore, Founder and CEO of Superintelligent, remarked: “Product announcements tend to be more contentious than technology announcements because it’s harder to determine if a product is genuinely different until you actually use it. Especially with a new mode of human-computer interaction, there’s more room for varied opinions on its utility.
“Nevertheless, the absence of a GPT-4.5 or GPT-5 announcement is diverting attention from the significant technological advancement of this model. It’s not just a text model with voice or image additions; it’s inherently multimodal. This opens up a wide range of use cases that will take time to be fully appreciated.”
Performance and Safety
GPT-4o matches GPT-4 Turbo’s performance in English text and coding tasks but excels significantly in non-English languages, making it a more inclusive and versatile model. It sets a new benchmark in reasoning with a high score of 88.7% on 0-shot COT MMLU (general knowledge questions) and 87.2% on the 5-shot no-CoT MMLU.
The model also outperforms previous state-of-the-art models like Whisper-v3 in audio and translation benchmarks. In multilingual and vision evaluations, it demonstrates superior performance, enhancing OpenAI’s capabilities in these areas.
OpenAI has embedded robust safety measures in GPT-4o from the start, using techniques to filter training data and refining behavior through post-training safeguards. The model has been assessed using a Preparedness Framework and adheres to OpenAI’s voluntary commitments. Evaluations in areas like cybersecurity, persuasion, and model autonomy indicate that GPT-4o does not exceed a ‘Medium’ risk level in any category.
Further safety evaluations involved extensive external red teaming with over 70 experts in fields such as social psychology, bias, fairness, and misinformation. This comprehensive review aims to mitigate risks associated with the new modalities of GPT-4o.
Availability and Future Integration
Starting today, GPT-4o’s text and image capabilities are available in ChatGPT, including a free tier and enhanced features for Plus users. A new Voice Mode powered by GPT-4o will begin alpha testing within ChatGPT Plus in the coming weeks.
Developers can access GPT-4o through the API for text and vision tasks, benefiting from its doubled speed, halved price, and increased rate limits compared to GPT-4 Turbo.
OpenAI plans to extend GPT-4o’s audio and video functionalities to a select group of trusted partners via the API, with a broader rollout expected soon. This phased release strategy aims to ensure thorough safety and usability testing before making the full range of capabilities publicly available.
“It’s hugely significant that they’ve made this model available for free to everyone, as well as making the API 50% cheaper. That is a massive increase in accessibility,” Whittemore explained.
OpenAI encourages community feedback to continuously improve GPT-4o, emphasizing the importance of user input in identifying and addressing gaps where GPT-4 Turbo might still outperform.
