Baidu just dropped an open-source multimodal AI that it claims beats GPT-5 and Gemini

November 12, 2025

China’s leading search engine giant has unveiled a cutting-edge artificial intelligence model that its creators assert surpasses rivals from Google and OpenAI on multiple vision-related benchmarks, all while utilizing a fraction of the computational power typically demanded by such advanced systems.

Named ERNIE-4.5-VL-28B-A3B-Thinking, this AI represents a significant leap in the race to develop multimodal systems capable of interpreting and reasoning over images, videos, and documents in addition to text. These capabilities are becoming increasingly vital for enterprise applications such as automated document analysis, industrial inspection, and complex workflow automation.

Revolutionizing Visual Reasoning with Human-Like Image Interaction

What distinguishes Baidu’s latest model is its innovative “Thinking with Images” feature, which enables the AI to dynamically zoom in and out of images, closely emulating human visual problem-solving strategies. Unlike conventional vision-language models that analyze images at a fixed resolution, this dynamic approach allows the system to capture both broad context and intricate details, enhancing its ability to interpret complex visuals such as technical schematics or subtle manufacturing defects.
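
The zoom behavior described above can be pictured as a crop-and-magnify step that a model might request on a region of interest. The sketch below is only an illustration of that idea on a toy grid of pixel values; the function names and the nearest-neighbor upscaling are assumptions for the example, not Baidu's implementation.

```python
def crop(image, left, top, right, bottom):
    """Return the sub-grid covering columns [left, right) and rows [top, bottom)."""
    return [row[left:right] for row in image[top:bottom]]

def upscale_nearest(image, factor):
    """Magnify a grid by repeating every pixel `factor` times in both directions."""
    out = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def zoom(image, box, factor):
    """Crop to `box` (left, top, right, bottom), then magnify the crop."""
    return upscale_nearest(crop(image, *box), factor)

# A toy 4x4 "image" whose pixel values are 0..15, row by row.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Zoom into the central 2x2 region at 2x magnification.
detail = zoom(image, box=(1, 1, 3, 3), factor=2)
```

Here `detail` is again a 4x4 grid, but it now covers only the central region, so each original pixel occupies four cells. A real pipeline would apply the same idea to image tensors and feed the magnified crop back to the model alongside the full view.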

Additionally, the model boasts advanced visual grounding capabilities, enabling precise identification and localization of objects within images. This functionality is particularly promising for applications in robotics, warehouse automation, and quality assurance, where accurate object detection and spatial understanding are critical.
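
Localization quality in grounding tasks is commonly scored with intersection-over-union (IoU) between a predicted box and a reference box. The sketch below shows that standard metric; the `(x1, y1, x2, y2)` box format and the sample coordinates are assumptions for illustration, and the article does not say which metric Baidu used.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou(pred, truth):
    """Intersection-over-union of two boxes; 1.0 is a perfect match, 0.0 is no overlap."""
    inter = box_area((
        max(pred[0], truth[0]),
        max(pred[1], truth[1]),
        min(pred[2], truth[2]),
        min(pred[3], truth[3]),
    ))
    union = box_area(pred) + box_area(truth) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted and reference boxes, in pixel coordinates.
predicted = (0, 0, 2, 2)
reference = (1, 1, 3, 3)
score = iou(predicted, reference)  # overlap of 1 over a union of 7
```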

Efficiency Through Mixture-of-Experts Architecture

ERNIE-4.5-VL-28B-A3B-Thinking uses a mixture-of-experts (MoE) architecture that activates only 3 billion of its 28 billion parameters for each token it processes. This selective activation significantly reduces computational overhead, allowing the model to run on a single 80GB GPU, a hardware configuration accessible to many enterprises without costly multi-GPU setups.
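
To make the selective-activation idea concrete, here is a minimal sketch of top-k expert routing, the general technique behind MoE layers. The four toy experts, the gate scores, and the choice of k=2 are all made up for illustration; the article does not describe ERNIE's actual expert count or router.

```python
import math

def softmax(scores):
    """Turn raw gate scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))

# Toy experts: each one just scales its input by a different constant.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_scores = [0.1, 2.0, 0.3, 1.5]  # hypothetical router output for one token

selected = route_top_k(gate_scores, k=2)
output = moe_forward(1.0, experts, gate_scores, k=2)

# Share of parameters doing work per token, using the article's figures.
active_fraction = 3 / 28  # roughly 11 percent
```

Only experts 1 and 3 run for this token; the other two are skipped entirely, which is where the compute savings come from.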

The model’s training incorporated advanced multimodal reinforcement learning techniques, including GSPO and IcePop strategies, combined with dynamic difficulty sampling to optimize learning efficiency. This rigorous training regimen, coupled with a vast and diverse dataset of high-quality visual-language reasoning examples, has enhanced the model’s semantic alignment between text and images.

Performance Highlights and Enterprise Potential

Baidu claims that ERNIE-4.5-VL-28B-A3B-Thinking outperforms Google’s Gemini 2.5 Pro and OpenAI’s GPT-5 High on tasks involving document comprehension, chart analysis, and visual reasoning. While independent validation is still awaited, these assertions have sparked considerable interest within the AI community.

Key capabilities include multi-step visual reasoning, causal inference in complex visual contexts, and solving STEM problems from photographs. The model also excels in video understanding, demonstrating strong temporal awareness and event localization across video segments.

Its efficiency and open Apache 2.0 license make it particularly attractive for mid-sized companies and startups, enabling deployment without prohibitive infrastructure costs or restrictive licensing terms. This democratization of advanced AI tools could accelerate adoption across industries such as finance, manufacturing, and customer service.

Integration and Developer Support

Baidu supports integration with popular AI frameworks such as Hugging Face Transformers and ONNX Runtime, as well as its own PaddlePaddle framework. Developers can implement the model with minimal code, facilitating rapid prototyping and deployment.

For production environments requiring high throughput, Baidu offers vLLM integration and a dedicated inference toolkit that supports quantization techniques to optimize memory usage and speed. These tools enable enterprises to tailor deployments to their specific hardware and performance needs.

Strategic Implications in the Global AI Landscape

This release is part of Baidu’s broader ERNIE 4.5 family, which spans models from a massive 424 billion parameter MoE variant to a compact 0.3 billion parameter dense model. The heterogeneous modality design allows shared parameters across modalities while maintaining dedicated parameters for each, addressing a key challenge in multimodal AI development by preserving and enhancing performance across both visual and textual tasks.

By offering a high-performing, open-source alternative to proprietary models from Western tech giants, Baidu is positioning itself as a formidable contender in the international AI arena. This move signals a shift toward more accessible, enterprise-friendly AI solutions that balance power, efficiency, and openness.

Considerations and Challenges for Enterprise Adoption

Despite its advantages, deploying ERNIE-4.5-VL-28B-A3B-Thinking requires careful consideration. The 80GB GPU memory requirement, while more accessible than some competitors, still represents a significant investment. Organizations lacking existing GPU infrastructure may need to rely on cloud services, which introduces ongoing costs.
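
A rough back-of-the-envelope calculation shows why an 80GB card is the relevant threshold. Although only 3 billion parameters are active at a time, all 28 billion must sit in memory. The sketch below estimates weight storage alone at a few precisions; the precision choices are assumptions, 1 GB is treated as 10^9 bytes, and the figures ignore activations, the KV cache, and framework overhead.

```python
TOTAL_PARAMS = 28e9      # total parameter count from the article
GPU_MEMORY_GB = 80.0     # single-GPU capacity cited in the article

# Bytes per parameter for some common storage formats (assumed, not confirmed by Baidu).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

footprints = {
    name: weight_memory_gb(TOTAL_PARAMS, nbytes)
    for name, nbytes in BYTES_PER_PARAM.items()
}
headroom = {name: GPU_MEMORY_GB - gb for name, gb in footprints.items()}

for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB weights, {headroom[name]:.0f} GB left on an 80 GB card")
```

Under these assumptions, 16-bit weights take about 56 GB, leaving roughly 24 GB for everything else, while quantized weights would leave considerably more room.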

The model’s 128K token context window is substantial but may still be too small for extremely long documents or extended video content. Furthermore, details on safety measures, bias mitigation, and robustness against adversarial inputs remain sparse, underscoring the need for thorough internal testing before production use.
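
When an input exceeds the context window, one common workaround is to split it into consecutive windows and process each separately. The sketch below illustrates that bookkeeping; it assumes "128K" means 131,072 tokens, and the `reserve` budget for instructions and generated output is a hypothetical value, not a documented setting.

```python
def split_into_windows(num_tokens, window=131_072, reserve=4_096):
    """Return (start, end) token ranges that each fit within the usable context budget.

    `reserve` is the number of tokens held back for the prompt and the model's answer.
    """
    budget = window - reserve
    if budget <= 0:
        raise ValueError("reserve must be smaller than the context window")
    return [
        (start, min(start + budget, num_tokens))
        for start in range(0, num_tokens, budget)
    ]

# A hypothetical 300,000-token document needs three passes.
windows = split_into_windows(300_000)
```

Splitting this way loses cross-window context, so a real pipeline would typically add overlap between windows or a summarization step to carry information forward.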

Additionally, the mixture-of-experts routing mechanism adds complexity to deployment, necessitating infrastructure capable of efficiently directing inputs to the appropriate subnetworks. The dynamic image zoom feature also requires integration with image manipulation tools to unlock its full potential.

Community Reception and Future Outlook

The AI developer community has greeted the model with enthusiasm tempered by practical requests, such as support for lightweight deployment formats like GGUF and MNN to enable mobile and edge device usage. Many have praised Baidu’s technical innovations while seeking further resources and documentation.

Baidu plans to present more comprehensive insights into the ERNIE series at its upcoming AI conference, where it is expected to share performance validations and roadmap details.

As enterprises increasingly seek versatile, cost-effective AI solutions for visual understanding and reasoning, Baidu’s ERNIE-4.5-VL-28B-A3B-Thinking offers a compelling option. Its combination of efficiency, open licensing, and advanced capabilities could reshape the competitive landscape and accelerate AI adoption across diverse sectors.

In the words of one developer: “Open source with commercial freedom is a game-changer. Baidu is clearly serious about leading the next wave of AI innovation.”
