Meta announced today a partnership with Cerebras Systems to power its new Llama API, offering developers access to inference speeds up to 18 times faster than traditional GPU-based solutions.
The announcement, made at Meta’s inaugural LlamaCon developer conference in Menlo Park, positions the company to compete directly with OpenAI, Anthropic, and Google in the rapidly expanding AI inference market, where developers buy tokens by the billions to power their applications.
“Meta selected Cerebras as a collaborator to deliver the ultra-fast inference they need to serve their developers through their new Llama API,” said Julie Shin Choi during a Cerebras press briefing. “We are very excited to announce Cerebras’ first CSP hyperscaler partnership, delivering ultra-fast computation to all developers.” While Meta’s Llama models have accumulated over one billion downloads, until now the company has not offered developers a first-party cloud infrastructure for building applications on them.
“This is exciting, even without talking about Cerebras specifically,” said James Wang, a senior executive at Cerebras. “OpenAI, Anthropic, and Google have built an entirely new business from scratch: the AI inference business. Developers building AI apps buy tokens by the millions, sometimes by the billions. These are like the new compute instructions that people need to build AI applications.”
The Cerebras system supercharges Llama models
What distinguishes Meta’s offering is the dramatic speed boost provided by Cerebras’ specialized AI chips. According to benchmarks from Artificial Analysis, the Cerebras system delivers over 2,600 tokens per second on Llama 4 Scout, compared with roughly 130 tokens per second for ChatGPT and 25 tokens per second for DeepSeek.
“If you compare API to API, Gemini and GPT are all great models, but they run at GPU speed, which is about 100 tokens per second,” Wang explained. “And 100 tokens per second is okay for chat, but it is very slow for reasoning. It is slow for agents. And people are struggling with that today.”
The speed advantage enables entirely new categories of applications: real-time agents, conversational low-latency voice systems, interactive code generation, and instant multi-step reasoning, all of which require chaining multiple large language model calls that can now be completed in seconds rather than minutes. A rough calculation, sketched below, shows why the difference compounds.
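As a back-of-the-envelope illustration only: the five-step pipeline and per-step token counts below are assumptions chosen for the example, while the throughput figures come from the benchmarks cited above.

```python
# Illustrative latency estimate for a chained agent pipeline.
# STEPS and TOKENS_PER_STEP are assumed values, not measurements;
# the tokens-per-second figures are the benchmark numbers cited above.

STEPS = 5              # hypothetical agent steps, each one LLM call
TOKENS_PER_STEP = 500  # assumed output tokens generated per step

for name, tokens_per_sec in [("GPU-based API (~100 tok/s)", 100),
                             ("Cerebras (~2,600 tok/s)", 2600)]:
    total_seconds = STEPS * TOKENS_PER_STEP / tokens_per_sec
    print(f"{name}: {total_seconds:.1f} s for {STEPS} chained calls")

# Output:
# GPU-based API (~100 tok/s): 25.0 s for 5 chained calls
# Cerebras (~2,600 tok/s): 1.0 s for 5 chained calls
```

A multi-step agent that feels unusable at nearly half a minute per task becomes interactive at around one second, which is the shift the article describes from minutes to seconds.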
The Llama API marks a significant shift in Meta’s AI strategy, transitioning the company from a model provider into a full-service AI infrastructure company. By offering an API service, Meta creates a revenue stream while maintaining its commitment to open models.
Wang noted during the press conference that Meta is now in the business of selling tokens, calling it “great for the American type of AI ecosystem.” “They bring a great deal to the table,” he added.
The API will offer tools for fine-tuning and evaluation, starting with the Llama 3.3 8B model, allowing developers to generate data, train on it, and test their custom models. Meta stresses that it will not use customer data to train its own models, and models built with the Llama API can be transferred to other hosts, a clear distinction from some competitors’ more closed approaches.

Cerebras will power Meta’s service through its network of data centers located across North America, including facilities in Dallas, Oklahoma, Minnesota, Montreal, and California. Choi stated that “all of our data centers which serve inference are currently in North America.”
“We will serve Meta with Cerebras’ full capacity,” she added, with the workload balanced across all of these data centers.
Choi described the business arrangement as “the classical compute provider to a hyperscaler” model, similar to the way Nvidia supplies hardware to major cloud service providers. “They are reserving blocks of our compute so they can serve their developer community,” she said.
Beyond Cerebras, Meta has also announced a partnership with Groq to provide developers with fast inference options, giving them high-performance alternatives to traditional GPU-based inference.
Meta’s entry into the inference API market with superior performance metrics may disrupt the established order dominated by OpenAI, Google, and Anthropic. By combining the popularity of its open-source models with dramatically faster inference capabilities, Meta is positioning itself as a formidable competitor in the commercial AI space. Cerebras’ materials note that Meta is in a unique position, with 3 billion users, hyperscale data centers, and a large developer ecosystem, and that the integration of Cerebras technology “helps Meta leapfrog OpenAI by approximately 20x in performance.”
For Cerebras, this partnership represents a significant milestone and a validation of its specialized AI hardware approach. “We’ve been building this wafer-scale engine for years, and we always knew the technology was first-rate, but ultimately it had to end up in someone else’s cloud. That was the ultimate goal from a commercial strategy perspective, and we have finally achieved that milestone,” Wang said.
Wang explained that a developer who may never have heard of Cerebras can simply generate an API key and select the Cerebras option in Meta’s SDK. “All of a sudden, their tokens are being processed on a giant wafer-scale engine,” he said. “That’s just tremendous to be on the backend of Meta’s entire developer ecosystem.”
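The article does not publish the SDK’s actual interface, so the following is only a minimal sketch of what that developer flow might look like: the endpoint URL, model identifier, and “provider” field are all assumptions, written in the style of a typical OpenAI-compatible chat API rather than Meta’s real specification.

```python
# Hypothetical sketch of the flow Wang describes: create an API key,
# pick Cerebras as the inference provider, and call a Llama model.
# The URL, model name, and "provider" field are assumptions; Meta's
# actual Llama API may expose this differently.
import os

import requests

API_KEY = os.environ["LLAMA_API_KEY"]  # key generated in Meta's developer console (assumed)

resp = requests.post(
    "https://api.llama.example/v1/chat/completions",  # placeholder endpoint, not a real URL
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-4-scout",  # assumed model identifier
        "provider": "cerebras",    # hypothetical flag selecting Cerebras-backed inference
        "messages": [{"role": "user", "content": "Summarize today's LlamaCon news."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The point of the flow, per Wang, is that routing a request onto wafer-scale hardware becomes a one-line choice for the developer rather than an infrastructure project.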
Meta’s choice of specialized silicon signals something profound: in the next phase of AI, it is not just what models know, but how quickly they can think. In that future, speed will be more than a feature. It will be the entire point.