This article is part of VentureBeat’s “The Real Cost of AI: Performance, Efficiency and ROI at Scale” special issue.
The advent of large language models (LLMs) has allowed enterprises to envision the kinds of projects they could undertake, leading to a surge in pilot programs now moving into deployment. As these projects gained momentum, however, enterprises realized that the LLMs they had been using were unwieldy and, worse, expensive.
Enter small language models and distillation. Models like Google’s Gemma family, Microsoft’s Phi and Mistral’s Small 3.1 allow businesses to choose fast, accurate models tailored to specific tasks. Enterprises can opt for a smaller model for particular use cases, allowing them to lower the cost of running their AI applications and potentially achieve a better return on investment.
Karthik Ramgopal, distinguished engineer at LinkedIn, told VentureBeat why companies choose smaller models.
Ramgopal said smaller models require less compute and memory and offer faster inference times, which translates into lower infrastructure OPEX and CAPEX, given GPU costs, availability and power requirements. “Task-specific models have a narrower scope, making their behavior more aligned and maintainable over time without complex prompt engineering,” he said. Pricing reflects the difference: OpenAI’s o4-mini costs $1.10 per million input tokens and $4.40 per million output tokens, compared with the full o3 at $10 per million input tokens and $40 per million output tokens.
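To see how those per-token prices compound at scale, here is a rough sketch. The prices come from the figures above; the request sizes and monthly volume are hypothetical placeholders:

```python
# Rough per-request cost comparison using the per-million-token prices
# quoted above (o4-mini vs. o3). Request sizes and volume are hypothetical.

PRICES = {  # USD per 1M tokens: (input, output)
    "o4-mini": (1.10, 4.40),
    "o3": (10.00, 40.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of a single request for a given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 2,000 input tokens and 500 output tokens per
# request, at 1 million requests per month.
for model in PRICES:
    per_request = request_cost(model, 2_000, 500)
    print(f"{model}: ${per_request:.4f}/request, "
          f"${per_request * 1_000_000:,.0f}/month")
```

Under these assumed request sizes, the workload costs roughly $4,400 a month on o4-mini versus about $40,000 on o3 — the same ~9x gap as the per-token prices.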
Enterprises have more options than ever, including small, task-specific and distilled models. Most flagship models now come in a range of sizes. For example, Anthropic’s Claude family includes Claude Opus (the largest model), Claude Sonnet (the all-purpose version) and Claude Haiku (the smallest). The smallest of these are compact enough to run on portable devices such as laptops and mobile phones.
The savings question
Whenever we discuss return on investment (ROI), the question is: what does ROI actually look like? Is it the return on costs incurred, or the time saved that ultimately translates into dollars saved? Experts VentureBeat spoke to said ROI can be hard to judge: some companies consider ROI achieved once they have cut the time spent on a particular task, while others wait for actual dollars saved, or new business won, before deciding whether their AI investments are really working.
Enterprises typically calculate ROI with a simple formula, as Ravi Naarla, chief technologist at Cognizant, described in a post: ROI = (benefits − costs) / costs. But with AI programs, the benefits are not always immediately apparent. He advises enterprises to identify the benefits they hope to achieve, estimate them based on historical data, and be realistic about the overall cost of AI, including hiring, implementation and maintenance.
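Naarla’s formula is straightforward to apply. A minimal sketch, with dollar figures that are purely illustrative rather than from any real deployment:

```python
def roi(benefits, costs):
    """ROI = (benefits - costs) / costs, per the formula above."""
    return (benefits - costs) / costs

# Illustrative numbers only: an AI program estimated to save $180,000 a
# year, against $120,000 in total costs (hiring, implementation,
# maintenance).
print(f"ROI: {roi(180_000, 120_000):.0%}")  # prints "ROI: 50%"
```

The hard part, as Naarla notes, is not the arithmetic but estimating the `benefits` input honestly before the returns have materialized.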
Small models are said to reduce implementation and maintenance costs, especially when the models are fine-tuned to provide more context for your business.
Arijit Sengupta, founder and CEO of Aible, said that how people give models context determines how much they can save. Those who need to supply additional context in prompts, such as lengthy and complex instructions, can incur higher token costs.
“You have to give models context one way or another; there is no free lunch,” he said. With large models, this is usually done through the prompt. “Think of fine-tuning and post-training as alternative ways of giving models context. I might incur $100 of post-training costs, but it’s not astronomical.”
Sengupta said Aible has seen cost reductions of up to 100x just from post-training, a figure that includes software operating expenses as well as the ongoing cost of the model and vector databases.
He cautioned, however, that small models must be post-trained to produce results comparable to large ones, and that maintenance gets expensive if it relies on human experts.
Experiments Aible conducted showed that a task-specific, fine-tuned model performs well on some use cases, comparably to LLMs — making the case that deploying several small models for different use cases, rather than one large model to do everything, can be more cost-effective.
The company compared a post-trained version of Llama-3.3-70B-Instruct with a smaller 8B-parameter option of the same model. The 70B model was 84% accurate in automated evaluations and 92% accurate in manual evaluations. Once fine-tuned at a cost of $4.58, the 8B model achieved 82% accuracy in manual assessment, which would be suitable for smaller, more targeted use cases.
Right-sizing models doesn’t have to come at the expense of performance. Organizations now understand that model choice isn’t just a decision between GPT-4o and Llama 3.1; it’s knowing that some use cases, such as summarization or code generation, are better served by a small model.
Daniel Hoske, chief technology officer at contact center AI products provider Cresta, said starting development with LLMs gives a better sense of potential cost savings.
“You should begin with the largest model to determine whether the concept you have in mind will work at all, because if it doesn’t work with the largest model, it won’t work with smaller ones,” he said. Ramgopal said LinkedIn follows a similar approach, because prototyping is how these issues get identified.
“Our typical approach for agentic use cases begins with general-purpose LLMs, as their broad generalization ability allows us to rapidly prototype, validate hypotheses and assess product-market fit,” Ramgopal said. “As the product matures and we encounter limitations around quality, latency or cost, we transition to more customized solutions.”
During the experimentation phase, organizations can determine what matters most in their AI applications. This allows developers to plan what they are willing to trade off and to choose the model size that best suits their purpose and budget.
The experts cautioned that while it’s important to build with models that fit what they’re developing, high-parameter LLMs will always be more expensive; large models will always require significant computing power.
Overusing small and task-specific models poses its own problems. Rahul Pathak, vice president of data and AI GTM at AWS, said in a recent blog post that the key to cost optimization is not defaulting to a low-powered model, but matching a model’s capabilities to the task. Smaller models may not have a large enough context window to understand complex instructions, which could lead to increased workload for human employees and higher costs.
Sengupta also warned that some distilled models can be brittle, so long-term use may not result in savings.
Regardless of model size, industry players stressed the importance of flexibility, both to address potential issues and to adapt to new use cases. If an organization starts with a large model and later finds a smaller one that performs better or costs less, it cannot be too precious about its original choice. Tessa Burg, CTO and head of innovation at brand marketing company Mod Op, told VentureBeat that organizations must accept that whatever they build today will be replaced by a better version.
“We started out with the mindset that whatever tech underlies the workflows we create, the processes we make more efficient, will change. We knew that whichever model we used would be the worst version we would use,” Burg said. She added that time saved compounds into budget savings over the long run, and that it’s a good idea to separate out heavyweight models for high-cost, high-frequency use cases.
Sengupta noted that vendors are now making it easier to switch between models, but cautioned that users should look for platforms that also make fine-tuning easy, so they don’t incur additional costs.

