In the evolving landscape of artificial intelligence, the most pressing issue today isn’t the growth of model sizes or the integration of multiple data types: it’s the looming challenge of capacity constraints. At a recent AI Impact event in New York City, Chief AI Officer Val Bercovici joined VentureBeat CEO Matt Marshall to explore the complexities of scaling AI amid rising latency, cloud dependency, and escalating operational costs.
Understanding the Emerging Cost Dynamics in AI
Bercovici argued that AI is approaching a critical economic inflection point reminiscent of the surge pricing popularized by ride-sharing platforms like Uber. Just as Uber introduced dynamic, real-time pricing driven by demand, AI, particularly in inference workloads, is on the cusp of a similar transformation, one in which profitability will dictate pricing structures.
“Currently, AI services operate under subsidized pricing models, which have been essential to fuel innovation,” Bercovici explained. “However, with capital expenditures reaching into the trillions and the energy to power them finite, true market-driven pricing will inevitably emerge, possibly as soon as next year and almost certainly by 2027. This shift will revolutionize the industry, intensifying the imperative for efficiency.”
The Token Economy: Balancing Volume, Accuracy, and Speed
At the heart of AI’s economic model lies the exponential value generated by increasing token usage. “In AI, more tokens translate directly into greater business impact,” Bercovici noted. Yet sustaining this growth is a formidable challenge. The traditional business triangle of cost, quality, and speed manifests in AI as expense, output accuracy, and latency, and of the three, accuracy is non-negotiable.
This demand for precision is critical not only for consumer-facing applications like conversational agents but also for mission-critical sectors such as pharmaceutical research, financial compliance, and healthcare, where errors can have severe consequences.
“High token counts are essential to achieve the inference accuracy required, especially when incorporating security layers, guardrails, and quality assurance models,” Bercovici said. “This inevitably forces a trade-off between latency and cost. For some consumer applications, higher latency can be tolerated, enabling lower-cost or even free service tiers.”
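To make that trade-off concrete, consider a back-of-the-envelope sketch of a guarded inference pipeline. Every stage name, token count, price, and latency below is a hypothetical placeholder rather than any vendor’s actual figure; the point is simply that guardrail and evaluator passes multiply both the token bill and the serial latency of a single request.

```python
# Back-of-the-envelope model of the latency/cost/accuracy trade-off.
# All stages, token counts, prices, and latencies are hypothetical
# placeholders, not any provider's actual rates.

PIPELINE = [
    # (stage, tokens per request, $ per 1M tokens, latency in seconds)
    ("input guardrail",   500,  0.50, 0.15),
    ("primary model",    4000, 10.00, 1.20),
    ("evaluator model",  1500,  2.00, 0.40),
    ("output guardrail",  500,  0.50, 0.15),
]

cost = sum(tokens / 1e6 * price for _, tokens, price, _ in PIPELINE)
latency = sum(lat for *_, lat in PIPELINE)

print(f"tokens per request: {sum(t for _, t, *_ in PIPELINE):,}")
print(f"cost per request:   ${cost:.4f}")
print(f"serial latency:     {latency:.2f}s")
```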
Latency: The Bottleneck in Agent-Based AI Systems
Latency emerges as a significant constraint, particularly in complex AI agent ecosystems. “Modern AI agents rarely operate in isolation,” Bercovici explained. “Instead, they function as swarms: collaborative groups of agents working in parallel to accomplish complex tasks.”
Within these swarms, a central orchestrator agent, typically the most advanced model, assigns subtasks, manages architectural decisions, and balances execution environments, whether cloud-based or on-premises, while considering performance and security requirements. The swarm then executes these subtasks concurrently, with evaluator models assessing the overall success.
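In code, that orchestrate, fan out, evaluate loop reduces to a familiar concurrency pattern. The sketch below is illustrative only: call_model is a stub standing in for real LLM API calls, and the four-way task split is arbitrary.

```python
import asyncio
import random

async def call_model(role: str, prompt: str) -> str:
    """Stub for an LLM API call; sleeps to simulate network latency."""
    await asyncio.sleep(random.uniform(0.1, 0.5))
    return f"[{role}] response to: {prompt}"

async def run_swarm(task: str) -> list[str]:
    # The orchestrator (typically the strongest model) decomposes the task.
    plan = await call_model("orchestrator", f"decompose: {task}")
    subtasks = [f"{plan} / subtask {i}" for i in range(4)]

    # Worker agents execute subtasks concurrently, not one by one.
    results = await asyncio.gather(*(call_model("worker", s) for s in subtasks))

    # An evaluator model assesses the combined output before returning.
    verdict = await call_model("evaluator", " | ".join(results))
    return [*results, verdict]

print(asyncio.run(run_swarm("draft the quarterly report")))
```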
“These swarms engage in multiple iterative cycles, often involving hundreds or thousands of prompt-response exchanges before converging on a solution,” Bercovici said. “Even minor delays compounded over thousands of interactions can render the process impractical, underscoring why latency reduction is paramount. Currently, this necessitates paying premium, often subsidized, prices that must decrease over time.”
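The compounding is easy to quantify. With assumed figures of 2,000 round trips per swarm run and 300 ms of fixed per-call overhead, delay alone adds up to minutes per task:

```python
# Illustrative numbers only: how per-call overhead compounds in a swarm.
exchanges = 2_000     # prompt/response round trips in one swarm run (assumed)
overhead_s = 0.3      # fixed network/queueing overhead per call (assumed)

print(f"{exchanges * overhead_s / 60:.0f} min of pure overhead")     # 10 min
print(f"{exchanges * 0.1 / 60:.1f} min if overhead drops to 100ms")  # 3.3 min
```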
Reinforcement Learning: The Frontier of AI Advancement
Until recently, AI agents struggled to perform complex tasks reliably. However, advancements in hardware and expanded context windows have enabled agents to undertake sophisticated activities, such as generating high-quality software code. It is now estimated that up to 90% of software development in some domains is assisted or performed by AI coding agents.
With agents reaching maturity, reinforcement learning has become the focal point for AI researchers at leading labs such as OpenAI, Anthropic, and Google’s Gemini team. This approach integrates training and inference into a continuous, iterative process, accelerating progress toward artificial general intelligence (AGI).
“Reinforcement learning represents the cutting edge of AI development,” Bercovici stated. “It combines the best practices from both model training and inference, enabling thousands of iterative learning cycles that push the boundaries of what AI can achieve.”
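As a toy illustration of that generate, score, update cycle (emphatically not how any frontier lab trains its models), the snippet below runs REINFORCE on a three-armed bandit: the policy performs inference by sampling an action, a noisy reward plays the evaluator, and the policy is updated in the same loop.

```python
import math
import random

true_rewards = [0.2, 0.5, 0.8]   # hidden payoff of each arm (assumed)
logits = [0.0, 0.0, 0.0]         # the "policy" being learned
lr = 0.1

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for _ in range(2_000):
    probs = softmax(logits)
    arm = random.choices(range(3), weights=probs)[0]   # inference: sample an action
    reward = true_rewards[arm] + random.gauss(0, 0.1)  # scoring: noisy evaluator
    for i in range(3):                                 # training: policy-gradient step
        grad = (1.0 if i == arm else 0.0) - probs[i]
        logits[i] += lr * reward * grad

print([round(p, 2) for p in softmax(logits)])  # probability mass shifts to arm 2
```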
Strategies for Achieving Sustainable AI Profitability
Building profitable AI infrastructure is not a one-size-fits-all endeavor. Depending on organizational goals and resources, some organizations, particularly those developing frontier models, may opt for fully on-premises solutions, while others might prefer cloud-native or hybrid architectures to maintain agility and responsiveness.
“The key metric is unit economics,” Bercovici emphasized. “We are currently experiencing a boom, or arguably a bubble, fueled by subsidized AI economics. However, rising token costs won’t halt usage; instead, organizations will adopt more granular and strategic token consumption.”
Leaders should shift their focus from the cost per token to the economics of entire transactions, where efficiency gains and business impact become clearer. The essential question to ask is: “What is the true cost of my unit economics?”
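A transaction-level view can start as simply as the sketch below, where every figure (blended token price, calls per transaction, value per outcome) is an assumption to be replaced with an organization’s own numbers.

```python
# Per-token vs. per-transaction framing; all numbers are assumptions.
price_per_1m_tokens = 8.00   # blended $ rate across models in the swarm
tokens_per_call = 3_000
calls_per_transaction = 40   # orchestrator + workers + evaluators combined

cost = calls_per_transaction * tokens_per_call / 1e6 * price_per_1m_tokens
value = 4.00                 # assumed business value of one completed transaction

print(f"cost per transaction:   ${cost:.2f}")          # $0.96
print(f"margin per transaction: ${value - cost:.2f}")  # $3.04
```

Framed this way, the question is no longer whether tokens are cheap but whether each transaction clears its cost.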
Ultimately, the future of AI isn’t about limiting usage but about optimizing it: leveraging smarter, more efficient approaches to scale AI capabilities sustainably and profitably.