Forget training, find your killer apps during AI inference

Insights from Pure Storage Leaders on AI Inference in Production and the Critical Role of Scalable Storage Solutions

In today’s AI landscape, most enterprises do not train their own artificial intelligence models. Instead, they focus on deploying AI for production use cases, particularly inference and iterative fine-tuning, with data curation and management forming the backbone of these efforts.

Technologies such as retrieval-augmented generation (RAG), vector databases, and the reuse of AI prompts are becoming essential tools. AI-powered co-pilot features that let users query corporate data in natural language are also gaining traction. These insights were shared by Pure Storage executives at the company’s recent Pure//Accelerate event in London.

Pure Storage’s latest innovations, including the recently introduced Key Value Accelerator, exemplify the company’s commitment to delivering on-demand capabilities that address the evolving needs of AI workloads. These developments highlight the challenges organizations face in the so-called “post-training” phase of AI maturity.

This article delves into the storage requirements that arise during AI production, focusing on continuous data ingestion and inference operations.

Why Investing in GPUs for AI Training May Not Be Wise

Given the high costs and rapid evolution of graphics processing units (GPUs), most organizations opt against building in-house AI training infrastructure. Instead, they prefer leveraging cloud-based GPU resources for training phases. John Colgrove, Pure Storage’s founder and chief visionary officer, emphasizes that attempting to maintain AI training hardware on-premises is impractical, as GPUs quickly become outdated.

Colgrove explains, “Organizations typically plan to depreciate hardware over five to seven years, but GPUs evolve so fast that this model doesn’t apply. Leasing GPU capacity is a smarter approach, akin to leasing a car if you only need it for a short period.”
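
To make that trade-off concrete, here is a back-of-the-envelope comparison in Python. Every figure in it, the purchase price, useful lifetime, utilization, and cloud hourly rate, is a hypothetical assumption for illustration, not a number from Pure Storage.

```python
# Hypothetical buy-vs-lease comparison for GPU capacity.
# All prices and utilization figures are illustrative assumptions.

HOURS_PER_YEAR = 24 * 365

def owned_cost_per_gpu_hour(purchase_price, useful_years, utilization):
    """Effective cost per *used* GPU-hour when buying hardware outright."""
    total_used_hours = HOURS_PER_YEAR * useful_years * utilization
    return purchase_price / total_used_hours

# Assume a $30,000 accelerator that is effectively obsolete after 2 years
# (rather than a traditional 5-7 year depreciation window) and is busy
# only 30% of the time between training runs.
print(owned_cost_per_gpu_hour(30_000, useful_years=2, utilization=0.30))  # ~ $5.71/hour

# Versus leasing a comparable cloud GPU at an assumed $4.00/hour,
# paid only for the hours actually used.
```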

Discovering the AI Application That Transforms Your Business

For many companies, AI’s value emerges not during model development but when it powers transformative applications. Colgrove illustrates this with a financial services example: instead of relying on traditional batch processing to analyze customer data, the true breakthrough lies in real-time inference integrated directly into customer-facing systems.

He notes, “The killer AI application is one that leverages inference on existing data within operational systems, enabling continuous improvement and innovation.”

Fred Lherault, Pure Storage’s EMEA CTO, summarizes this by stating, “The critical question is how to seamlessly connect AI models with curated, AI-ready data stored in architectures optimized for easy access.”

Essential Technologies Enabling Flexible AI Workflows

Inference has become the focal point for many AI adopters, requiring robust data curation and ongoing model iteration. To support this, organizations must establish flexible data connectivity, incorporating technologies like vector databases, RAG pipelines, co-pilot assistants, and prompt caching mechanisms.
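
As a rough illustration of how those pieces fit together, the sketch below wires a toy vector index into a retrieval-augmented generation loop. The bag-of-words "embedding" and the answer_with_llm stub are deliberate stand-ins; a production pipeline would use a real embedding model, a vector database, and an LLM endpoint.

```python
# A minimal retrieval-augmented generation (RAG) loop in plain Python.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Ingest: curate documents and store their vectors (the "vector database").
documents = [
    "Q3 revenue grew 12% driven by subscription services.",
    "The incident postmortem cites a misconfigured load balancer.",
]
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieve: find the stored chunks closest to the user's question.
def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# 3. Generate: pass retrieved context plus the question to the model.
def answer_with_llm(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return prompt  # A real deployment would send this prompt to an LLM.

print(answer_with_llm("What caused the outage?"))
```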

Storage infrastructure plays a pivotal role here. For example, vector databases augment data with searchable vectors, often expanding the dataset size by an order of magnitude. Lherault explains, “If your original dataset is one terabyte, the augmented vector database might require ten terabytes, a scale many organizations are encountering for the first time.”
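
Lherault’s figure can be sanity-checked with simple arithmetic. In the sketch below, the chunk size, embedding dimensions, and overhead factor are illustrative assumptions rather than measurements; under those assumptions, a 1 TB corpus lands close to the tenfold growth he describes.

```python
# Back-of-the-envelope estimate of vector database growth.
# Chunk size, embedding dimensions, and overhead are illustrative assumptions.

def vector_store_bytes(corpus_bytes, chunk_bytes=1_000, dims=1_536,
                       bytes_per_float=4, overhead_factor=1.5):
    """Estimated size of the vector index built over a text corpus."""
    n_chunks = corpus_bytes // chunk_bytes
    raw_vectors = n_chunks * dims * bytes_per_float
    # overhead_factor loosely covers stored chunk text, metadata, and index structures.
    return raw_vectors * overhead_factor

one_tb = 10**12
print(vector_store_bytes(one_tb) / one_tb)  # ~9.2x the original corpus size
```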

Strategies to Manage Growing Storage Demands in AI Environments

AI workflows generate substantial storage demands, especially when snapshotting for rollback during iterative processing. Pure Storage’s Evergreen model offers a scalable, as-a-service approach that enables rapid capacity expansion without disruption.

Moreover, Pure Storage’s Key Value Accelerator enhances performance by offloading to storage the key-value (KV) cache that large language models build during inference, so previously computed tokens can be reused rather than recalculated. Since GPU memory for this cache is limited and recomputation is costly, the company says the technique can accelerate response times by up to 20 times.

Lherault highlights, “When multiple users ask identical questions simultaneously, our solution prevents redundant GPU computations by storing pre-calculated results on storage, optimizing resource utilization.” This approach reduces GPU requirements and significantly speeds up complex query handling.
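
The general idea, independent of any particular product, can be sketched as a response cache keyed on a hash of the normalized prompt. This is a simplified illustration of the caching pattern, not Pure Storage’s Key Value Accelerator, which operates on model KV-cache data rather than whole responses.

```python
# Serve repeated questions from pre-computed results instead of
# re-running them on a GPU. A simplified sketch of the pattern.

import hashlib

cache: dict[str, str] = {}  # In production this would live on shared fast storage.

def cache_key(prompt: str) -> str:
    return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

def run_on_gpu(prompt: str) -> str:
    """Stand-in for an expensive LLM inference call."""
    return f"answer({prompt})"

def answer(prompt: str) -> str:
    key = cache_key(prompt)
    if key not in cache:               # Cache miss: pay the GPU cost once...
        cache[key] = run_on_gpu(prompt)
    return cache[key]                  # ...then identical questions are lookups.

answer("What is our refund policy?")   # computed on the "GPU"
answer("what is our refund policy?")   # served from cache, no recomputation
```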
