It is not enough to throw compute and storage resources at AI workloads. You need the processing power and storage to deliver data at the right rate, but it is just as important to ensure the data used for AI training is of high quality.
This is the message of Par Botes, Vice-President of AI Infrastructure at Pure Storage. We caught up with him last week at the company’s Accelerate event, held in Las Vegas.
Botes stressed the importance, for enterprises tackling AI, of capturing, organising, preparing and aligning data. This is because data can be incomplete or not relevant to the questions AI is trying to answer.
Botes spoke to us about data engineering, data management, data lakehouses, and ensuring datasets are appropriate for an organisation’s AI needs.
What does Pure Storage see as the most important emerging challenges in AI storage?
It’s difficult to create AI systems that can solve problems without a good way to capture and organise data, then prepare it and align it to the processing components – the GPUs [graphics processing units] – so they can access data quickly.
What makes these challenges so difficult?
Let’s start with the obvious: how do I get GPUs to consume the data? GPUs are extremely powerful and consume a huge amount of bandwidth.
It is difficult to feed GPUs at the rate they consume data. That problem is increasingly being solved, especially at the high end, but a typical enterprise will still have to learn new skills and implement new systems.
The problem is not on the science side. It’s on the operations side. These aren’t muscles that enterprises have exercised for long.
The next part of the problem is: how do I prepare my data for analysis? How do I collect it? How do I know I have the right data? How do I evaluate it? How do I track it? How do I check the lineage of a model to ensure it was trained on a particular dataset? How can I be sure it is a complete dataset? This is a very difficult problem.
Does that problem vary depending on the customer or workload?

In one organisation, you might know you have all the information you need, based on the expertise within it. In another situation, you might not be sure you have all the data. It’s hard to say without reasoning about whether you have all the information you need. Let me give you an example.
We spent many years developing a self-driving vehicle – perception networks and driving systems – but we often found that the car did not perform well in certain conditions.
When there were other cars on the road, the car was turning left, and the road was slightly uphill, it struggled. We realised we didn’t have enough training data for those conditions. Outside the high-end training companies, a principled approach to reasoning about data, completeness and range [of data] is not common. Having mathematical ways to analyse whether you have all the data is also not very common.
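A principled approach to completeness can start simply: enumerate the condition axes you care about and count how many samples fall into each combination. A minimal sketch, assuming hypothetical condition labels for a driving dataset – the real taxonomy would come from domain experts:

```python
from collections import Counter
from itertools import product

# Hypothetical condition axes for a driving dataset.
AXES = {
    "traffic": ["none", "light", "heavy"],
    "turn": ["left", "right", "straight"],
    "grade": ["flat", "uphill", "downhill"],
}

def coverage_gaps(samples, min_count=1):
    """Count samples per condition combination and return the
    combinations with fewer than min_count examples - the
    scenarios the model has rarely or never seen."""
    counts = Counter(tuple(s[axis] for axis in AXES) for s in samples)
    return [
        dict(zip(AXES, combo))
        for combo in product(*AXES.values())
        if counts[combo] < min_count
    ]
```

Every combination the function returns – such as heavy traffic, turning left, uphill – is a gap to fill before training, rather than one discovered on the road.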
Given the issues and difficulties that can arise with AI workloads, how would you suggest customers begin to mitigate them?
I would recommend that you think about your data engineering process. We partner with data engineering firms that create things like lakehouses.
Consider: how can I apply a lakehouse to my data? How can I use my lakehouse for cleaning and preparing data? In some cases you may even need to transform it so it is ready for the training system. I would begin by thinking about my data engineering discipline and how to prepare my data to be ready for AI.
If you dig deeper, what is data engineering?
In general, data engineering is about how to get access to other datasets, which can exist in corporate databases, structured systems, or other systems we have. How do I access that data? How do I convert it into an intermediate format I can use to train my model? How do I transform and select data from these sets, which may be spread across different repositories, to create a dataset that represents the data I want to train on?
This is the discipline that we call data engineering. It’s becoming an increasingly distinct skill, and a distinct discipline.
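The steps described here – access a source, convert to an intermediate format, select and clean – can be sketched as a small pipeline. This is a minimal illustration, assuming a hypothetical support_tickets table in SQLite; a real pipeline would use a lakehouse engine rather than a local database:

```python
import json
import sqlite3

def extract_transform(db_path: str, out_path: str) -> int:
    """Pull rows from a (hypothetical) corporate table, clean them,
    and write a JSONL intermediate file ready for a training system.
    Returns the number of records written."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, text, label FROM support_tickets WHERE text IS NOT NULL"
    )
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for rid, text, label in rows:
            text = " ".join(text.split())  # normalise whitespace
            if not text:                   # drop empty records
                continue
            out.write(json.dumps({"id": rid, "text": text, "label": label}) + "\n")
            written += 1
    conn.close()
    return written
```

The intermediate JSONL format is one common convention; the point is that selection and cleaning happen before the data ever reaches the training side.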
How do customers support data lakehouses with storage? What forms does that take?
What’s common today is that you have cloud companies that provide data lakehouses, and vendors that provide lakehouse systems for on-premise.
We work with several of them, partnering with data lakehouse vendors to offer complete solutions.
Then, of course, there is the storage system that makes it all work quickly and efficiently. The key components are the popular data lakehouse databases, the infrastructure underneath them, and the connections from those to other storage systems on the training side.
Is data engineering a one-time challenge, or is it a continuous process as organisations tackle AI?
It’s hard to separate data engineering from storage. Although they are not the same, they are closely related.
As soon as you start using AI, you will want to record any new data, transform it and make it part of your AI system – whether you’re using RAG [retrieval augmented generation], fine-tuning, or, if you’re advanced, building your own model.
It’s always going to keep improving. As your data improves and your insights change, so must your model. Your model must evolve along with your data.
It becomes a continual process.
There are a few things to consider, like lineage. What is the history of this data? Where did it originate? What is consumed where? You need to consider what happens when people use your product, or when you use it internally: what was the question, and what’s the next question?
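Lineage questions like these can only be answered if each dataset carries a record of where it came from. A minimal sketch, with hypothetical field names, that fingerprints a dataset so a model can later be tied back to the exact records it was trained on:

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass
class LineageRecord:
    dataset_id: str    # hypothetical identifier for this dataset version
    source: str        # where the data originated
    transform: str     # what was done to produce it
    parents: list      # dataset_ids this one was derived from
    content_hash: str  # fingerprint of the actual records

def fingerprint(records) -> str:
    """Hash the serialised records so a training run can later verify
    exactly which dataset version a model was trained on."""
    blob = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

A record such as `LineageRecord("tickets-v2", "crm-db", "dedupe+normalise", ["tickets-v1"], fingerprint(data))` stored alongside each training run answers both the origin and the history questions after the fact.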
You want to store that data and use it for future training, as well as for quality assurance. This is what we call a data flywheel for AI. The data is continuously ingested, computed on, and consumed.
This circle never stops.
Do you have any other suggestions for the customer?
You should also think about what this data is and what it represents. If the data represents something you observe or something you do, and it has gaps, the AI will fill them in. When it fills those gaps incorrectly, we call it hallucination.
It is important to know your data so well that you can identify gaps. Can you fill in the gaps if there are any? Once you reach that level of sophistication you have a system you can use.
Even when you begin with the basics – just using a service in the cloud – you should start by recording everything you send and receive. This is the foundation of your data discipline. Where earlier I said data engineering, this discipline is called data management.
You want to begin organising your data as soon as possible. By the time you are ready to do more than just use the service, you will have a first set of data you can store or hand to your data engineers.
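Recording everything you send and receive can be as simple as wrapping the call to the cloud service. A minimal sketch – `call_model` stands in for whichever hypothetical service client you use:

```python
import json
import time

def logged_call(call_model, prompt: str, log_path: str) -> str:
    """Wrap a (hypothetical) cloud model call so every prompt and
    response is appended to a JSONL log for future training and QA."""
    response = call_model(prompt)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({
            "ts": time.time(),     # when the interaction happened
            "prompt": prompt,      # what was sent
            "response": response,  # what came back
        }) + "\n")
    return response
```

The append-only JSONL log is exactly the first dataset the data engineers receive once the organisation is ready to do more than consume a service.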
This is a great insight that I hope everyone will consider.
