I came across an article in the science and technology section of The Economist over the holidays, and it highlighted just how critical data, and more specifically the volume of data, is to the successful outcome of an AI project. For AI projects to be successful, NVIDIA GPU performance must be maximized, which means keeping the GPUs supplied with data. In this case, the Economist team developed an AI training program to replace its human journalists with machine-generated content. While the sentences were structurally correct and the journalistic style was adhered to very precisely, the output was complete garbage. It turns out that the written word is considerably more difficult than image-based training because it adds the dimension of context. An image is an image, no matter what the context, while a word can have an entirely different meaning depending on context. The challenge with speech and text is that model accuracy is highly dependent on context, and context has to be learned from very large data sets using long short-term memory (LSTM) networks. The journalists at The Economist will continue to have jobs until the AI system has sufficient data to produce an intelligent outcome on its own.

Andrew Ng put it best in his lecture notes: “It’s not who has the best algorithm that wins, it’s who has the most data”. Model accuracy depends heavily on the amount of data that is processed through the training system. An autonomous vehicle training system is expected to consume between 200 and 500 petabytes of data[1] to reach a safe driving level.
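
To get a feel for what 200 to 500 petabytes means for a storage system, here is a rough back-of-the-envelope sketch of how long a single pass over the data would take. The throughput figures (10 and 100 GB/s) are illustrative assumptions and do not come from the NVIDIA source, and the function name `days_to_stream` is hypothetical.

```python
PETABYTE = 10**15          # decimal petabyte, in bytes
GIGABYTE = 10**9           # decimal gigabyte, in bytes
SECONDS_PER_DAY = 86_400


def days_to_stream(dataset_pb: float, throughput_gb_per_s: float) -> float:
    """Days needed to read the whole dataset once at a sustained rate."""
    total_bytes = dataset_pb * PETABYTE
    bytes_per_second = throughput_gb_per_s * GIGABYTE
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


# Assumed aggregate read rates; real clusters will differ.
for size_pb in (200, 500):
    for rate_gb_s in (10, 100):
        days = days_to_stream(size_pb, rate_gb_s)
        print(f"{size_pb} PB at {rate_gb_s} GB/s: about {days:.0f} days per pass")
```

Even at an assumed 100 GB/s sustained, one pass over 200 PB takes a little over three weeks, and training typically reads the data more than once.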

The scale of infrastructure required to manage complex learning projects (like driving a car or writing an intelligible technology brief) is beyond the limits of traditional storage systems because it forces two opposing storage architectures together: low-latency, high-performance storage to keep the GPU learning system saturated, and a low-cost, high-capacity tier for data capture and retention. Traditional architectures such as scale-out NAS don’t work for deep learning because they treat data as something that has an active life and is then, at some point, retired to an archive for retention or even thrown away. But in the case of long short-term memory networks, the old data is the very source of learning. So there has to be a way for data to move seamlessly from archive to active and back again while the training set keeps growing.
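
The archive-to-active movement described above can be sketched as a toy two-tier store. This is a minimal illustration under stated assumptions, not how WekaIO or any real product implements tiering: the class name `TieredStore`, its methods, and the least-recently-used eviction policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small hot tier in front of a growing archive."""

    def __init__(self, hot_capacity: int) -> None:
        if hot_capacity < 1:
            raise ValueError("hot_capacity must be at least 1")
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # stands in for flash; ordered oldest-first
        self.archive = {}          # stands in for object storage; never shrinks

    def ingest(self, key: str, data: bytes) -> None:
        """New data is retained in the archive and starts out hot."""
        self.archive[key] = data
        self._promote(key, data)

    def read(self, key: str) -> bytes:
        """Serve from the hot tier, promoting from the archive on a miss."""
        if key in self.hot:
            self.hot.move_to_end(key)      # mark as most recently used
            return self.hot[key]
        data = self.archive[key]           # raises KeyError if never ingested
        self._promote(key, data)
        return data

    def _promote(self, key: str, data: bytes) -> None:
        self.hot[key] = data
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            # Demote the least recently used item; its archive copy remains.
            self.hot.popitem(last=False)


store = TieredStore(hot_capacity=2)
for name in ("a", "b", "c"):
    store.ingest(name, name.encode())
store.read("a")  # "a" was demoted earlier and is promoted back on demand
print(list(store.hot), len(store.archive))
```

The point of the sketch is that demotion never deletes anything: the archive only grows, so old samples can always be pulled back into the fast tier for another training pass.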

New applications like AI training must have storage that seamlessly combines a lightning-fast tier (flash technology) and a cold archive (disk-based object storage) in one architecture for a successful model outcome. Click here to learn more about how WekaIO has reimagined data storage for AI systems, making it the vendor of choice for GPU-intensive compute clusters.

[1] Data source:  Nvidia Blog – Training AI for Self Driving Vehicles:  The Challenge of Scale