Data is the New Source Code

The role of data in today’s business world cannot be overstated. Competitive intelligence is inextricably linked to the speed at which valuable data can be consumed and analyzed to yield important business insights. The need for the efficiency that learning systems provide has driven a massive increase in spending on artificial intelligence (AI) technologies, which in turn help define an organization’s competitive advantage. AI is strategically important for driving enterprise strategies, and every day brings new examples of problems being solved by what is collectively called “Artificial Intelligence”.

As impactful as AI technologies are, industry leaders often rush to capitalize on the latest trends without a fundamental understanding of what AI really is, beginning with the term itself. There is a lot of confusion between artificial intelligence (AI) and machine learning (ML). Many business leaders still treat AI and ML as equivalent terms and use them interchangeably, while others treat them as discrete, parallel advancements. Neither view is correct: machine learning is one application of AI, not a synonym for it or a separate track.

Artificial Intelligence (AI) refers to giving a computer intelligence that (typically) replicates biological intelligence; most such systems remain in research labs and online open source project repos. More formally, artificial intelligence is the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages. Algorithms that reduce human coding and solve industry problems are often the intent behind an “Artificial Intelligence Strategy”, but this actually falls into the area of Machine Learning (ML). Machine learning is the study of computer algorithms that improve automatically through experience.

Machine Learning is at the heart of most organizations’ AI endeavors and forms the skeleton of an AI architecture. A machine learning platform includes algorithms, development tools, APIs, model deployment and more. Machine learning is an application of AI that gives systems the ability to automatically learn and improve from experience without being explicitly programmed.

Machine learning uses two types of techniques: supervised learning, which trains a model on known input and output data so that it can predict future outputs, and unsupervised learning, which finds hidden patterns or intrinsic structures in input data.

Supervised machine learning builds a model that makes predictions based on evidence in the presence of uncertainty. A supervised learning algorithm takes a known set of input data and known responses to the data (output) and trains a model to generate reasonable predictions for the response to new data.
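As a minimal sketch of the supervised workflow described above, the following Python example fits a straight line to known inputs and known responses by ordinary least squares, then predicts the response for an unseen input. The data values and the function names `fit_line` and `predict` are illustrative assumptions, not taken from any particular product or library.

```python
def fit_line(inputs, responses):
    """Train a one-variable linear model by ordinary least squares."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(responses) / n
    # Spread of the inputs, and how inputs and responses vary together.
    sxx = sum((x - mean_x) ** 2 for x in inputs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, responses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Generate a prediction for a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training set: known inputs paired with known responses.
train_inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
train_responses = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(train_inputs, train_responses)
prediction = predict(model, 6.0)  # response estimated for new data
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} prediction={prediction:.2f}")
```

The responses are noisy, so the model does not reproduce any training pair exactly; it generalizes from the evidence, which is the point of the technique.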

Unsupervised learning is used to draw inferences from datasets consisting of training data without labeled responses. Clustering is the most common unsupervised learning technique. It is used for exploratory data analysis to find hidden patterns or groupings in data.
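To illustrate clustering, here is a small sketch of k-means, one common clustering algorithm, written in plain Python. The sample points, the function name `kmeans`, and the naive choice of the first k points as starting centroids are assumptions made for illustration; production libraries use more careful initialization.

```python
import math


def kmeans(points, k, iterations=10):
    """Group unlabeled points into k clusters (naive k-means sketch)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Hypothetical unlabeled data: no responses are supplied, only inputs.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No labels are provided anywhere; the two groupings emerge solely from the distances between the points.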

The end goal of all AI technologies is to grow autonomous capability by learning from enormous datasets. In many ways, data is the new source code.

By all indications, the AI industry is on an upward trajectory with no ceiling in sight, but limiting factors such as data storage and networking bottlenecks must be addressed to ensure the maximum benefit from these technologies.

Most of the industry focuses on the computing aspect of training and inference and pays far less attention to memory and storage. In reality, the best AI and ML solutions have the right combination of computing, memory, and storage. Effective machine learning must enable and streamline the entire workflow. Because that workflow is non-linear (it is not a process that starts, ends, and then moves on to the next iteration), its operations must happen concurrently and continuously, which makes a parallel storage architecture paramount. A strength of a well-designed ML platform is that its infrastructure can grow seamlessly as the data sets grow. Data ingest, training, validation, and inference all happen concurrently and continuously: as data continues to be collected, training occurs, and models are cycled through production.
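The concurrent workflow described above can be sketched with threads and queues. In this toy Python example, ingest, training, and model publication run as three stages at the same time, each handing work to the next as soon as it is ready. The "model" is just a running mean, and all names (`ingest`, `train`, `serve`, `run_pipeline`) are hypothetical; a real pipeline would run on parallel storage and compute rather than in-process queues.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def ingest(raw_q, batches):
    """Stage 1: push incoming data batches downstream as they arrive."""
    for batch in batches:
        raw_q.put(batch)
    raw_q.put(SENTINEL)


def train(raw_q, model_q):
    """Stage 2: update the model (a running mean) after every batch."""
    count, total = 0, 0.0
    while True:
        batch = raw_q.get()
        if batch is SENTINEL:
            model_q.put(SENTINEL)
            break
        count += len(batch)  # assumes batches are non-empty
        total += sum(batch)
        model_q.put(total / count)  # publish a fresh model snapshot


def serve(model_q, published):
    """Stage 3: cycle each new model snapshot into production."""
    while True:
        snapshot = model_q.get()
        if snapshot is SENTINEL:
            break
        published.append(snapshot)


def run_pipeline(batches):
    # Small bounded queues keep all stages busy at the same time.
    raw_q, model_q = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    published = []
    stages = [
        threading.Thread(target=ingest, args=(raw_q, batches)),
        threading.Thread(target=train, args=(raw_q, model_q)),
        threading.Thread(target=serve, args=(model_q, published)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return published


snapshots = run_pipeline([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
print(snapshots)
```

Training begins on the first batch while later batches are still being ingested, and each updated model is published without waiting for the stream to finish.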

Advances in computing power, the sheer volume of data now available online, and improved artificial intelligence algorithms have finally made AI practical. The primary components of AI consume extraordinary amounts of data, with limitless combinations of modifications to data parameters and samples in data sets. These applications pose exceptional challenges and put significant strain on computing, storage, and network resources.

AI applications are dependent on source data; you must know where the source data resides and how the application uses it in order to properly estimate your storage needs. An equally important consideration for artificial intelligence data storage is the volume of data that the application will produce. One of the defining characteristics of AI is that applications can make better decisions as they are exposed to more data. The application’s database will grow over time, so you must monitor how quickly it grows and perform capacity planning accordingly.
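As a rough illustration of the capacity planning described above, the following Python sketch estimates how many days remain before a storage volume fills, assuming the database grows at a constant average rate. The daily measurements, the 1,000 GB capacity, and the function name `days_until_full` are hypothetical; real growth is rarely perfectly linear, so treat the result as a first approximation.

```python
def days_until_full(daily_sizes_gb, capacity_gb):
    """Estimate days until capacity is reached, assuming linear growth."""
    if len(daily_sizes_gb) < 2:
        raise ValueError("need at least two measurements to estimate growth")
    # Average growth per day across the observation window.
    elapsed_days = len(daily_sizes_gb) - 1
    growth_per_day = (daily_sizes_gb[-1] - daily_sizes_gb[0]) / elapsed_days
    if growth_per_day <= 0:
        return None  # not growing, so no fill date can be projected
    remaining_gb = max(capacity_gb - daily_sizes_gb[-1], 0)
    return remaining_gb / growth_per_day


# Hypothetical monitoring data: database size in GB, sampled once per day.
observed_sizes = [500, 520, 540, 560, 580]
days_left = days_until_full(observed_sizes, capacity_gb=1000)
print(f"Estimated days until full: {days_left}")
```

Rerunning the estimate as new measurements arrive shows whether growth is accelerating and whether the plan needs to be revised.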

Data is not just a component of AI; rather, the two are intertwined. To deploy a successful AI solution, you must understand the connection between the data you’re compiling and the problem you are striving to resolve. With the help of storage solutions fully optimized for AI and ML training, industry leaders are able to focus their complete attention on what matters most: transforming valuable data assets into important insights with unparalleled velocity and accuracy.


by Callie Guenther | CyberSOC Data Scientist, CRITICALSTART
February 19, 2019