Demystifying the Modern AI Stack
The rise of artificial intelligence (AI) means that it is more important than ever for developers and engineers to deploy AI projects quickly and at scale across an organization. At the same time, there has been a boom in AI tools and services designed for different purposes, making it challenging to evaluate them all in such a rapidly evolving environment.
To achieve this fast and efficient deployment of AI projects, it is crucial for your organization to possess the “modern AI stack,” which is a collection of tools, services, and processes implemented with MLOps practices.
The modern AI stack enables developers and operations teams to construct machine learning (ML) pipelines efficiently, improving utilization, end-user experience, team collaboration, maintenance activities, and more.
>>> Before reading this piece on the modern AI stack, make sure to check out my previous article on “What is MLOps?.”
The modern AI stack can first be broken down into three distinct phases:
Data Management
Model Training/Evaluation
Deployment
Let’s take a look at each one of these phases and what they entail.
Phase 1: Data Management
The first phase of the modern AI stack is data management, which includes data gathering, data transformation, data processing, data versioning, and data monitoring.
Data Gathering
When it comes to data gathering, which is key to having usable data, the process often relies on third-party tools and services that can be integrated into your company’s internal tools.
There are a few key components to data gathering:
Data Collection: Involves web scraping, sifting through databases, and complex queries for extraction. Datasets can also be directly sourced from various third-party services and sites (OpenML, Kaggle, Amazon Datasets).
Data Labeling: After data have been collected, they must be processed and annotated to enable machines to learn from them. The process of data labeling has traditionally been manual, but new tools automate the process and make it easier to scale very quickly. With that said, there are many cases where manual labeling is still required, especially when algorithms are prone to missing specific features.
Synthetic Data: The past few years have seen an explosion of synthetic data, which is crucial when there is not enough available data for specific use cases. There are other cases where synthetic data is preferred, such as use cases that require high levels of privacy and anonymity. There are various tools and libraries for generating synthetic data, and they can be used to generate various types, such as images, text, and tables. Some of the most popular include TensorFlow, OpenCV, and scikit-learn.
Data Transformation and Storage
Another important aspect of data management is data transformation and storage.
One of the most effective data storage solutions an AI-driven business can embrace is specialized and dedicated cloud storage. AI data is often unstructured, and 80 to 90 percent of the total data generated and collected by organizations is unstructured. Unstructured data is not arranged according to a pre-set data model or schema, meaning it cannot be stored in traditional databases.
Unstructured data is highly valuable for AI-driven businesses, containing a wealth of information that can help guide business decisions. But unstructured data has traditionally proven difficult to analyze; AI and machine learning tools are now turning this around.
Data storage requires systems that can support a variable volume of data over the long term. To leverage all of this unstructured data for AI and machine learning, the most successful AI-driven organizations often opt for a cloud-based storage solution like a data lake or data warehouse.
Data Lake: A data lake is a centralized repository where your organization can store all of its structured, semi-structured, and unstructured data in its raw form. This enables various types of analysis, from dashboards and visualizations to real-time analytics and machine learning.
Data Warehouse: A data warehouse stores processed and structured data to support specific business intelligence and analytics needs. Data from a data warehouse is ready to be used to support historical analysis and reporting to improve your organization’s decision making.
Data transformation is usually carried out through the ETL method (Extract, Transform, Load). The ETL method is especially useful when processed data matters more than preserving the raw data, and it involves extracting data to a temporary staging location, transforming it there, and loading it into the target location.
Some of the most popular ETL tools on the market come from big names like Oracle and IBM, alongside open-source options such as Singer and Pentaho.
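To make the ETL flow concrete, here is a minimal pure-Python sketch (the data and field names are made up for illustration; a real pipeline would extract from a database or API and load into a warehouse or lake):

```python
import csv
import io

# Hypothetical raw export (the extract step); in practice this would
# come from a database query, an API call, or a file drop.
raw = io.StringIO("name,revenue\nAcme, 1200 \nGlobex,950\n")

def etl(source):
    """Extract rows, transform the fields, and return load-ready records."""
    records = []
    for row in csv.DictReader(source):
        records.append({
            "name": row["name"].strip().upper(),     # transform: normalize text
            "revenue": int(row["revenue"].strip()),  # transform: cast to a number
        })
    return records  # the load step would write these to the target store

records = etl(raw)
print(records)  # [{'name': 'ACME', 'revenue': 1200}, {'name': 'GLOBEX', 'revenue': 950}]
```

The staging location here is just an in-memory list; the point is the extract-transform-load shape, not the storage backend.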
Data Processing
The third main aspect of data management is data processing, which involves converting raw data into useful data that can be analyzed by the model. Raw inputs are first converted to numbers, vectors, embeddings, and other forms before being used for model consumption.
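As a toy illustration of that conversion, here is a bag-of-words featurizer that turns raw text into a numeric vector (the vocabulary is a made-up example; real pipelines typically use library tokenizers or learned embeddings):

```python
import re
from collections import Counter

def vectorize(text, vocabulary):
    """Turn a raw string into a numeric vector a model can consume."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))  # crude tokenization
    return [counts[word] for word in vocabulary]           # one slot per vocab word

vocab = ["cat", "dog", "fish"]  # hypothetical vocabulary for illustration
vec = vectorize("Dog bites dog, cat watches", vocab)
print(vec)  # [1, 2, 0]
```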
Data processing can involve a few key steps and processes, and one of the most important for AI-driven businesses is exploratory data analysis (EDA).
EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics. It enables the manipulation of data sources so data scientists can identify patterns, discover anomalies, test hypotheses, and check assumptions. This set of techniques provides a deep understanding of data set variables and their relationships, and it also helps determine if the statistical techniques being considered for data analysis are the right ones.
EDA tools enable your organization to perform specific statistical functions and techniques, such as clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data with many variables. EDA tools also enable the mapping and understanding of interactions between different data fields, the assessment of relationships between variables in a dataset and the target variable, predictive models that use statistics and data to predict outcomes, and much more.
Top EDA tools on the market include Polymer Search, Pandas Profiling, Trifacta, and IBM Cognos Analytics.
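The core of an EDA pass, summarizing a feature's main characteristics and spotting anomalies, can be sketched in a few lines (the dataset is invented; the tools above do this at far greater depth):

```python
import statistics

def summarize(name, values):
    """A tiny EDA pass: main characteristics plus a simple anomaly check."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    # Flag points more than two standard deviations from the mean.
    outliers = [v for v in values if abs(v - mean) > 2 * stdev]
    return {"feature": name, "min": min(values), "max": max(values),
            "mean": round(mean, 2), "stdev": round(stdev, 2),
            "outliers": outliers}

ages = [34, 36, 35, 33, 37, 35, 36, 34, 120]  # 120 looks suspicious
report = summarize("age", ages)
print(report)  # 120 is flagged as an outlier
```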
Data Versioning
What happens when the data possessed by databases, data warehouses, and data lakes only represent the current state of the world? That’s when it’s time to look at data versioning.
Data versioning is the storage of different versions of data that were created or changed over time. By versioning data, you can save changes to a file or a specific data row in a database, with each version and change preserved.
A good data version control tool allows you to have unified data sets with a repository of all your experiments, and it enables collaboration between all team members so everyone can track changes in real-time.
The best data versioning tools that can be used to improve the ML workflow include Neptune, DVC, Git LFS, and Pachyderm.
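The idea underlying these tools can be illustrated with a toy content-hash versioner (a deliberate simplification; real tools like DVC also handle large files, remote storage, and pipeline lineage):

```python
import hashlib
import json

class DataVersioner:
    """Toy illustration of data versioning: each committed state of a
    dataset gets a content hash, so versions can be identified and compared."""
    def __init__(self):
        self.versions = []  # ordered history of (hash, snapshot) pairs

    def commit(self, records):
        snapshot = json.dumps(records, sort_keys=True)  # canonical form
        digest = hashlib.sha256(snapshot.encode()).hexdigest()[:12]
        self.versions.append((digest, snapshot))
        return digest

dv = DataVersioner()
v1 = dv.commit([{"id": 1, "label": "cat"}])
v2 = dv.commit([{"id": 1, "label": "dog"}])  # a changed label yields a new version
print(v1 != v2, len(dv.versions))  # True 2
```

Because identical content always hashes to the same digest, team members can tell at a glance whether they are experimenting on the same version of a dataset.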
Data Monitoring
The last aspect of the data management phase is data monitoring, which ensures bad data does not pass through your models. Because it is extremely time consuming and resource intensive to maintain the quality of large-scale data, many organizations turn to automated monitoring as a top MLOps practice.
Automated monitoring tools can ensure quality by looking for issues related to missing values, incompatible data types, or data anomalies. There are also traffic monitors that track the volume of both incoming and outgoing data.
Top data monitoring tools on the market include Censius, Fiddler, Grafana, and Datadog. Major cloud platforms, such as Azure and AWS, also include monitoring capabilities.
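A minimal sketch of the kinds of checks such monitors automate, missing values and incompatible types, might look like this (the schema and batch are invented for illustration):

```python
def check_batch(rows, schema):
    """Flag missing values and incompatible types before data reaches a model."""
    issues = []
    for i, row in enumerate(rows):
        for field, expected_type in schema.items():
            value = row.get(field)
            if value is None:
                issues.append((i, field, "missing value"))
            elif not isinstance(value, expected_type):
                issues.append((i, field, "incompatible type"))
    return issues

schema = {"age": int, "country": str}  # hypothetical expected schema
batch = [{"age": 31, "country": "DE"},
         {"age": None, "country": "US"},
         {"age": "forty", "country": "FR"}]
problems = check_batch(batch, schema)
print(problems)  # [(1, 'age', 'missing value'), (2, 'age', 'incompatible type')]
```

Production monitors add anomaly detection and traffic-volume tracking on top of checks like these, and raise alerts instead of returning a list.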
Phase 2: Model Training/Evaluation
The second phase of the modern AI stack involves model training and evaluation, which is closely related to the data management stage as there is often a back-and-forth to achieve the best results. With that said, model building begins once the main aspects of the data management stage have been carried out, such as collection, storage, analysis, and transformation.
Selection of Algorithms
Perhaps the most important aspect of model building your business will have to focus on is selecting the algorithms your AI/ML/DL applications will use for classification or prediction. But various other considerations, like computation, security, and environment, often play a major role in algorithm selection.
Different machine learning algorithms look for different trends and patterns, meaning one algorithm might not be the best choice across all use cases. Because of this, it is important to conduct various experiments, evaluate machine learning algorithms, and tune their hyperparameters to find the best solution.
Since machine learning algorithms learn from examples, the more good data you have and the more examples you provide, the better the model is at identifying patterns. With that said, you must be careful with overfitting, which happens when a model can accurately make predictions for data that it was trained on but is unable to generalize to other data.
Overfitting is a major pitfall with AI systems. When ML algorithms are created, they rely on a sample dataset to train the model. But if the model trains for too long on this sample data, or if the model is too complex, it risks learning irrelevant information, or “noise” within the dataset.
The risk is that the model memorizes this irrelevant information and becomes unable to generalize to new data, at which point it is overfitted, in other words, too specific to the training set. In this unfortunate case, the model cannot perform the tasks it was designed to carry out. There are a few key indicators of overfitting, the most prominent being low error rates on the training data combined with high variance (high error) on new data.
Before we get into ways to avoid overfitting, you should be aware of the opposite problem of underfitting. This problem occurs when the training process is paused too early or too many important features are excluded when attempting to prevent overfitting. Underfitting means the model has not trained for enough time or the input variables are not significant enough, which leads to the inability to determine a meaningful relationship between the input and output variables.
Now back to the issue of overfitting…
To detect and avoid overfitting, you need to split your data into training data, validation data, and test data.
IBM provides a good set of definitions for these data:
Training Data: “This data is used to train the model and to fit the model parameters. It accounts for the largest proportion of data because you want the model to see as many examples as possible.”
Validation Data: “This data is used to fit hyperparameters and for feature selection. Although the model never sees this data during training, by selecting particular features or hyperparameters based on this data, you introduce bias and risk overfitting.”
Test Data: “This data is used to evaluate and compare your tuned models. Because this data wasn’t seen during training or tuning, it can provide insight into whether your models generalize well to unseen data.”
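A simple way to produce these three splits is to shuffle the data and slice it. Here is a pure-Python sketch (the 60/20/20 ratio is a common but not universal choice; libraries like scikit-learn provide richer splitting utilities):

```python
import random

def split_dataset(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Split data into training, validation, and test sets (the remainder)."""
    data = data[:]
    random.Random(seed).shuffle(data)  # shuffle so each split is representative
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    return (data[:n_train],                    # training: largest share
            data[n_train:n_train + n_val],     # validation: tuning and feature selection
            data[n_train + n_val:])            # test: final, unseen evaluation

examples = list(range(100))  # stand-ins for labeled examples
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))  # 60 20 20
```

Fixing the seed keeps the split reproducible, which matters when you later compare tuned models against each other.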
It’s crucial that you choose your algorithms based on your use case, which helps narrow the scope.
Some of the most popular methods and algorithms include:
Regression: Supervised ML techniques that predict continuous numerical values and require labeled training examples.
Classification: Supervised ML techniques that predict which category the input data belongs to.
Clustering: Unsupervised ML techniques that divide data into groups where points in the groups possess similar traits.
Recommendation Engines: Can predict a preference or rating to indicate a user’s interest in an item or product by identifying similarities between the users, the items, or both.
Anomaly Detection: Technique that identifies unusual events or patterns, with these items being identified as “anomalies” or “outliers.”
There are various ML libraries that offer advantages in terms of customization, flexibility, speed, community support, and more. After choosing a library, you can begin model-building activities like selection and tuning. The top ML libraries include TensorFlow, scikit-learn, Keras, and PyTorch. If you are looking at natural language processing (NLP) applications, check out my article on “NLP: Python Tools and Libraries.”
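As a concrete taste of classification without any of these libraries, here is a tiny one-dimensional nearest-neighbor classifier, with the hyperparameter k tuned on a validation set as described above (all data is invented for illustration):

```python
def knn_predict(train, query, k):
    """1-D k-nearest-neighbors: vote among the k closest labeled points."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)

# Hypothetical labeled examples: (feature value, class label).
train = [(1.0, "small"), (1.2, "small"), (3.9, "large"), (4.2, "large")]
val = [(1.1, "small"), (4.0, "large")]

# Tune k by scoring candidate values on the validation set.
best_k = max([1, 3], key=lambda k: sum(knn_predict(train, x, k) == y
                                       for x, y in val))
print(knn_predict(train, 1.1, best_k), knn_predict(train, 4.0, best_k))  # small large
```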
Experiment Tracking and Performance Evaluation
Machine learning is a complicated process, which is why it requires multiple experiments involving data, models, feature combinations, and resources. These experiments need to be reproducible so the top results can be re-traced and deployed.
Several tools support experiment tracking and metadata logging to help you build and maintain reproducible experiments, and many of these tools offer collaborative ML to help scale and ensure collaboration between ML teams and projects.
The top experiment tracking tools on the market include MLflow, Neptune, and Weights & Biases.
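The essence of what these tools record, parameters and metrics per run so the best result can be re-traced, can be sketched as follows (a deliberately minimal stand-in, not the API of any of the tools above):

```python
import time

class ExperimentTracker:
    """Minimal sketch of experiment tracking: log the parameters and
    metrics of each run so results stay reproducible and comparable."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        self.runs.append({"params": params, "metrics": metrics,
                          "timestamp": time.time()})

    def best_run(self, metric):
        """Re-trace the top result for a given metric."""
        return max(self.runs, key=lambda run: run["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.1, "depth": 3}, {"accuracy": 0.81})
tracker.log_run({"lr": 0.01, "depth": 5}, {"accuracy": 0.87})
print(tracker.best_run("accuracy")["params"])  # {'lr': 0.01, 'depth': 5}
```

Real tracking tools add persistent storage, artifact logging, and collaboration on top of this core idea.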
When analyzing performance, you need to compare and monitor results across experiments and data segments. Monitoring tools make this process easier: they can track multiple experiments and comparisons automatically, triggering alerts when pre-configured conditions are met.
The best tools support both standard and custom metrics, which is important because different use cases rely on specific custom-defined indicators.
The top monitoring tools on the market include Comet, Censius, and Evidently AI.
Phase 3: Deployment
The last main phase of the modern AI stack is deployment, which is one of the most critical aspects of the entire ML lifecycle. After you have trained, tuned, and evaluated your model, it is ready to be deployed into production. By automating a portion of the deployment pipeline, small but important details can be addressed at scale.
Model Serving
The developed machine learning solution needs to be hosted on premises, in the public cloud, or in a private cloud. Whichever you decide to go with, it is important that the solution is easily accessible to integrated applications and end users. Businesses cannot offer AI products to a large user base unless the product is accessible.
Model serving is not the easiest task, and it can have a big monetary impact on business operations. But there are many great tools on the market that deploy machine-learning models in secure environments at scale. They allow API management and scalability, multi-model serving, cloud accessibility, collaboration, and more.
Some of the best model serving tools are TensorFlow Serving, Amazon’s ML API, Amazon SageMaker, and Azure Machine Learning.
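Stripped of infrastructure, serving boils down to a handler that parses a request, runs inference, and returns a response. Here is a sketch with a dummy scoring function standing in for a real serialized model (in production this handler would sit behind an HTTP framework or one of the serving tools above):

```python
import json

def model_predict(features):
    """Dummy model standing in for a real one loaded at startup."""
    return "large" if features["size"] > 2.5 else "small"

def handle_request(body):
    """A model-serving handler: parse the request, run inference,
    and return a JSON response."""
    features = json.loads(body)
    prediction = model_predict(features)
    return json.dumps({"prediction": prediction})

print(handle_request('{"size": 3.1}'))  # {"prediction": "large"}
```

The serving tools listed above wrap exactly this kind of handler with API management, scaling, and multi-model support.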
Virtual Machines and Containers
Easily-managed model training, model selection, and the deployment phases require isolated environments and experiments, which can be achieved through virtual machines and containers:
Containers: Containerization refers to isolating the environment to ensure clean experiments that don’t negatively impact or alter other experiments. It also enables operations like A/B testing and lets developers manage development and deployment activities. Some of the best tools for deploying independent containerized microenvironments include Kubernetes and Docker, while automation tools like Kubeflow and Flyte help optimize and manage workflows.
Virtual Machines (VMs): The main difference separating virtual machines from containers is that the former allows for virtualization of all layers of the ML pipeline, including the hardware layers, while containers only work with the software layers. Virtual machines are often chosen when there is a need to run multiple applications with different OS requirements. Some of the world’s top tech names offer virtual machine services, such as Google, Oracle, and Azure. There are also more complex compute solutions, like VM Scale Sets and Functions on Azure and their equivalents on AWS and other cloud systems, which enable you to create and manage a group of load-balanced VMs. These more complex solutions offer a myriad of benefits, such as making it easier to create and manage multiple VMs, allowing your application to scale automatically as resource demand changes, and enabling work at large scale.
Model Monitoring
Once your models are deployed to production, you will need to begin looking for model monitoring tools. In the case that things stop working, and you have no model monitoring set up, you have no insight into what is wrong and where to look for problems and solutions.
Depending on what you want to monitor, the needs will be different. However, there are some constant things that you should consider when looking for a model monitoring tool, such as ease of integration, flexibility and expressiveness, overhead, monitoring functionality, and alerting.
There are many great model monitoring tools on the market, such as Arize, WhyLabs, and Evidently AI.
One of the top model monitoring solutions is offered by Neptune, which lets you do things like monitor your training and validation losses and take a look at the GPU consumption. With Neptune, you can log and display nearly any ML metadata like metrics and losses, prediction images, hardware metrics, and interactive visualizations. All of this is achieved through the solution’s flexible metadata structure, dashboard building capabilities, metrics comparison, and 25+ integrations with ML ecosystem tools.
On a more general note, monitoring ML models is used for model training, evaluation, and testing, as well as hardware metrics display. But it can also be used to log performance metrics from production jobs or view metadata from ML CI/CD pipelines.
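One common monitoring check is data drift: production inputs wandering away from the distribution the model was trained on. A minimal sketch (the two-standard-deviation threshold is an illustrative choice, not a standard):

```python
import statistics

def drift_check(training_values, production_values, threshold=2.0):
    """Flag drift when the production mean of a feature sits far from the
    training mean, measured in training standard deviations."""
    mean = statistics.mean(training_values)
    stdev = statistics.stdev(training_values)
    shift = abs(statistics.mean(production_values) - mean) / stdev
    return shift > threshold

train_ages = [30, 32, 35, 31, 33, 34]
prod_ages = [55, 58, 60, 57]  # a much older population than seen in training
print(drift_check(train_ages, prod_ages))  # True
```

A real monitoring tool would run checks like this continuously per feature and raise an alert, rather than returning a boolean on demand.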
Bringing the Stack Together
The modern AI stack is key to any successful AI-driven organization. It can seem overwhelming at first, but don’t let it intimidate you. Resources like this article, and a well-rounded AI team, can help you demystify the stack and break down each phase. Once you have a deep understanding of each phase and process, you can bring all of them together to equip your organization with a set of solutions that open up brand new opportunities for the company.
One important note is that while you are reading this guide to the modern AI stack today, it will likely be different tomorrow. There are a plethora of tools available on the market, and as each day passes, more come down the pipeline.
When considering a new tool for your organization’s AI stack, it should meet a few requirements:
The tool should be easy to adopt and use.
The tool should be customizable and enable additional features.
The investment in the tool should be lower than the cost of building its functionality from scratch.
If a tool doesn’t meet all of these requirements, then it might lead to more problems rather than solutions. The key is to be patient when building your AI stack and look for cost-effective and scalable solutions. If you do that, and your team has a deep understanding of the different phases of the AI stack, you set your organization up with the power to leverage data with AI.
From this article, it should be evident that building an enterprise AI system requires much more than just AI algorithms and data, and encompasses the construction of data pipelines, services, APIs, cloud infrastructure, and telemetry systems that require specialized knowledge and skills. The mastery of the AI stack is key to achieving an organization’s business goals with AI, so being proficient in the issues surrounding AI systems is critical to one’s success.
>>> Keep a lookout for the next part of this series where I’ll explore the human component of the modern AI stack.
Follow on Twitter, LinkedIn, and Instagram for AI-related content.