MLOps & Machine Learning Pipeline Explained
Lynn Orlando. February 22, 2021
What is MLOps?
MLOps is a compound term that combines “machine learning” and “operations.” Its role is to provide a communication conduit between the data scientists who work with machine learning data and the operations team that manages the project. To do so, MLOps applies the cloud-native practices used in DevOps, specifically continuous integration/continuous deployment (CI/CD), to machine learning (ML) services. Although both ML services and normal cloud-native apps ultimately ship as software, there is more to an ML service than just code. While cloud-native apps require source version control, automated unit/load testing, A/B testing, and final deployment, MLOps adds a data pipeline, ML model training, and a more complex deployment with special-purpose logging and monitoring capabilities. On top of all that, it introduces a new step, continuous training (CT), that extends the DevOps process into CI/CD/CT.
Data Pipeline and ML Model Training
One significant difference between DevOps and MLOps is that ML services require data–and lots of it. In order to be suitable for ML model training, most data has to be cleaned, verified, and tagged. Much of this can be done in a stepwise fashion, as a data pipeline, where unclean data enters the pipeline, and then the training, validating, and testing data exits the pipeline.
Data cleaning can be as simple as converting data formats into CSV files, resizing images to different dimensions/resolutions, filtering HTML out of text files, etc. Much of this can be automated.
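As a rough sketch of what such a cleaning step might look like (the HTML-stripping and CSV conversion here are illustrative choices, not a prescribed toolchain):

```python
import csv
import io
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content of an HTML document, discarding tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw):
    """Filter HTML markup out of a raw text record."""
    parser = TextExtractor()
    parser.feed(raw)
    return " ".join(p.strip() for p in parser.parts if p.strip())

def to_csv_rows(records, fieldnames):
    """Serialize cleaned records into CSV text, ready for the next pipeline stage."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# A hypothetical raw record; the empty label is filled in later, at tagging time.
raw = "<p>Great <b>product</b>, fast shipping!</p>"
record = {"text": strip_html(raw), "label": ""}
print(to_csv_rows([record], ["text", "label"]))
```

In a real pipeline each of these functions would be one automated stage, with unclean records flowing in one end and clean CSV rows flowing out the other.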
However, the biggest challenge with ML training data is tagging it, i.e., labeling each item with the prediction/classification/inference the model should produce for it. This task often involves a manual process, where someone has to look at a picture and classify it into one of various categories, read a text file to determine its sentiment, view machine parts on an assembly-line video feed and identify defects, etc. Some of this can be automated, but most of it cannot.
After you have cleaned, verified, and tagged the data, you can start ML model training to classify/predict/infer whatever you want the model to do. The tagged data is split into training, validation, and hold-out testing datasets. The training and validation data is used multiple times to optimize the model architecture and hyperparameters. When that is done, you make one final pass with the hold-out test data to see whether the model works well enough on fresh data to deploy.
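The split itself is easy to sketch; the fractions and seed below are arbitrary illustrative choices:

```python
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle tagged examples and split them into train/validation/hold-out test sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]       # hold-out set: touched only once, for the final pass
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical tagged examples: (input, label) pairs.
data = [(f"example-{i}", i % 2) for i in range(100)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 80 10 10
```

The key discipline is in the comments: the training and validation sets may be reused across many runs, but the hold-out test set is consulted exactly once, at the end.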
Continuous Integration for MLOps
At this point you have to save or checkpoint the whole package: the original data, the data-pipeline automation processes/procedures, the cleaned/tagged training, validation, and hold-out test datasets, all the model runs with their results, and the final model architecture/hyperparameters/neural-net weights. All of this must be saved so that you can reproduce any and all steps that led up to the ML model, if necessary, and so that this model can be used as a starting point for the next round of training (see below).
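One simple way to sketch such a checkpoint is a manifest that records a content hash for every artifact, so a later run can verify it is starting from exactly the same inputs. The artifact names below are hypothetical stand-ins:

```python
import hashlib
import json

def checkpoint_manifest(artifacts, version):
    """Record a SHA-256 content hash for every artifact so the run can be reproduced."""
    entries = {
        name: hashlib.sha256(blob).hexdigest()
        for name, blob in artifacts.items()
    }
    return json.dumps({"version": version, "artifacts": entries},
                      indent=2, sort_keys=True)

# Hypothetical artifact names and contents; in practice these would be real files.
artifacts = {
    "train.csv": b"cleaned,tagged,rows",
    "hyperparams.json": b'{"learning_rate": 0.001}',
    "model_weights.bin": b"\x00\x01\x02",
}
print(checkpoint_manifest(artifacts, "model-v1"))
```

A manifest like this is deterministic: hashing the same artifacts always yields the same document, which is what makes "reproduce any and all steps" checkable rather than aspirational.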
To proceed to deployment, the model may need to be transformed into something that can run in the target environment. This may require re-packaging or converting the model from one framework to another (e.g., TensorFlow to TFLite) for edge/IoT deployments. Any transformation of the model for a target environment also needs to be assessed for service accuracy/loss, and the transformed model must be saved/checkpointed as well. Further, the ML model may need to be wrapped in a RESTful API that triggers the ML service prior to deployment.
In addition, there are still more steps needed prior to deployment:
- Training data needs to be mathematically characterized, to create a statistical picture or characterization of the data used in training that can be compared against incoming data after deployment. This same characterization must be performed against incoming data and, as such, must be coded, tested, and deployed along with the ML service. This way one can detect data skew, a situation in which the data used in service differs from the data used in training.
- If the incoming ML service data needs to be transformed/cleaned before model inferencing can take place, that data transformation process needs to be designed, coded, tested, and deployed along with the ML service as a sort of mini-data pipeline that runs as a prelude to each ML service request once in operation.
- ML service logging is more involved than typical service deployments. You need to log all incoming data, all predictions made for that data, as well as all service performance metrics. All this logging information needs to be permanently saved. This logging activity must be coded, tested, and deployed along with the ML service.
- ML services need to be monitored not only for data skew but also for model drift, a situation in which the model's prediction power is altered due to environmental changes. Both of these checks can be automated in some cases, and any such automatic monitoring will need to be coded, tested, and deployed along with the ML service.
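The data-characterization idea in the first bullet can be sketched as a per-feature mean/stdev profile of the training data, compared against each incoming batch. This is one of many possible statistical pictures, and the z-score threshold is an arbitrary choice:

```python
import statistics

def characterize(rows):
    """Build a statistical picture of the training data: (mean, stdev) per feature."""
    return [(statistics.fmean(col), statistics.stdev(col)) for col in zip(*rows)]

def skew_alerts(profile, incoming, z_threshold=3.0):
    """Flag features whose incoming batch mean strays too far from the training mean."""
    alerts = []
    for i, (col, (mu, sigma)) in enumerate(zip(zip(*incoming), profile)):
        batch_mean = statistics.fmean(col)
        # z-score of the batch mean; assumes sigma > 0 for every training feature
        z = abs(batch_mean - mu) / (sigma / len(col) ** 0.5)
        if z > z_threshold:
            alerts.append((i, mu, batch_mean))
    return alerts

# Hypothetical training data: feature 0 hovers around 2, feature 1 around 103.
train = [[float(i % 5), 100.0 + (i % 7)] for i in range(50)]
profile = characterize(train)
print(skew_alerts(profile, [[2.0, 103.0]] * 20))  # [] -- no skew
print(skew_alerts(profile, [[2.0, 130.0]] * 20))  # feature 1 flagged
```

The profile is computed once at training time; the comparison runs continuously in production, which is why it has to be coded, tested, and deployed alongside the service itself.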
Continuous Deployment for MLOps
ML service unit testing is also more complex than for typical deployments. ML services need to be tested to see whether they can meet whatever service-level agreements (SLAs) are required for the service, such as prediction latency, throughput, dependability, etc. Moreover, the mini-data pipeline, if any, and the logging and monitoring automation must also be tested and validated against performance requirements.
At this point, you can deploy your ML service in an A/B fashion to see if it works better than what you already have running. Once you are confident that it works well enough, it can be automatically deployed.
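One common way to implement A/B routing (a hypothetical sketch, not a prescribed method) is deterministic hash-based bucketing, so a given caller consistently lands on the same model version:

```python
import hashlib

def choose_model(request_key, challenger_share=0.1):
    """Deterministically bucket requests: a fixed share goes to the new (B) model."""
    digest = hashlib.md5(request_key.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "model_b" if bucket < challenger_share else "model_a"

# Simulate routing 10,000 hypothetical requests.
counts = {"model_a": 0, "model_b": 0}
for i in range(10000):
    counts[choose_model(f"request-{i}")] += 1
print(counts)  # roughly a 90/10 split
```

Hashing the request key rather than drawing a random number keeps the assignment stable across retries and replicas, which makes the A-versus-B comparison of logged metrics cleaner.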
However, model drift detection can be difficult to automate in many cases. Sometimes you will need to resort to a semi-manual process, where you extract some incoming data, tag it (manually or semi-manually), and check whether or not the ML service classification matches your manual tag for the data. With enough of these comparisons, you should be able to calculate in-operation loss/accuracy statistics for the ML service to compare against the ML model's metrics and detect model drift. You can also include this manually tagged data in the next training dataset.
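The comparison described above might be sketched as follows; the 5% tolerance is an arbitrary assumption:

```python
def in_operation_accuracy(logged):
    """Compare service predictions against manually assigned tags for sampled requests."""
    matches = sum(1 for prediction, manual_tag in logged if prediction == manual_tag)
    return matches / len(logged)

def drift_detected(operation_acc, training_acc, tolerance=0.05):
    """Flag drift when live accuracy falls meaningfully below hold-out test accuracy."""
    return operation_acc < training_acc - tolerance

# Hypothetical (service prediction, manual tag) pairs sampled from the logs.
sampled = [("cat", "cat"), ("dog", "cat"), ("cat", "cat"), ("dog", "dog")]
acc = in_operation_accuracy(sampled)          # 0.75
print(drift_detected(acc, training_acc=0.92)) # True: 0.75 < 0.92 - 0.05
```

Because the manually tagged samples are exactly the kind of data the model sees in production, they double as fresh, properly tagged training data for the next retraining pass.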
Model drift can occur due to data skew, but it could just as easily be due to some real-world shift in what’s happening, e.g., fashions change over time, new vocabulary enters mainstream use, new parts may differ from old parts, etc.
Continuous Training in MLOps
Whenever you detect significant data skew or model drift, it’s time to retrain. Essentially, this involves going through the whole data-pipeline and ML-model-training pass again, which some organizations call continuous training (CT), followed by another iteration of CI/CD for your next ML model version.
CT can be triggered anytime you detect data skew in incoming data, when you detect model drift, when customer satisfaction starts dropping, or on a periodic basis as a proactive measure.
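Those triggers can be combined into a simple retraining gate; the 30-day cadence below is just an illustrative default:

```python
def should_retrain(data_skew, model_drift, satisfaction_drop,
                   days_since_training, max_age_days=30):
    """Combine the CT triggers above: any one of them kicks off continuous training."""
    return any([
        data_skew,                              # incoming data no longer matches training data
        model_drift,                            # live accuracy has degraded
        satisfaction_drop,                      # customer satisfaction is falling
        days_since_training >= max_age_days,    # periodic, proactive retraining
    ])

print(should_retrain(False, False, False, 31))  # True: model is older than the cadence
```

Keeping the gate this explicit makes it easy to audit, after the fact, exactly which condition triggered a given retraining run.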
In addition, new ML technology is coming out all the time. Taking advantage of the latest ML technology can often lead to substantial improvements in ML model and service loss/accuracy statistics.
Some organizations automate the CT process, essentially triggering CT on a periodic basis, or when model drift, data skew, or a customer-experience change is detected, as mentioned above. But retraining depends upon properly tagged data, and automatically tagging data can be difficult. Nevertheless, if sampling and tagging incoming data is done (manually) anyway, then newly tagged data is always available for retraining when it’s needed.
Welcome to MLOps.