In this article of the Machine Learning Operations series we will look in detail at what monitoring a Machine Learning model involves.
So let’s get started!
Introduction
In Machine Learning Operations (or MLOps) we seek to take a Machine Learning model from the development stage to the production stage.
In the development stage we train and validate the model, and then we take it to the production stage, where we deploy it and make it available to the end user.
And it is precisely after deployment that problems can arise that cause the model not to perform as expected.
So in this article we will see what monitoring a Machine Learning model involves: continuously measuring the model's performance, determining when and why it degrades, and then taking the necessary corrective actions.
What can go wrong after deployment?
In previous articles we talked about what Machine Learning Operations is and what deploying a Machine Learning model consists of: taking it from the development stage to the production stage and making it available to an end user.
But this is only the beginning of the story, as the challenge is to ensure that the model continues to make good predictions even after it has been trained and deployed.
To understand this let’s consider a hypothetical example: suppose we develop a model for a department store to predict how many products in each category will be needed each week, so that store managers can have enough products in inventory in advance.
We train, validate and deploy our model, and everything works flawlessly. During the first few months of using the model, sales increase, because there is always a sufficient amount of product in inventory for the different categories.
But about a year later the numbers start to drop. The demand for some products is higher than the model predicted, and for those categories there are not enough products in the store. In other cases the opposite happens: for some products the demand is lower than predicted, so there is an excess of those products in inventory.
Ultimately, this model, which at first worked perfectly, degrades over time and, instead of generating revenue, begins to generate losses for the store.
And this is precisely an example of what generally happens in practice: deploying a model is not the end of the process, as its performance can degrade over time.
So we must continuously monitor its performance to detect possible failures and take timely corrective actions if necessary.
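In practice, this monitoring often starts with something as simple as logging each prediction together with the value that eventually materializes, and tracking an error metric over a sliding window. Below is a minimal sketch for the store example, assuming we log weekly predicted and actual demand in a pandas DataFrame; the column names, numbers and alert threshold are all illustrative.

```python
import pandas as pd

# Hypothetical log of weekly demand predictions vs. actual sales
# for one product category (all values are illustrative).
log = pd.DataFrame({
    "week":      range(1, 9),
    "predicted": [120, 115, 130, 125, 118, 140, 135, 150],
    "actual":    [118, 117, 128, 127, 150, 180, 175, 190],
})

# Rolling mean absolute error over the last 4 weeks.
log["abs_error"] = (log["predicted"] - log["actual"]).abs()
log["rolling_mae"] = log["abs_error"].rolling(window=4).mean()

# Flag weeks where the rolling error exceeds an illustrative threshold.
THRESHOLD = 20
alerts = log[log["rolling_mae"] > THRESHOLD]
print(alerts[["week", "rolling_mae"]])
```

In a real system this logic would run on a schedule and feed an alerting tool, but the idea is the same: compare predictions against reality as soon as the actual values become available.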
Why can the model fail?
There are essentially two main groups of situations that can explain performance degradation: software failures and model failures.
Software failures
Software failures are due, as the name implies, to elements of the software used during deployment that do not work as expected.
For example, they can be caused by libraries or packages that were not installed correctly during deployment, or by failures in the CPU or GPU of the servers.
But this type of failure is not the one we are interested in, as it depends on external factors and is usually better handled by a Software Engineer than by a Machine Learning Engineer.
Model failures
The failures we are interested in are those directly related to Machine Learning, which cause the model to stop generating good predictions. These failures are more difficult to detect and handle than software failures, but they must be addressed so that the model can remain in production.
The most common ones are what are known as distribution shifts, and they can end up affecting virtually any Machine Learning model.
When we trained the model (in the development stage) we used a training set and a test set. Ideally, the data the model receives in production should have the same characteristics as the data used in development, but this is almost impossible to guarantee, and even slight differences can end up affecting the performance of the model.
Among the most common distribution shifts we have “data drift” and “concept drift”.
In “data drift” there are changes, slight or significant, in the characteristics or distribution of the input data with respect to those used during training.
Suppose we train a face recognition model using only images taken during the day. We take it to production and initially it works quite well. But then we start using it to detect faces at night, and this is where we begin to see the performance degrade. This is precisely a case of “data drift”, because the training data comes from a different distribution than the data the model receives in production.
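One common way to detect this kind of drift is to compare the distribution of each input feature in production against its distribution in the training data using a two-sample statistical test. Below is a minimal sketch using the Kolmogorov–Smirnov test from SciPy; the “image brightness” feature, the synthetic data and the 0.05 threshold are illustrative assumptions, not part of the face recognition example itself.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Illustrative feature: average image brightness (0 = dark, 1 = bright).
# Training images were taken during the day...
train_brightness = rng.normal(loc=0.7, scale=0.1, size=1000)
# ...while production images now include many taken at night.
prod_brightness = rng.normal(loc=0.3, scale=0.1, size=1000)

# Two-sample Kolmogorov–Smirnov test: are both samples drawn
# from the same distribution?
statistic, p_value = ks_2samp(train_brightness, prod_brightness)

if p_value < 0.05:  # illustrative significance threshold
    print(f"Possible data drift detected (p = {p_value:.3g})")
else:
    print("No significant drift detected")
```

The same check can be run periodically for each input feature, so that a drifting feature is flagged before the model's predictions degrade noticeably.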
“Concept drift”, on the other hand, occurs when the distribution of the input data remains unchanged, but the relationship between the inputs and the value we want to predict changes, so the model's predictions no longer match reality.
For example, suppose we develop a model to predict the price of a property based on some of its characteristics such as area, number of bedrooms and number of bathrooms.
For a particular property the model predicts a price of $500,000. But some time later a real estate crisis hits and, even though the input data keeps the same distribution used during training, that same property now actually sells for around $200,000 while the model keeps predicting $500,000: the relationship the model learned between the features and the price no longer holds.
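A simple way to surface concept drift is to keep measuring the model's error on recently labeled data and compare it against a baseline error recorded at deployment time: if the error jumps while the input features still pass the data drift checks, a change in the relationship between inputs and target is the likely cause. Here is a minimal sketch under those assumptions; the baseline value and the synthetic prices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative baseline: validation error (in dollars) measured
# when the model was deployed.
baseline_mae = 30_000

# Hypothetical recent data: the model keeps predicting pre-crisis
# prices, but actual sale prices have dropped sharply.
predicted = rng.normal(loc=500_000, scale=20_000, size=200)
actual = rng.normal(loc=200_000, scale=30_000, size=200)

recent_mae = np.mean(np.abs(predicted - actual))

# If the error grows far beyond the baseline while the input features
# still pass the data-drift checks, concept drift is a likely cause.
if recent_mae > 2 * baseline_mae:
    print(f"Possible concept drift: MAE jumped from "
          f"{baseline_mae:,.0f} to {recent_mae:,.0f}")
```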