In our last post we discussed why monitoring is necessary in machine learning, and we identified the main factors that can degrade a model’s performance.
So we can define monitoring as the phase of Machine Learning Operations in which we measure different performance variables of the model and compare them against reference values, to determine whether it continues to generate adequate predictions or whether action is needed to restore its performance.
And there are several ways to perform this monitoring, some quite simple and others more sophisticated.
Monitoring through global metrics
The simplest approach is to continuously record a global metric of the model’s performance and compare it against a reference level.
For example, if a face detection system reached an accuracy of 97% during development, we can periodically (e.g. daily) measure this same metric on the deployed model. If it falls below the reference level, an alert can be generated indicating that we should take action before things continue to get worse.
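As a minimal sketch of this idea (not a production monitoring system), the snippet below compares the latest measured accuracy against the development reference and raises an alert when it drops too far below it. The helper name, the reference value and the tolerance are illustrative assumptions.

```python
# Minimal sketch of global-metric monitoring (hypothetical names and thresholds).
REFERENCE_ACCURACY = 0.97  # accuracy measured at the development stage

def check_global_metric(current_accuracy: float,
                        reference: float = REFERENCE_ACCURACY,
                        tolerance: float = 0.02) -> bool:
    """Return True if the deployed model is still within tolerance of the reference."""
    if current_accuracy < reference - tolerance:
        # In a real system this would trigger an alert (page, ticket, dashboard, etc.).
        print(f"ALERT: accuracy dropped to {current_accuracy:.2%} "
              f"(reference {reference:.2%})")
        return False
    return True

# Example of a daily check, assuming today's accuracy was measured elsewhere:
check_global_metric(current_accuracy=0.91)
```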
The drawback of monitoring using a global performance metric is that we cannot determine the reasons behind the degradation, i.e. whether the underlying problem is “data drift” or “concept drift”.
Monitoring through statistical methods
A more sophisticated approach is to obtain the statistical distribution of the input data before deployment, periodically compute the same distribution for the data received by the deployed model, and then apply a statistical test to determine whether there are significant differences between the two. If such differences are found, we can conclude that the degradation originates in “data drift”.
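As an illustrative sketch of this technique (one possible test among several), the snippet below applies SciPy’s two-sample Kolmogorov–Smirnov test to a single numeric input feature, comparing a reference sample collected before deployment with a recent sample seen by the deployed model. The data here is synthetic, purely for demonstration.

```python
# Sketch of input-distribution monitoring with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference_sample: np.ndarray,
                 production_sample: np.ndarray,
                 alpha: float = 0.05) -> bool:
    """Return True if the two samples differ significantly (possible drift)."""
    statistic, p_value = ks_2samp(reference_sample, production_sample)
    return p_value < alpha

# Synthetic example: the production feature values have shifted slightly.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)   # pre-deployment feature values
production = rng.normal(loc=0.5, scale=1.0, size=1000)  # recent feature values

if detect_drift(reference, production):
    print("Significant difference in the input distribution: possible data drift")
```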
We can do something similar with the distributions of the model’s outputs before and after deployment: if we find statistically significant differences there, we can conclude that in this case the performance degradation is due to “concept drift”.
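The same kind of test can be reused on the model’s outputs. A minimal, self-contained sketch, assuming the outputs are numeric scores (e.g. predicted probabilities) and using synthetic data, is shown below.

```python
# Sketch of output-distribution monitoring with the same two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
reference_scores = rng.beta(2.0, 2.0, size=1000)   # model outputs before deployment
production_scores = rng.beta(2.0, 5.0, size=1000)  # recent outputs of the deployed model

statistic, p_value = ks_2samp(reference_scores, production_scores)
if p_value < 0.05:
    print("Significant difference in the output distribution: possible concept drift")
```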
Conclusion
In this article we have seen that after deployment the model’s performance is very likely to begin to decline, precisely because both the data and the environment in which the model operates are dynamic and can vary continuously.
Monitoring allows us to detect this performance degradation, either by analyzing global metrics or by using more advanced techniques such as statistical tests applied to the model’s input or output data.
But this process does not end with monitoring, because if performance degradation is confirmed, corrective actions must be taken to keep the model in production. This phase is known as model maintenance and will be discussed in a future article.