Machine Learning (ML) applications are built on data: large amounts of data are streamed into ML systems in order to train models on historical examples and make high-quality predictions in real-time. Yet many machine learning projects degrade or fail, in large part because data integrity is difficult to maintain.
Why data breaks
Maintaining data integrity is a hard problem, and the dynamic nature of ML data is a common reason for model decay. ML systems are increasingly driven by complex feature pipelines and automated workflows in which data changes constantly. Data often goes through multiple transformations to shape it into the right form for the model to consume, and these transformations need to be applied consistently to all data of the same kind along the ML pipeline. Furthermore, models usually consume data from multiple pipelines that may be managed by different teams and expose very different interfaces. For example, ML models often need to combine data from both historical batch sources and real-time streams. With so many moving parts, including data and model versions, it’s common for ML models in production to see data inconsistencies and errors.
ML systems will serve predictions for these malformed inputs without realizing they have data issues. Without additional monitoring, these errors tend to go undetected and eat away at the model’s performance over time. Even worse, since machine learning applications are often treated as a black box, these errors may go unresolved even after detection if the immediate impact is low. Unlike other types of software, ML applications lack a comprehensive solution that puts the right processes and monitoring in place.
What is ML data integrity?
While data integrity can have different definitions depending on who you’re talking to, the most commonly accepted version is simply that the data is consistent and free of inaccuracies throughout its lifecycle.
Data integrity: the data used in machine learning is consistent and free of inaccuracies throughout its lifecycle.
Bad data can have a significant impact on the performance of ML models, whether in training or in production. Data engineers dedicate a lot of time during feature engineering to ensuring that a training set has good enough data quality to train the most representative model possible. During feature engineering, they replace bad data with good data using a consistent set of rules: dropping rows with missing values, allowing missing values, treating a missing categorical value as its own distinct category, imputing missing values from other inputs, replacing missing values with a statistical representation (e.g. the mean), or defaulting to some fixed value.
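The cleanup rules above can be sketched in a few lines. This is a minimal illustration on hypothetical data, not a prescription; the feature names and values are made up, and real pipelines would typically use a dataframe library rather than plain lists.

```python
from statistics import mean

# Hypothetical training rows: (age, country); None marks a missing value.
rows = [(34, "US"), (None, None), (52, "DE"), (41, "US")]

# Rule 1: drop rows with any missing value.
complete = [r for r in rows if None not in r]

# Rule 2: treat a missing categorical value as its own distinct category.
countries = [c if c is not None else "UNKNOWN" for _, c in rows]

# Rule 3: replace a missing numeric value with a statistical
# representation (here, the mean of the observed values).
known_ages = [a for a, _ in rows if a is not None]
ages = [a if a is not None else mean(known_ages) for a, _ in rows]

# Rule 4: default to a fixed sentinel value.
ages_defaulted = [a if a is not None else -1 for a, _ in rows]
```

Whichever rule a team picks, the key point is consistency: the same rule must be applied to the same feature at both training and inference time.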
However, not all ML teams approach deployed models with the same level of diligence, even though deployed models run into the same challenges. Once an ML model is trained, it is indifferent to bad data: it will keep making predictions regardless of data quality, even when those predictions are bad.
ML models in production face three types of data integrity problems:
Missing value, where a feature input is null or unavailable at inference time.
Range violation, where a feature input is either out of expected bounds or takes a known-bad value.
Type mismatch, where an input of a different data type than expected is passed in.
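The three failure modes can be made concrete with a minimal per-feature check. This is a sketch with assumed bounds (age as a float in [0, 120]); the function name and signature are illustrative, not part of any particular library.

```python
def check_feature(value, expected_type, low, high):
    """Return the data-integrity violations for a single feature input."""
    violations = []
    if value is None:                      # missing value
        violations.append("missing_value")
        return violations
    if not isinstance(value, expected_type):  # type mismatch
        violations.append("type_mismatch")
        return violations
    if not (low <= value <= high):         # range violation
        violations.append("range_violation")
    return violations

print(check_feature(None, float, 0.0, 120.0))   # missing input
print(check_feature("42", float, 0.0, 120.0))   # string where float expected
print(check_feature(999.0, float, 0.0, 120.0))  # out of expected bounds
```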
There are several ways to handle bad production data, but many of them create problems of their own:
Discard inference requests - If the data is bad, the serving system can skip the prediction to avoid erroring out or making an inaccurate prediction. While this can be a solution when the model makes a large number of non-critical decisions (e.g. product recommendation), it’s not an option when it makes business or life-critical decisions (e.g. healthcare). In those cases, there needs to be a backup decision-making system to ensure an outcome. However, these backup systems can further complicate the solution.
Impute or predict missing values - When a value is missing, it can be replaced with an estimate: either a simple statistic like the mean of the feature or a more complex value predicted from the other model inputs. A key challenge of this approach is that it hides the problem behind the data issue. Consistently replacing bad data can shift the feature’s expected distribution (aka data drift), causing the model to degrade. Drift caused by this kind of data replacement can be very difficult to catch, eroding the model’s performance slowly over time.
Set default values - When a value is out of range, it can be replaced by a known high, low, or sentinel value, e.g. replacing an improbably high or low age with the closest known maximum or minimum. This can also cause gradual drift over time, impacting performance.
Acquire missing data - In some critical high value use cases like lending, ML teams also have the option to acquire the missing data to fill the gap. This is not typical for the vast majority of use cases.
Do nothing - Depending on the criticality of your use case, this is the simplest and often the best approach. It allows bad data to surface upstream or downstream so that the problem behind it can be resolved. Many inference engines will simply throw an error, depending on the ML algorithm used to train the model, and a prediction made on bad data can show up as an outlier in either the output or the affected input, helping surface the issue.
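The handling strategies above could be sketched as a serving-time dispatch. This is a hypothetical example: the function name, the strategy labels, and the precomputed mean are all assumptions for illustration.

```python
FEATURE_MEAN = 41.2  # assumed: precomputed from the training set

def handle_bad_input(value, strategy):
    """Apply one handling strategy to a possibly-missing feature value."""
    if value is not None:
        return value
    if strategy == "discard":
        # Skip the prediction entirely; a backup system must take over.
        raise ValueError("skipping prediction: missing input")
    if strategy == "impute":
        # Hides the issue and can slowly shift the feature distribution.
        return FEATURE_MEAN
    if strategy == "default":
        return 0.0  # sentinel default; can also cause gradual drift
    # "Do nothing": pass the bad value through and let it surface.
    return value
```

A good value is always passed through unchanged, so the choice of strategy only matters on the failure path.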
Given all the data challenges that may occur, you need an early warning system so that you can immediately catch and address these issues.
Examples of data integrity issues
Missing values can occur regularly at ML model inference. Even when missing values are allowed in features, a model can see far more of them in production than it did in the training set. An example of a missing value error is an ML model making an inference on a form input where a previously optional field now always sends a null value due to a code error.
Range violation happens when a model input exceeds the expected range of its values. Typos and cardinality mismatches in categorical inputs commonly cause this problem, e.g. free-form text entry for category fields or numeric fields like age. An unknown product SKU, an incorrect country, and inconsistent categorical values due to pipeline state are all examples of range violations.
Type mismatch arises when the input provided at inference time has a different type than the model expects. One common way types get mismatched is when column order gets misaligned during data wrangling operations.
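The column-misalignment case is easy to reproduce. In this hypothetical sketch, an upstream refactor reorders the row, and zipping it against the original schema silently hands the model a string where a float belongs.

```python
# Schema the model was trained against (assumed).
columns = ["age", "income", "country"]

# After a hypothetical upstream refactor, the row arrives reordered:
reordered_row = ["US", 42.0, 55000.0]

# Pairing the stale schema with the reordered row misassigns every field.
record = dict(zip(columns, reordered_row))

# record["age"] is now the string "US": a type mismatch the model will see,
# and record["country"] holds a number, a range violation for a category.
print(record)
```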
How to catch data integrity issues
Setting data checks is cumbersome: while data checks might exist in the ML pipeline and in invocation code, a thorough approach involves checks at model inference to catch issues at run time. But these missing value, type mismatch, and range checks become tedious to add as the number of features grows. A quick way to generate them is to derive rules from a representative sample of the training set, then set up a job that regularly assesses incoming data against those rules and alerts on violations.
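Deriving checks from a training sample might look like the following. This is a simplified sketch, not a production validator: the rule format is made up, and numeric bounds are taken naively as the observed min and max of the sample.

```python
def infer_checks(sample):
    """Derive per-feature rules (type, bounds, allowed categories) from a
    training sample given as {feature name: list of observed values}."""
    checks = {}
    for name, values in sample.items():
        observed = [v for v in values if v is not None]
        if all(isinstance(v, (int, float)) for v in observed):
            checks[name] = {"type": (int, float),
                            "min": min(observed), "max": max(observed)}
        else:
            checks[name] = {"type": str, "allowed": set(observed)}
    return checks

def validate(row, checks):
    """Return the features of `row` that violate the derived rules."""
    bad = []
    for name, rule in checks.items():
        v = row.get(name)
        if v is None or not isinstance(v, rule["type"]):
            bad.append(name)                 # missing value or type mismatch
        elif "min" in rule and not rule["min"] <= v <= rule["max"]:
            bad.append(name)                 # numeric range violation
        elif "allowed" in rule and v not in rule["allowed"]:
            bad.append(name)                 # unknown category
    return bad

checks = infer_checks({"age": [21, 35, 60], "country": ["US", "DE"]})
print(validate({"age": 300, "country": "FR"}, checks))  # flags both features
```

A monitoring job would run `validate` over each batch of inference inputs and alert when the violation rate crosses a threshold.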
How to assess and mitigate data integrity issues
When data failures are caught (many go undetected for weeks or sometimes longer), it’s important for teams to prioritize fixes by understanding which data violations have the most impact on model performance. Leaving issues unresolved can have unintended consequences, especially given the brittle nature of ML models.
So how are data issues assessed?
Analyze locally - For critical use cases, the best practice is to begin with a fine-grained analysis of individual predictions: replay the inference that contained the issue and observe its impact on the model. Explainable AI helps here by quantifying the impact of the data violations so they can be root-caused quickly, especially in the context of increasingly black-box models. However, it can be time-consuming to recreate all of the factors that led to an issue, especially if the data or the model has since changed. The data in question may not have been stored with the input query and might need to be recreated, and results can be hard to reproduce if the model has not been properly versioned.
Analyze globally - For global issues, the troubleshooting scope expands to understanding the severity of the data issue. This involves analyzing the feature’s data over a broader range of time to see when the issue might have begun. Data changes typically coincide with product releases, so querying the data-change timeline can tie the issue to a specific code or data release, helping teams revert or address it quickly.
Data issues show up as drift in the model’s inputs and, depending on the impact, a corresponding drift in its output. Drift analysis is therefore a useful way to identify the cause of a data integrity issue. It is particularly relevant when data is imputed in response to an integrity issue: the composition of the input data will shift even though no integrity violation is triggered.
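One common way to quantify such a shift is the Population Stability Index (PSI) between a training baseline and recent production data. This is a simplified sketch; the binning scheme and the rule of thumb that PSI above 0.2 signals significant drift are conventions, not part of any standard API.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a
    current sample of a numeric feature. Higher means more drift;
    a common (assumed) rule of thumb flags PSI > 0.2 as significant."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)  # clamp overflow
            counts[max(i, 0)] += 1                    # clamp underflow
        # Small epsilon avoids log(0) for empty bins.
        return [(c + 1e-6) / (len(values) + 1e-6 * bins) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Running this per feature on a schedule turns silent imputation-driven drift into an explicit, alertable signal.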
These steps typically help to pinpoint and assess the data issue in the pipeline. Given how involved this troubleshooting process can be, ML teams struggle to quickly address data integrity problems without a good MLOps solution.
In summary, data integrity is an essential component for success with any ML application, and MLOps Monitoring can help with the pain points of setting up the right checks, detecting anomalies in the data, and prioritizing the failures with the biggest impact.