At Fiddler Labs, we place great emphasis on model explanations being faithful to the model’s behavior. Ideally, feature importance explanations should surface and appropriately quantify all and only those factors that are causally responsible for the prediction. This is especially important if we want explanations to be legally compliant (e.g., GDPR, article 13 section 2f, people have a right to ‘[information about] the existence of automated decision-making, including profiling .. and .. meaningful information about the logic involved’), and actionable. Even when making post-processing explanations human-intelligible, we must preserve faithfulness to the model.
How do we differentiate between features that are correlated with the outcome, and those that cause the outcome? In other words, how do we think about the causality of a feature to a model output, or to a real-world task? Let’s take those one at a time.
Explaining causality in models is hard
When explaining a model prediction, we’d like to quantify the contribution of each (causal) feature to the prediction.
For example, in a credit risk model, we might like to know how important income or zip code is to the prediction.
Note that zip code may be causal to a model's prediction (i.e. changing zip code may change the model prediction) even though it may not be causal to the underlying task (i.e. changing zip code may not change the decision of whether to grant a loan). However, these two things may be related if this model’s output is used in the real-world decision process.
The good news is that since we have input-output access to the model, we can probe it with arbitrary inputs. This allows examining counterfactuals, inputs that are different from those of the prediction being explained. These counterfactuals might be elsewhere in the dataset, or they might not.
Shapley values (a classic result from game theory) offer an elegant, axiomatic approach to quantify feature contributions.
One challenge is they rely on probing with an exponentially large set of counterfactuals, too large to compute. Hence, there are several papers on approximating Shapley values, especially for specific classes of model functions.
However, a more fundamental challenge is that when features are correlated, not all counterfactuals may be realistic. There is no clear consensus on how to address this issue, and existing approaches differ on the exact set of counterfactuals to be considered.
To overcome these challenges, it is tempting to rely on observational data. For instance, using the observed data to define the counterfactuals for applying Shapley values. Or more simply, fitting an interpretable model on it to mimic the main model’s prediction and then explaining the interpretable model in lieu of the main model. But, this can be dangerous.
Consider a credit risk model with features including the applicant’s income and zip code. Say the model internally only relies on the zip code (i.e., it redlines applicants). Explanations based on observational data might reveal that the applicant’s income, by virtue of being correlated to zip code, is as predictive of the model’s output. This may mislead us to explain the model’s output in terms of the applicant’s income. In fact, a naive explanation algorithm will split attributions equally between two perfectly correlated features.
To learn more, we can intervene in features. One counterfactual changing zip code but not income will reveal that zip code causes the model’s prediction to change. A second counterfactual that changes income but not zip code will reveal that income does not. These two together will allow us to conclude that zip code is causal to the model’s prediction, and income is not.
Explaining causality requires the right counterfactuals.
Explaining causality in the real world is harder
Above we outlined a method to try to explain causality in models: study what happens when features change. To do so in the real world, you have to be able to apply interventions. This is commonly called a “randomized controlled trial” (also known as an “A/B testing” when there are two variants, especially in the tech industry). You divide a population into two or more groups randomly, and apply different interventions to each group. The randomization ensures that the only differences among the groups are your intervention. Therefore, you can conclude that your intervention causes the measurable differences in the groups.
The challenge in applying this method to real-world tasks is that not all interventions are feasible. You can’t ethically ask someone to take up smoking. In the real world, you may not be able to get the data you need to properly examine causality.
We can probe models as we wish, but not people.
Natural experiments can provide us an opportunity to examine situations where we would not normally intervene, like in epidemiology and economics. However, these provide us a limited toolkit, leaving many questions in these fields up for debate.
There are proposals for other theories that allow us to use domain knowledge to separate correlation from causation. These are subject to ongoing debate and research.
Now you know why explaining causality in models is hard, and explaining it in the real world is even harder.