This post is a gentle introduction to a white box machine learning model called a GA2M.
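In this introduction to GA2Ms we’ll walk through what a white box model is, and then three specific white box models: logistic regression, GAMs, and GA2Ms.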
The purpose of all these machine learning models is to make a prediction towards a goal specified by a human. Think of a model that can predict loan default, or the presence of someone’s face in a picture.
The short story: A generalized additive model (GAM) is a white box model that is more flexible than logistic regression, but still interpretable. A GA2M is a GAM with interaction terms, which allows it to be more flexible still, but with a more complicated interpretation. GAMs and GA2Ms are an intriguing addition to your toolbox, interpretable at the expense of not fitting every kind of data. A picture:
For more about what that all means, read on.
The term “white box” comes from software engineering. It means software whose internals you can view, compared to a “black box” whose internals you cannot view. By this definition, a neural network could be a white box model if you can see the weights (picture credit):
However, by white box people really mean something they can understand. A white box model is a model whose internals a person can see and reason about. This is subjective, but most people would agree the weights shown above don’t tell us how the model works in a way that lets us usefully describe it, or predict what the model is going to do in the future.
Compare the picture above to this one, which shows risk of death from pneumonia by age:
Now that isn’t a whole model. Rather, it’s the impact of one feature (age) on the risk score. The green lines are error bars (±1 standard deviation in 100 rounds of bagging). The red line in the middle of them is the best estimate. In the paper, they observe:
All this from one graph of one model feature. There are facts, like the shape of the graph, and then speculation about why the graph might behave that way. The facts are useful to understand the data. The speculation cannot be answered by any tool, but may be useful to suggest further actions, like collecting new features (say, about retirement) or new instances (like points below age 50 or above 100), or new analyses (like looking carefully at data instances around ages 85-86 for differences).
These aren’t simulations of what the model would do. These are the internals of the model itself, so that graph is accurately describing the exact effect of age on risk score. There are 55 other components to this model, but each can be examined and reasoned about.
This is the power of a white box model.
This example also shows the dangers. By seeing everything, we may believe we understand everything, and speculate wildly or “fix” inappropriately. As always, we have to exercise judgment to use data properly.
In summary: make a white box model to
One final possibility: regulations dictate that you need to fully describe your model. In that case, it could be useful to have human-readable internals for reference.
Here are some examples of white box and black box models:
White box models: logistic regression; decision trees (short and few trees).
Black box models: neural networks (including deep learning); boosted trees and random forests (many trees); support vector machines.
Now let’s walk through three specific white box models.
The logistic function at the core of logistic regression dates to the 1800s, and the regression model built on it became popular in the 1900s. It’s been around for a long time, for many reasons. It solves a common problem (predict the probability of an event), and it’s interpretable. Let’s explore what that means. Here is the logistic equation defining the model:
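In standard notation, assuming n features x1 through xn (the symbols are reconstructed from the surrounding text, not copied from the original figure), the model reads:

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
```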
There are three types of variables in this model equation: the probability p of the event, the feature values (the x’s), and the coefficients (the betas).
The betas are fit once to the entire dataset. The x’s are different for each instance in the dataset. The p represents an aggregate of dataset behavior: any dataset instance either happened (1) or didn’t (0), but in aggregate, we’d like the right-hand side and the left-hand side to be as close as possible.
The “log(p/(1-p))” is the log odds, also called the “logit of the probability”. The odds are (probability the event happens)/(probability the event doesn’t happen), or p/(1-p). Taking the natural logarithm of the odds translates p, which ranges from 0 to 1, into a quantity that can range from -∞ to +∞, suitable for a linear model.
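To make the mapping concrete, here is a minimal sketch of the logit and its inverse in plain Python. The function names are chosen for illustration only.

```python
import math

def logit(p: float) -> float:
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z: float) -> float:
    """Map a log-odds value back to a probability (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities near 0 give large negative log odds, 0.5 gives zero,
# and probabilities near 1 give large positive log odds.
for p in (0.01, 0.5, 0.99):
    print(f"p={p:.2f}  odds={p / (1 - p):.4f}  log odds={logit(p):+.4f}")
```

Going the other way, `inverse_logit` turns any value of the linear right-hand side back into a probability between 0 and 1.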
This model is linear, but for the log odds. That is, the right-hand side is a linear equation, but it is fit to the log odds, not the probability of an event.
This model is interpretable as follows: a unit increase in xi corresponds to an increase of 𝛽i in the log odds, holding the other features constant.
For example, suppose we’re predicting probability of loan default, and our model has a feature coefficient 𝛽1=0.15 for the loan amount feature x1. That means a unit increase in the feature corresponds to an increase of 0.15 in the log odds of default. Exponentiating gives the odds ratio, exp(0.15)=1.1618. That means:
for this model, a unit increase (of say, a thousand dollars) in loan amount corresponds to a 16% increase in the odds of loan default, holding all other factors constant.
This statement is what people mean when they say logistic regression is interpretable.
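The arithmetic behind that statement can be sketched in a few lines of Python. The coefficient is the one from the loan example; the baseline probabilities and the helper `shift_probability` are hypothetical, added only to show that an odds ratio is a statement about odds, not about probability directly.

```python
import math

beta_loan_amount = 0.15  # log-odds change per unit of loan amount, from the example

odds_ratio = math.exp(beta_loan_amount)
print(f"odds ratio: {odds_ratio:.4f}")
print(f"change in odds: {(odds_ratio - 1) * 100:.0f}%")

def shift_probability(p, ratio):
    """Multiply the odds implied by probability p by `ratio`; return the new probability."""
    new_odds = (p / (1 - p)) * ratio
    return new_odds / (1 + new_odds)

# The same odds ratio moves the probability by different amounts
# depending on the (hypothetical) starting probability.
for baseline in (0.05, 0.50):
    print(f"{baseline:.2f} -> {shift_probability(baseline, odds_ratio):.4f}")
```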
To summarize why logistic regression is a white box model:
So why would we use anything other than the friendly, venerable model of logistic regression?
Well, if the features and log odds don’t have a linear relationship, this model won’t fit well. I always think of trying to fit a line to a parabola:
If you have non-linear data (the black parabola), a linear fit (the blue dashed line) will never be great. No line fits the curve.
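The parabola intuition can be checked numerically. This is a minimal sketch, assuming points sampled from y = x² on a range symmetric around zero; the data and variable names are made up for illustration.

```python
# Fit y = intercept + slope * x by ordinary least squares to points on y = x^2.
xs = [x / 10 for x in range(-30, 31)]  # -3.0 to 3.0, symmetric around 0
ys = [x * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for a single feature.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

# R^2 compares the line's squared error to that of always predicting the mean.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope={slope:.4f}  intercept={intercept:.4f}  R^2={r_squared:.4f}")
```

On this symmetric data the best line is flat, so it explains essentially none of the variation, even though y is completely determined by x.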
Generalized Additive Models (GAMs) were introduced in the late 1980s by Hastie and Tibshirani. (See also chapter 9 of the book “The Elements of Statistical Learning”.) Here is the equation defining the model:
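Using the same symbols as for logistic regression, and assuming n features (a reconstruction in standard notation, not the original figure), the model can be written as:

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)
```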
This equation is quite similar to logistic regression. It has the same three types of elements:
The big difference is that instead of a linear term 𝛽ixi for a feature, we now have a function fi(xi). In their book, Hastie and Tibshirani specify a “smooth” function like a cubic spline. Lou et al. looked at other functions for the fi, which they call “shape functions.”
A GAM also has white box features:
Now a term, instead of being a constant (beta), is a function, so instead of reporting the log odds as a number, we visualize it with a graph. In fact, the graph above of pneumonia risk of death by age is one term (shape function) in a GAM.
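To see how shape functions combine into a prediction, here is a minimal sketch of GAM scoring. It assumes piecewise-constant shape functions, which is only one possible choice, and every number and feature name other than age is invented for illustration. It shows scoring only, not how the functions are fit.

```python
import bisect
import math

def make_step_function(bin_edges, bin_values):
    """Piecewise-constant shape function.

    Inputs below bin_edges[0] get bin_values[0]; inputs at or above the
    last edge get bin_values[-1].
    """
    assert len(bin_values) == len(bin_edges) + 1

    def shape(x):
        return bin_values[bisect.bisect_right(bin_edges, x)]

    return shape

# Hypothetical shape functions: each returns a log-odds contribution.
shape_functions = {
    "age": make_step_function([50, 65, 85], [-0.4, 0.0, 0.3, 0.6]),
    "heart_rate": make_step_function([60, 100], [0.2, 0.0, 0.5]),
}
intercept = -2.0  # beta_0, also made up

def log_odds(instance):
    """Intercept plus the sum of each feature's shape-function value."""
    return intercept + sum(f(instance[name]) for name, f in shape_functions.items())

def predict_probability(instance):
    return 1.0 / (1.0 + math.exp(-log_odds(instance)))

patient = {"age": 70, "heart_rate": 110}
print(f"log odds={log_odds(patient):.2f}  probability={predict_probability(patient):.4f}")
```

Because the prediction is a plain sum, each feature's contribution can be read off its own graph without reference to the others.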
So why would we use anything other than a GAM? It’s already flexible and interpretable. Same reason as before: it might not be accurate enough. In particular, we’ve assumed that each feature response can be modeled with its own function, independent of the others.
But what if there are interactions between the features? Several black box models (boosted trees, neural networks) can model interaction terms. Let us walk through a white box model that also can: GA2Ms.
GA2Ms were investigated in 2013 by Lou et al. The authors pronounce them with the letters “gee ay two em”, but in house we’ve taken to calling them “interaction GAMs” because it’s more pronounceable. Here is the model equation:
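In the same notation as the GAM equation (again a reconstruction in standard notation, with fij as an assumed name for the pairwise functions), the model can be written as:

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i} f_i(x_i) + \sum_{i<j} f_{ij}(x_i, x_j)
```

In practice only a selected subset of the pairs (i, j) is included, not every possible pair.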
This equation is quite similar to the GAM equation from the previous section, except it adds functions that can account for two feature variables at once, i.e. interaction terms.
Microsoft just released a library, InterpretML, that implements GA2Ms in Python. In that library, they call them “Explainable Boosting Machines.”
Lou et al. say these are still white box models because the “shape function” for an interaction term is a heatmap. The two features are along the x and y axes, and the color in the middle shows the function response. Here is an example from Microsoft’s library, fit to predict loan default on a dataset of loan performance from Lending Club:
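A heatmap is just a two-dimensional lookup table, so scoring with an interaction term can be sketched as follows. The bin edges and all values are invented for illustration and are not taken from the fitted model in the figure; the feature names dti and fico are borrowed from the example.

```python
import bisect

def bin_index(edges, x):
    """Index of the bin that x falls into, given sorted bin edges."""
    return bisect.bisect_right(edges, x)

# Hypothetical bins and log-odds contributions.
dti_edges = [10, 20, 30]
fico_edges = [660, 700, 740]
f_dti = [-0.2, 0.0, 0.2, 0.4]    # single-feature term for dti
f_fico = [0.5, 0.2, -0.1, -0.4]  # single-feature term for fico

# Interaction term: rows are dti bins, columns are fico bins.
# This grid is what the heatmap visualizes.
f_dti_fico = [
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.05],
    [0.05, 0.00, 0.00, 0.10],
    [0.10, 0.00, 0.05, 0.30],
]
intercept = -1.5

def log_odds(dti, fico):
    i = bin_index(dti_edges, dti)
    j = bin_index(fico_edges, fico)
    # Single-feature effects plus the extra adjustment for this combination.
    return intercept + f_dti[i] + f_fico[j] + f_dti_fico[i][j]

print(f"{log_odds(dti=35, fico=760):.2f}")
```

The interaction cell only adds an adjustment on top of the two single-feature terms, which is why a heatmap alone does not show the overall effect.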
For this example graph:
This particular heatmap is hard to reason about. This is likely only the interaction effect without the single-feature terms. So, it could be that the probability of default overall isn’t higher at high-dti and high-fico, but rather just higher than either of the primary effects predict by themselves. To investigate further, we could probably look at some examples around the borders. But, for this blog post, we’ll skip the deep dive.
In practice, this library fits all single-feature functions, then N interaction terms, where you pick N. It is not easy to pick N. The interaction terms are worthwhile if they add enough accuracy to be worth the extra complexity of staring at heatmaps to interpret them. That is a judgement call that depends on your business situation.
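As a rough sketch of what picking N looks like in code, the helper below wraps InterpretML’s `ExplainableBoostingClassifier`. It assumes the `interpret` package is installed and that its `interactions` argument sets the number of pairwise terms; the helper name `fit_ebm` and the default of 10 are illustrative choices, and details may differ between library versions.

```python
def fit_ebm(X, y, n_interactions=10):
    """Fit an Explainable Boosting Machine with n_interactions pairwise terms.

    X is a feature table (for example a pandas DataFrame) and y holds the
    binary labels. Requires `pip install interpret`.
    """
    # Imported inside the function so that defining this helper does not
    # itself require InterpretML to be installed.
    from interpret.glassbox import ExplainableBoostingClassifier

    ebm = ExplainableBoostingClassifier(interactions=n_interactions)
    ebm.fit(X, y)
    return ebm

# Usage sketch (X_train and y_train are placeholders for your own data):
#   ebm = fit_ebm(X_train, y_train, n_interactions=5)
#   explanation = ebm.explain_global()  # per-term shape functions and heatmaps
```

One way to approach the judgement call is to fit with a few values of `n_interactions` and compare held-out accuracy against the number of heatmaps you are willing to read.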
To perform machine learning, first pick a goal. Then pick a technology that will best use your data to meet the goal. There are thousands of books and millions of papers on that subject. But, here is a drastically simplified way to think about how GA2Ms fit into possible model technologies: they are on a spectrum from interpretability to modeling feature interactions.
In all cases, you may well need domain-specific data preprocessing, like squaring images, or standardizing features (subtracting the mean and dividing by the standard deviation). That is a topic for another day.
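For reference, standardizing a single feature as described above can be sketched with the standard library alone. The function name and sample values are made up, and the population standard deviation is used here as one reasonable choice.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

loan_amounts = [5000, 12000, 20000, 35000]  # hypothetical feature values
print([round(z, 3) for z in standardize(loan_amounts)])
```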
Now hopefully the diagram we started with makes more sense.