
Fiddler & Captum Join Hands to Enhance Explainable AI Offerings

We are excited to announce that Fiddler and Captum, from Facebook AI, are collaborating to push the boundaries of Explainable AI offerings. The goals of this partnership are to help the data science community to improve model understanding and its applications, as well as to promote the usage of Explainable AI in the ML workflow.

Fiddler is an Explainable AI Platform that enables business and technical stakeholders to explain, monitor, and analyze AI models in production. Users get actionable, model-level insights that use explanations to identify what is driving a problem, so they can efficiently find root causes and resolve issues. Captum is a model interpretability suite for PyTorch developed by Facebook AI. It offers a number of attribution algorithms that allow users to understand the importance of input features, hidden neurons, and layers.

Need for attributions in ML

ML methods have made remarkable progress over the last decade, achieving superhuman performance on a variety of tasks. Unfortunately, much of the recent progress in machine learning has come at the cost of the models becoming more opaque and “black box” to us humans. With the increasing use of ML models in high-stakes domains such as hiring, credit lending, and healthcare, the impact of ML methods on society can be far-reaching. Consequently, there is a tremendous need today for explainability methods from ethical, regulatory, end-user, and model developer perspectives. An overarching question that arises is: why did the model make this prediction? This question matters to developers debugging (mis-)predictions, to regulators assessing the robustness and fairness of the model, and to end-users deciding whether they can trust the model.

One concrete formulation of this why question is the attribution problem. Here, we seek to explain a model’s prediction on an input by attributing the prediction to features of the input. The attributions are meant to be commensurate with each feature’s contribution to the prediction. Attributions are an effective tool for debugging mis-predictions, assessing model bias and unfairness, generating explanations for the end-user, and extracting rules from a model. Over the last few years, several attribution methods have been proposed to explain model predictions. The diagram below shows all attribution algorithms available in the Captum library divided into two groups. The first group, listed on the left side of the diagram, allows us to attribute the output predictions or the internal neurons to the inputs of the model. The second group, listed on the right side, includes several attribution algorithms that allow us to attribute the output predictions to the internal layers of the model. Some of those algorithms in the second group are different variants of the ones in the first group.

A prominent method among them is Integrated Gradients (IG), which has recently become a popular tool for explaining predictions made by deep neural networks. It belongs to the family of gradient-based attribution methods, which compute attributions by examining the gradient of the prediction with respect to the input feature. Gradients essentially capture the sensitivity of the prediction with respect to each feature.

IG operates by considering a straight-line path, in feature space, from the input at hand (e.g., an image from a training set) to a certain baseline input (e.g., a black image), and integrating the gradient of the prediction with respect to input features (e.g., image pixels) along this path. The highlight of the method is that it is proven to be unique under a certain set of desirable axioms. As a historical note, the method is derived from the Aumann-Shapley method from cooperative game theory. We refer the interested reader to the paper and this blog post for a thorough analysis of the method. IG is attractive because it is broadly applicable to all differentiable models, is easy to implement in popular ML frameworks such as PyTorch and TensorFlow, and is backed by an axiomatic theory. Because it relies on gradients, IG is also much faster than Shapley value methods based on combinatorial calculation. IG is one of the key explainability methods made available by Captum for PyTorch models.
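To make the path integral concrete, here is a minimal sketch in plain Python. It is not Captum's implementation: the toy logistic model, its weights, the zero baseline, and the function names are all assumptions chosen for illustration, and the integral is approximated with a midpoint Riemann sum. The last lines check one of the axioms IG satisfies, completeness: the attributions should add up to the difference between the prediction at the input and the prediction at the baseline.

```python
import math

def model(x):
    """Toy differentiable model: a logistic unit over two features (made-up weights)."""
    z = 2.0 * x[0] - 1.0 * x[1] + 0.5
    return 1.0 / (1.0 + math.exp(-z))

def model_grad(x):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x)
    return [2.0 * p * (1.0 - p), -1.0 * p * (1.0 - p)]

def integrated_gradients(x, baseline, grad_fn, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight line
    between the baseline and the input."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i, g in enumerate(grad):
            avg_grad[i] += g / steps
    # Scale the path-averaged gradient by each feature's distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

x = [1.0, 3.0]
baseline = [0.0, 0.0]  # assumed "uninformative" input, like a black image
attributions = integrated_gradients(x, baseline, model_grad)

# Completeness: attributions sum to model(x) - model(baseline), up to approximation error.
completeness_gap = abs(sum(attributions) - (model(x) - model(baseline)))
```

With real networks the gradient comes from automatic differentiation rather than a hand-written formula, but the structure of the computation is the same.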

In the rest of the post, we will demonstrate how IG explanations, enabled by the Captum framework, can be leveraged within Fiddler to explain a toxicity model.

Example — The toxicity model

Detecting conversational toxicity is an important yet challenging task that involves understanding many subtleties and nuances of human language.

Over the last few years, various classifiers have been built that aim to address these problems and increase the prediction accuracy of machine learning models.

Although prediction accuracy is an important metric, it is just as crucial for these tasks to understand how those models reason and whether they capture semantics or pick up unintended bias.

In this case study, we fine-tuned a BERT classification model on a conversational toxicity dataset, ran predictions on a subset of sentences, and computed each token’s importance for the predicted samples using integrated gradients. Integrated gradients is a gradient-based attribution algorithm that assigns an importance score to each token embedding by integrating the gradients along the path from a sentence that has no toxic features to the one that is classified as toxic.
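Because the attributions are computed on token embeddings, each token ends up with one value per embedding dimension rather than a single score. One common convention, sketched below, is to sum over the embedding dimension and then L2-normalize across tokens. This is an assumption about post-processing, not necessarily what was done for this model; the function name and the numbers are made up for illustration.

```python
import math

def summarize_token_attributions(embedding_attributions):
    """Collapse per-dimension attributions into one L2-normalized score per token."""
    token_scores = [sum(dims) for dims in embedding_attributions]
    norm = math.sqrt(sum(s * s for s in token_scores))
    if norm == 0.0:
        return token_scores  # all-zero attributions: nothing to normalize
    return [s / norm for s in token_scores]

# Invented attributions for three tokens with a 4-dimensional embedding (illustrative only).
tokens = ["you", "are", "awful"]
embedding_attributions = [
    [0.01, -0.02, 0.00, 0.03],
    [0.00, 0.01, -0.01, 0.00],
    [0.20, 0.15, -0.05, 0.30],
]
scores = summarize_token_attributions(embedding_attributions)

# The token with the largest positive score gets the most "blame" for the prediction.
top_token = max(zip(tokens, scores), key=lambda pair: pair[1])[0]
```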

For training, we used the English Wikipedia talk dataset, which contains 160k labeled discussion comments.

About 60% of that dataset was used for training, 20% for development and another 20% for testing purposes.
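A 60/20/20 split like this can be sketched as follows. The post does not say how the split was drawn, so the random shuffle, the fixed seed, and the function name are assumptions.

```python
import random

def split_dataset(examples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle the examples and cut them into train / dev / test partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains goes to the test set
    return train, dev, test

# Integer ids stand in for the labeled comments.
train, dev, test = split_dataset(range(1000))
```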

We reached 97% training accuracy and 96% test accuracy overall by fine-tuning a vanilla BERT binary classification model on the dataset described above.

Side Note: This Unintended ML Bias Analysis presentation contains interesting insights on bias in text classification.

We will be sharing the training script and other analysis notebooks shortly.

Importing the model into Fiddler

Here we give a general rundown of the steps to import a PyTorch model, using the model above as an example.

A core feature of Fiddler, as an Explainable AI Platform, is generating explanations and predictions for the models loaded into it. With the Fiddler Python package, uploading models into Fiddler is simple. Once you initialize the Fiddler API object, you can programmatically harness the power of the Fiddler Platform to interpret and explain your models.

It can be done in four simple steps:

Step 1: Create a project — models live inside a project. Think of it as a directory for models.

Step 2: Upload the dataset that the model uses — a dataset can be linked to multiple models.

Step 3: Edit our template. The template is the interface by which the Fiddler Platform figures out how to load your model, run predictions on it, and explain model predictions. More detailed instructions for writing and testing this code will be shared along with the demo notebook for this example. Rest assured that it’s quite easy and well laid out.

Step 4: Upload the model and associated files using our custom_model_import function — all the files must be inside a folder, which is provided to the function.

With four simple steps, the model is now part of Fiddler. You can now use Fiddler’s model analysis and debugging tools, in addition to using it to serve explanations at scale. If this is a production model, you can use Fiddler’s Explainable Monitoring tools to monitor live traffic for this model, analyze data drift, outliers, data integrity of your pipeline, and general service health.
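To picture the three responsibilities that Step 3 assigns to the template (load the model, run predictions, explain predictions), here is a purely hypothetical sketch. The class name, method names, and keyword-based stub "model" are invented for illustration and are not Fiddler's actual template or API; the real interface will be documented with the demo notebook.

```python
class ToxicityModelTemplate:
    """Hypothetical template shape. NOT Fiddler's real interface."""

    def load_model(self):
        # Stand-in for loading fine-tuned weights: a one-word lookup table
        # with an invented placeholder weight.
        self.toxic_weights = {"silly": 0.6}

    def predict(self, text):
        """Return a toxicity score in [0, 1] for a single comment."""
        tokens = text.lower().split()
        return min(1.0, sum(self.toxic_weights.get(t, 0.0) for t in tokens))

    def explain(self, text):
        """Return one (token, attribution) pair per token in the comment."""
        tokens = text.lower().split()
        return [(t, self.toxic_weights.get(t, 0.0)) for t in tokens]

template = ToxicityModelTemplate()
template.load_model()
score = template.predict("that man is so silly")
explanation = template.explain("that man is so silly")
```

In a real template, the stub would be replaced by code that loads the PyTorch model and computes attributions with a method such as IG.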

Analyzing the model in Fiddler

Here we show a brief example of how we can use Fiddler’s NLP explain tools to debug and test this model. The comment in question (not from the dataset) is “that man is so silly”. We can see that this is a toxic comment, and ‘silly’ is the term that should get the most blame for the toxicity.

From the attributions above, we can see that the term ‘silly’ indeed gets the most blame. We can also see that ‘man’ gets a positive attribution. Is it because the model believes that ‘man’ is a toxic word, or is it because ‘man’ in this case is the subject of discussion? We can use the text edit tool to change it to ‘boy’ and see if it also gets a high positive attribution.

‘Boy’ in fact gets an even higher positive attribution, telling us that the model likely considers both terms toxic, and ‘boy’ more so. It’s quite possible that this is because the word ‘boy’ occurs in more toxic comments than ‘man’ does. We can use Fiddler’s model analysis feature to check if this is true.

This is indeed the case, as the very rough SQL query shows us. A sentence containing ‘boy’ is toxic roughly a third of the time, as opposed to almost a seventh of the time for ‘man’. Do note that this does not constitute hard proof of the model’s behavior, nor does it mean that ‘boy’ will always get a higher toxic attribution than ‘man’. Proving that will need much deeper analysis, which we plan to address in a subsequent post, along with a thorough investigation of the biases of the model and dataset in general.
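The query itself is not shown here, but the statistic it computes, the fraction of comments containing a given term that are labeled toxic, can be sketched in a few lines of Python. The function name and the tiny sample are invented for illustration; they are not drawn from the Wikipedia talk dataset and do not reproduce the rates quoted above.

```python
def toxic_rate_for_term(comments, term):
    """Fraction of comments containing `term` as a whole word that are labeled toxic."""
    matching = [is_toxic for text, is_toxic in comments
                if term in text.lower().split()]
    if not matching:
        return None  # the term never occurs, so the rate is undefined
    return sum(matching) / len(matching)

# Tiny invented sample of (comment, is_toxic) pairs, for illustration only.
comments = [
    ("that boy is a fool", True),
    ("the boy edited the page", False),
    ("the boy added a source", False),
    ("this man is a fool", True),
    ("the man fixed a typo", False),
    ("a man reverted my edit", False),
    ("the man cited a paper", False),
]
boy_rate = toxic_rate_for_term(comments, "boy")
man_rate = toxic_rate_for_term(comments, "man")
```

A gap between the two rates in the training data is consistent with the difference in attributions, though, as noted above, it does not by itself prove why the model behaves this way.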


Both teams are very excited about the potential of this collaboration to further model explainability research and applications. As a first step, we’ve made the two interoperable to make it easy for Fiddler users to upload PyTorch models. We plan to share the results of our work regularly, so stay tuned.

Authors: Narine Kokhlikyan, Research Scientist at Facebook AI, Ankur Taly, Head of Data Science at Fiddler Labs, Aalok Shanbhag, Data Scientist at Fiddler Labs