Explaining Machine Learning Models Using Contextual Importance and Contextual Utility

Introduction

Explainability is a hot topic in the machine learning research community these days. Over the past few years, many methods have been introduced to understand how a machine learning model makes a prediction. However, explainability is not an entirely new concept; work on it started a few decades ago. In this blog post, I will introduce a little-known but simple technique for explaining ML models that was introduced almost 20 years ago: Contextual Importance and Utility (CIU). I will show you how it can be used to explain any type of machine learning model. The method relies on the notion that context is important.

For example, imagine we are trying to predict house prices from a set of features such as the number of bedrooms and pools. If every house in the dataset has no pool (the current context), then the feature corresponding to it has no usefulness and no importance for the model's predictions. On the other hand, in a city where most houses have one or two bedrooms (again the current context), houses with three or more bedrooms are more unusual.

What kinds of explanations does CIU generate?

  1. It is a model-agnostic method, and it can explain the output of any “black-box” machine learning model.

  2. It produces local explanations, which means that the explanations are generated for individual instances (not the whole model), and they show which features are more important for an individual observation.

  3. It gives us post-hoc explanations as it is a method that processes the output of a machine learning model after training.

Unlike LIME and many other techniques, CIU does not approximate or transform what a model predicts but instead explains predictions directly. It can also provide contrastive explanations. For instance, why did the model predict rainy and not cloudy?

How does CIU work?

CIU estimates two values that aim to explain the context in which a machine learning model predicts:

Contextual Importance (CI) measures how much of the change in the range of output values can be attributed to one (or several) input variables. CI is based on the notion that a variable which results in a broader range of output values is more important. Formally, CI is defined as follows:

CI = (Cmax - Cmin)/(absmax - absmin)

Contextual Utility (CU) indicates how favorable the current value of one (or several) input variables is for a high output value. CU is computed using the following formula:

CU = (out - Cmin)/(Cmax - Cmin)

Cmax and Cmin are the highest and lowest values that the output of an ML model can take when the input feature(s) are changed. Obtaining Cmax and Cmin is not a trivial task, either computationally or mathematically. In the original paper, these values are estimated using a Monte Carlo simulation in which a large number of observations are generated. absmax and absmin denote the absolute range of values that the output can take. For example, in classification problems the outputs are predicted probabilities, so the absolute minimum and maximum are 0 and 1.
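
To make the two formulas concrete, here is a minimal sketch of a Monte Carlo estimate of CI and CU for a single feature. It is not the procedure from the original paper or from any library: the function `estimate_ci_cu`, the toy model, and the choice of uniform sampling between `low` and `high` are all assumptions made for illustration.

```python
import numpy as np

def estimate_ci_cu(predict, instance, feature, low, high,
                   absmin=0.0, absmax=1.0, n_samples=1000, seed=0):
    """Estimate CI and CU for one feature of one instance by random sampling."""
    rng = np.random.default_rng(seed)
    base = np.asarray(instance, dtype=float)
    # Copy the instance n_samples times and vary only the chosen feature.
    samples = np.tile(base, (n_samples, 1))
    samples[:, feature] = rng.uniform(low, high, size=n_samples)
    outputs = predict(samples)
    # Include the actual output so that Cmin <= out <= Cmax always holds.
    out = predict(base.reshape(1, -1))[0]
    cmin = min(outputs.min(), out)
    cmax = max(outputs.max(), out)
    ci = (cmax - cmin) / (absmax - absmin)
    # If the output never changes, CU is undefined; this sketch reports 0.0.
    cu = (out - cmin) / (cmax - cmin) if cmax > cmin else 0.0
    return ci, cu

# Hypothetical model: the output depends on feature 0 and ignores feature 1.
def toy_model(X):
    return 0.8 * X[:, 0] + 0.1

instance = [0.75, 0.3]
ci0, cu0 = estimate_ci_cu(toy_model, instance, feature=0, low=0.0, high=1.0)
ci1, cu1 = estimate_ci_cu(toy_model, instance, feature=1, low=0.0, high=1.0)
print(f"feature 0: CI={ci0:.2f}, CU={cu0:.2f}")
print(f"feature 1: CI={ci1:.2f}, CU={cu1:.2f}")
```

Feature 0 can move the output across most of the [0.1, 0.9] interval, so its CI comes out close to 0.8, and the current value 0.75 sits three quarters of the way up that interval, so its CU is close to 0.75. Feature 1 cannot change the output at all, so its CI is 0, which mirrors the "no house has a pool" situation from the introduction.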

CIU is implemented in both Python and R. For simplicity, I will use its Python implementation (the py-ciu library) in this blog post.

You can install py-ciu using the pip command:

pip install py-ciu

A toy example: predicting breast cancer

I will use the breast cancer dataset in scikit-learn to show how we can use CIU. I will train three different machine learning models, including a decision tree, a random forest, and a gradient boosting algorithm on this dataset and compute CI and CU values for a single instance from the test dataset.

First, we need to load the necessary libraries and modules.

from ciu import determine_ciu
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# for reproducibility
np.random.seed(123)

Then we split the dataset into a training and test set. We train our machine learning models on the training dataset and evaluate their performance on the test dataset. Note that for explaining ML models, we should use samples from the test dataset and not the training dataset.

X = pd.DataFrame(load_breast_cancer()['data'])
y = load_breast_cancer()['target']
X.columns = load_breast_cancer()['feature_names']
X_train,X_test, y_train,y_test = train_test_split(X,y,stratify = y)
def fit_evaluate_model(clf):
  clf = clf.fit(X_train, y_train)
  print(' Accuracy on test dataset {}'.format(clf.score(X_test,y_test)))
  return clf

Permutation feature importance

As mentioned before, CIU only generates local explanations and does not give us a global overview of how a model makes a prediction. To get a global view of which features matter to the model, we can compute permutation feature importance scores:

def print_permutation_importance(model):
  imp_features = []
  pi = permutation_importance(model, X_test, y_test,
                            n_repeats=30,
                           random_state=0)
  for i in pi.importances_mean.argsort()[::-1]:
       if pi.importances_mean[i] - 2 * pi.importances_std[i] > 0:
           print(f"{X_test.columns[i]:<8} "
                 f"{pi.importances_mean[i]:.3f} "
                 f" +/- {pi.importances_std[i]:.3f}")
           imp_features.append(pi.importances_mean[i])
  # check once, after the loop, whether any feature passed the threshold
  if len(imp_features) == 0:
      print('no important features')
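
For readers who want to see what a permutation importance score measures, here is a rough from-scratch sketch: shuffle one column at a time and record how much the test accuracy drops. The helper `manual_permutation_importance` and the `*_demo` variables are hypothetical names for this illustration; the sketch is not how scikit-learn implements `permutation_importance` internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean and std of the accuracy drop after shuffling each column of X."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)

# Plain NumPy arrays are used here to keep the sketch independent of the code above.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
names_demo = load_breast_cancer()['feature_names']
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr_demo, y_tr_demo)

mean_drop, std_drop = manual_permutation_importance(tree_demo, X_te_demo, y_te_demo)
for j in np.argsort(mean_drop)[::-1][:3]:
    print(f"{names_demo[j]:<25} {mean_drop[j]:.3f} +/- {std_drop[j]:.3f}")
```

Features that the tree never splits on get a drop of exactly zero, because shuffling them cannot change any prediction.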

Decision Tree Classifier

Since this is just a toy example, I will not be very picky about my models’ hyper-parameters and will leave them at their default values in sklearn.

dt = DecisionTreeClassifier()
dt_fit = fit_evaluate_model(dt)
##  Accuracy on test dataset 0.9370629370629371
print_permutation_importance(dt_fit)
## worst perimeter 0.173  +/- 0.019
## worst concave points 0.145  +/- 0.023
## worst concavity 0.135  +/- 0.017
## worst area 0.063  +/- 0.014
## radius error 0.036  +/- 0.014
## worst smoothness 0.018  +/- 0.008
## mean area 0.017  +/- 0.006

Random Forest Classifier

rf = RandomForestClassifier()
rf_fit = fit_evaluate_model(rf)
##  Accuracy on test dataset 0.972027972027972
print_permutation_importance(rf_fit)
## worst texture 0.023  +/- 0.004
## mean texture 0.013  +/- 0.006
## worst smoothness 0.010  +/- 0.004
## mean concavity 0.010  +/- 0.005
## worst fractal dimension 0.006  +/- 0.003

Gradient Boosting Classifier

gb = GradientBoostingClassifier()
gb_fit = fit_evaluate_model(gb)
##  Accuracy on test dataset 0.9790209790209791
print_permutation_importance(gb_fit)
## worst concave points 0.023  +/- 0.011
## mean concave points 0.021  +/- 0.010

The random forest and gradient boosting classifiers have similar accuracy scores (about 0.97 and 0.98); however, their most important features are different.

Explaining a single observation

Now let us explain how each model predicts a single example (observation) from the test dataset.

example = X_test.iloc[1,:]
example_prediction = gb.predict(example.values.reshape(1, -1))
example_prediction_prob = gb.predict_proba(example.values.reshape(1, -1))
prediction_index = 0 if example_prediction > 0.5 else 1
print(f'Prediction {example_prediction}; Probability: {example_prediction_prob}')
## Prediction [1]; Probability: [[0.10952357 0.89047643]]

To obtain a CIU score, we need to compute the minimum and maximum observed value of each feature in the dataset.

def min_max_features(X_train):
  min_max = dict()
  for i in range(len(X_train.columns)):
      min_max[X_train.columns[i]] =[X_train.iloc[:,i].min(),X_train.iloc[:,i].max(),False]
  return min_max
  
min_max = min_max_features(X_train)
def explain_ciu(example,model):
  ciu = determine_ciu(
      example.to_dict(),
      model.predict_proba,
      min_max,
      1000,
      prediction_index,
  )
  return ciu
dt_ciu = explain_ciu(example,dt_fit)
rf_ciu = explain_ciu(example,rf_fit)
gb_ciu = explain_ciu(example,gb_fit)
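
To show conceptually what such a call has to do, here is a hedged from-scratch sketch that estimates CI and CU for one feature of one test instance using only scikit-learn, NumPy, and pandas. It is not py-ciu's actual implementation: the helper `ci_cu_for_feature`, the uniform sampling between the training minimum and maximum, and the `*_demo` variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data_demo = load_breast_cancer(as_frame=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    data_demo.data, data_demo.target, stratify=data_demo.target, random_state=0)
forest_demo = RandomForestClassifier(random_state=0).fit(X_tr_demo, y_tr_demo)

def ci_cu_for_feature(model, row, feature, low, high, class_index,
                      n_samples=1000, seed=0):
    """Estimate CI and CU of `feature` for the instance `row` (a pandas Series)."""
    rng = np.random.default_rng(seed)
    # Repeat the instance and replace one column with values drawn from [low, high].
    perturbed = pd.DataFrame(np.tile(row.to_numpy(), (n_samples, 1)), columns=row.index)
    perturbed[feature] = rng.uniform(low, high, size=n_samples)
    probs = model.predict_proba(perturbed)[:, class_index]
    out = model.predict_proba(row.to_frame().T)[0, class_index]
    cmin, cmax = min(probs.min(), out), max(probs.max(), out)
    ci = (cmax - cmin) / (1.0 - 0.0)  # probabilities: absmin = 0, absmax = 1
    # If the probability never changes, CU is undefined; this sketch reports 0.0.
    cu = (out - cmin) / (cmax - cmin) if cmax > cmin else 0.0
    return ci, cu

row_demo = X_te_demo.iloc[1]
# Explain the probability of the class the model actually predicts.
class_demo = int(forest_demo.predict(row_demo.to_frame().T)[0])
feature_demo = "worst perimeter"
ci_demo, cu_demo = ci_cu_for_feature(
    forest_demo, row_demo, feature_demo,
    X_tr_demo[feature_demo].min(), X_tr_demo[feature_demo].max(), class_demo)
print(f"{feature_demo}: CI={ci_demo:.2%}, CU={cu_demo:.2%}")
```

The numbers from this sketch will not match the library's output exactly, since the sampling scheme, the train/test split, and the random seed all differ.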

Generating Textual Explanations

We can obtain a textual explanation from CIU that indicates which feature(s) are important for our test example:

dt_ciu.text_explain()
## ['The feature "mean radius", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean texture", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean perimeter", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean area", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean smoothness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean compactness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean concavity", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean concave points", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean symmetry", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean fractal dimension", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "radius error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "texture error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "perimeter error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "area error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "smoothness error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "compactness error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "concavity error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "concave points error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "symmetry error", which is not important 
(CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "fractal dimension error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst radius", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst texture", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst perimeter", which is highly important (CI=100.0%), is very typical for its class (CU=100.0%).', 'The feature "worst area", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst smoothness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst compactness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst concavity", which is highly important (CI=100.0%), is very typical for its class (CU=100.0%).', 'The feature "worst concave points", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst symmetry", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst fractal dimension", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).']
rf_ciu.text_explain()
## ['The feature "mean radius", which is important (CI=32.26%), is very typical for its class (CU=90.0%).', 'The feature "mean texture", which is important (CI=35.48%), is unlikely for its class (CU=27.27%).', 'The feature "mean perimeter", which is not important (CI=12.9%), is typical for its class (CU=50.0%).', 'The feature "mean area", which is not important (CI=16.13%), is unlikely for its class (CU=40.0%).', 'The feature "mean smoothness", which is not important (CI=12.9%), is typical for its class (CU=50.0%).', 'The feature "mean compactness", which is not important (CI=9.68%), is unlikely for its class (CU=33.33%).', 'The feature "mean concavity", which is not important (CI=16.13%), is not typical for its class (CU=20.0%).', 'The feature "mean concave points", which is not important (CI=19.35%), is not typical for its class (CU=16.67%).', 'The feature "mean symmetry", which is important (CI=38.71%), is very typical for its class (CU=100.0%).', 'The feature "mean fractal dimension", which is not important (CI=6.45%), is not typical for its class (CU=0.1%).', 'The feature "radius error", which is not important (CI=22.58%), is typical for its class (CU=71.43%).', 'The feature "texture error", which is not important (CI=22.58%), is very typical for its class (CU=85.71%).', 'The feature "perimeter error", which is not important (CI=22.58%), is unlikely for its class (CU=42.86%).', 'The feature "area error", which is important (CI=38.71%), is unlikely for its class (CU=33.33%).', 'The feature "smoothness error", which is not important (CI=3.23%), is very typical for its class (CU=100.0%).', 'The feature "compactness error", which is not important (CI=12.9%), is typical for its class (CU=50.0%).', 'The feature "concavity error", which is not important (CI=6.45%), is very typical for its class (CU=100.0%).', 'The feature "concave points error", which is not important (CI=9.68%), is typical for its class (CU=66.67%).', 'The feature "symmetry error", which is not 
important (CI=6.45%), is typical for its class (CU=50.0%).', 'The feature "fractal dimension error", which is not important (CI=19.35%), is very typical for its class (CU=100.0%).', 'The feature "worst radius", which is very important (CI=51.61%), is very typical for its class (CU=87.5%).', 'The feature "worst texture", which is very important (CI=67.74%), is unlikely for its class (CU=33.33%).', 'The feature "worst perimeter", which is very important (CI=70.97%), is typical for its class (CU=63.64%).', 'The feature "worst area", which is very important (CI=61.29%), is typical for its class (CU=57.89%).', 'The feature "worst smoothness", which is not important (CI=6.45%), is typical for its class (CU=50.0%).', 'The feature "worst compactness", which is not important (CI=9.68%), is unlikely for its class (CU=33.33%).', 'The feature "worst concavity", which is very important (CI=64.52%), is very typical for its class (CU=85.0%).', 'The feature "worst concave points", which is important (CI=38.71%), is not typical for its class (CU=16.67%).', 'The feature "worst symmetry", which is important (CI=25.81%), is typical for its class (CU=50.0%).', 'The feature "worst fractal dimension", which is not important (CI=3.23%), is not typical for its class (CU=0.1%).']
gb_ciu.text_explain()
## ['The feature "mean radius", which is not important (CI=16.49%), is not typical for its class (CU=0.65%).', 'The feature "mean texture", which is highly important (CI=90.14%), is not typical for its class (CU=3.76%).', 'The feature "mean perimeter", which is not important (CI=2.63%), is not typical for its class (CU=0.1%).', 'The feature "mean area", which is not important (CI=3.36%), is very typical for its class (CU=100.0%).', 'The feature "mean smoothness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean compactness", which is important (CI=34.53%), is not typical for its class (CU=0.1%).', 'The feature "mean concavity", which is not important (CI=4.0%), is not typical for its class (CU=8.92%).', 'The feature "mean concave points", which is important (CI=38.25%), is not typical for its class (CU=3.57%).', 'The feature "mean symmetry", which is not important (CI=8.91%), is very typical for its class (CU=100.0%).', 'The feature "mean fractal dimension", which is not important (CI=1.54%), is not typical for its class (CU=0.1%).', 'The feature "radius error", which is not important (CI=10.53%), is not typical for its class (CU=0.1%).', 'The feature "texture error", which is not important (CI=6.53%), is very typical for its class (CU=100.0%).', 'The feature "perimeter error", which is not important (CI=1.48%), is not typical for its class (CU=0.1%).', 'The feature "area error", which is very important (CI=57.97%), is not typical for its class (CU=0.1%).', 'The feature "smoothness error", which is not important (CI=16.51%), is not typical for its class (CU=0.1%).', 'The feature "compactness error", which is not important (CI=4.39%), is not typical for its class (CU=0.1%).', 'The feature "concavity error", which is not important (CI=4.03%), is not typical for its class (CU=0.1%).', 'The feature "concave points error", which is not important (CI=5.76%), is very typical for its class (CU=100.0%).', 'The feature "symmetry 
error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "fractal dimension error", which is not important (CI=21.47%), is not typical for its class (CU=17.33%).', 'The feature "worst radius", which is not important (CI=1.27%), is very typical for its class (CU=100.0%).', 'The feature "worst texture", which is very important (CI=60.61%), is not typical for its class (CU=13.75%).', 'The feature "worst perimeter", which is important (CI=41.37%), is not typical for its class (CU=23.17%).', 'The feature "worst area", which is not important (CI=19.51%), is typical for its class (CU=67.91%).', 'The feature "worst smoothness", which is not important (CI=18.24%), is unlikely for its class (CU=48.97%).', 'The feature "worst compactness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst concavity", which is not important (CI=10.79%), is very typical for its class (CU=100.0%).', 'The feature "worst concave points", which is important (CI=42.94%), is not typical for its class (CU=4.32%).', 'The feature "worst symmetry", which is not important (CI=5.86%), is not typical for its class (CU=0.1%).', 'The feature "worst fractal dimension", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).']
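
These lists are hard to scan by eye. As a small convenience, the sketch below parses sentences of the form shown above and sorts the features by CI. The regular expression and the helper `rank_explanations` are my own assumptions based on the printed output, not part of py-ciu; the three sample sentences are copied from the random forest output.

```python
import re

# Matches: The feature "<name>", ... (CI=<number>%), ... (CU=<number>%).
PATTERN = re.compile(
    r'The feature "(?P<feature>[^"]+)".*?\(CI=(?P<ci>[\d.]+)%\).*?\(CU=(?P<cu>[\d.]+)%\)')

def rank_explanations(sentences):
    """Return (feature, CI, CU) tuples sorted by CI, highest first."""
    rows = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            rows.append((match["feature"], float(match["ci"]), float(match["cu"])))
    return sorted(rows, key=lambda item: item[1], reverse=True)

sample = [
    'The feature "mean radius", which is important (CI=32.26%), is very typical for its class (CU=90.0%).',
    'The feature "worst perimeter", which is very important (CI=70.97%), is typical for its class (CU=63.64%).',
    'The feature "worst smoothness", which is not important (CI=6.45%), is typical for its class (CU=50.0%).',
]
ranked = rank_explanations(sample)
for feature, ci, cu in ranked:
    print(f"{feature:<20} CI={ci:6.2f}%  CU={cu:6.2f}%")
```

In the post's setting, the same function could be applied to the list returned by `rf_ciu.text_explain()`, assuming the sentences keep this format.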

Drawbacks

Although CIU is a brilliant and simple technique, I believe it has the following drawbacks:

  1. In regression problems, the range of possible values for the target variable can be infinite, which leaves the absolute range used to compute CI undefined. The authors said that they had put a limit on the range of values.

  2. The computed range of values can be misleading, especially when we have outliers in the dataset.

  3. It is not clear how we can get a global explanation for the model using CIU.
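
One possible way to soften the second drawback is to build the feature ranges from quantiles instead of the raw minimum and maximum. This is only a sketch of an idea, not something proposed by the CIU authors: `quantile_min_max` is a hypothetical helper that returns the same three-element lists as `min_max_features` above, with the third entry copied unchanged.

```python
import numpy as np
import pandas as pd

def quantile_min_max(X, lower=0.01, upper=0.99):
    """Build feature ranges from quantiles so a few outliers do not stretch them."""
    ranges = {}
    for column in X.columns:
        low, high = X[column].quantile([lower, upper])
        ranges[column] = [float(low), float(high), False]
    return ranges

# Synthetic data: 999 ordinary values plus one extreme outlier.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"area": np.append(rng.normal(100, 10, size=999), 10_000.0)})

raw_max = demo["area"].max()
robust_range = quantile_min_max(demo)["area"]
print(f"raw max: {raw_max}")
print(f"quantile-based range: {robust_range[0]:.1f} to {robust_range[1]:.1f}")
```

The trade-off is that an instance whose own value lies outside the trimmed range is no longer covered by the sampled interval, so it would need separate handling.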

Muhammad Chenariyan Nakhaee
Machine Learning Researcher

I am Muhammad, a data scientist, and a machine learning enthusiast.
