Explaining Machine Learning Models Using Contextual Importance and Contextual Utility

Introduction

Explainability is a hot topic in the machine learning research community these days. Over the past few years, many methods have been introduced to help us understand how a machine learning model makes predictions. However, explainability is not an entirely new concept; work on it started a few decades ago. In this blog post, I will introduce you to a relatively little-known but simple technique that was introduced almost 20 years ago: Contextual Importance and Utility (CIU). I will show you how we can use it to explain any type of machine learning model. The method relies on the notion that context matters.

For example, imagine we are trying to predict house prices from a set of features such as the number of bedrooms and the number of pools. If no house in the dataset has a pool (the current context), then the pool feature has no usefulness and no importance for the model's prediction. On the other hand, in a city where the majority of houses have one or two bedrooms (again, the current context), houses with three or more bedrooms stand out.

What kinds of explanations does CIU generate?

  1. It is a model-agnostic method: it can explain the output of any “black-box” machine learning model.
  2. It produces local explanations: explanations are generated for individual instances (not for the whole model) and only show which features matter for an individual observation.
  3. It gives us post-hoc explanations: it processes the output of a machine learning model after training.

Unlike LIME and many other techniques, CIU does not approximate or transform what a model predicts but rather explains predictions directly. It can also provide contrastive explanations, for instance: why did the model predict rainy and not cloudy?

How does CIU work?

CIU estimates two values that aim to explain the context in which a machine learning model makes a prediction:

Contextual Importance (CI) is a measure of how much of the change in the range of output values can be attributed to one (or several) input variables. CI is based on the notion that a variable which produces a wider range of output values is more important. Formally, CI is defined as follows:

CI = (Cmax - Cmin)/(absmax - absmin)

Contextual Utility (CU) indicates how favorable the current value of one (or several) input variables is for a high output value. CU is computed using the following formula:

CU = (out - Cmin)/(Cmax - Cmin)

Cmax and Cmin are the highest and lowest values that the output of an ML model can take as the input feature(s) vary. Obtaining Cmax and Cmin is not a trivial task, computationally or mathematically. In the original paper, these values are computed using a Monte Carlo simulation in which a large number of observations are generated. absmax and absmin indicate the absolute range of values that the output can take. For example, in classification problems the output is a predicted probability, so the absolute range lies between 0 and 1.
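As a quick worked example: if varying a feature moves the predicted probability between Cmin = 0.2 and Cmax = 0.8 while the absolute output range is [0, 1], then CI = (0.8 - 0.2)/(1 - 0) = 0.6. If the model outputs 0.7 for the current instance, CU = (0.7 - 0.2)/(0.8 - 0.2) ≈ 0.83. The following minimal sketch shows how the Monte Carlo estimate could look for a single feature. It is my own illustration of the idea, not the py-ciu implementation, and it assumes a predict_fn that maps a 2-D input array to a 1-D array of outputs (for instance, the probability of one class):

import numpy as np

def estimate_ci_cu(predict_fn, instance, feature_idx, f_min, f_max,
                   n_samples=1000, abs_min=0.0, abs_max=1.0):
    # Perturb only the chosen feature, keeping the rest of the instance fixed.
    samples = np.tile(instance, (n_samples, 1))
    samples[:, feature_idx] = np.random.uniform(f_min, f_max, n_samples)
    outputs = predict_fn(samples)
    c_min, c_max = outputs.min(), outputs.max()
    # Output for the actual (unperturbed) instance.
    out = predict_fn(instance.reshape(1, -1))[0]
    ci = (c_max - c_min) / (abs_max - abs_min)
    cu = (out - c_min) / (c_max - c_min) if c_max > c_min else 0.0
    return ci, cu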

CIU is implemented in both Python and R. For simplicity, I will use its Python implementation (the py-ciu library) in this blog post.

You can install py-ciu using the pip command:

pip install py-ciu

A toy example: predicting breast cancer

I will use the breast cancer dataset from scikit-learn to show how we can use CIU. I will train three different machine learning models on this dataset (a decision tree, a random forest, and a gradient boosting classifier) and compute CI and CU values for a single instance from the test dataset.

First we need to load necessary libraries and modules.

from ciu import determine_ciu
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# for reproducibility
np.random.seed(123)

Then we split the dataset into a training set and a test set. We train our machine learning models on the training set and evaluate their performance on the test set. Note that for explaining the models, we also use samples from the test set, not the training set.

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

def fit_evaluate_model(clf):
  # fit on the training set and report accuracy on the held-out test set
  clf = clf.fit(X_train, y_train)
  print(' Accuracy on test dataset {}'.format(clf.score(X_test, y_test)))
  return clf

Permutation feature importance

As I mentioned before, CIU only generates local explanations and does not give us a global overview of how a model makes predictions. To get a sense of each model's globally important features, let us compute permutation feature importance scores, which are implemented in scikit-learn.

def print_permutation_importance(model):
  imp_features = []
  pi = permutation_importance(model, X_test, y_test,
                              n_repeats=30,
                              random_state=0)
  # report features whose mean importance is at least two standard
  # deviations above zero, from most to least important
  for i in pi.importances_mean.argsort()[::-1]:
      if pi.importances_mean[i] - 2 * pi.importances_std[i] > 0:
          print(f"{X_test.columns[i]:<8} "
                f"{pi.importances_mean[i]:.3f} "
                f" +/- {pi.importances_std[i]:.3f}")
          imp_features.append(pi.importances_mean[i])
  if len(imp_features) == 0:
      print('no important features')

Decision Tree Classifier

Since this is just a toy example, I won't be very picky about the hyper-parameters of my models and will leave them at their scikit-learn default values.

dt = DecisionTreeClassifier()
dt_fit = fit_evaluate_model(dt)
##  Accuracy on test dataset 0.9370629370629371
print_permutation_importance(dt_fit)
## worst perimeter 0.173  +/- 0.019
## worst concave points 0.145  +/- 0.023
## worst concavity 0.135  +/- 0.017
## worst area 0.063  +/- 0.014
## radius error 0.036  +/- 0.014
## worst smoothness 0.018  +/- 0.008
## mean area 0.017  +/- 0.006

Random Forest Classifier

rf = RandomForestClassifier()
rf_fit = fit_evaluate_model(rf)
##  Accuracy on test dataset 0.972027972027972
print_permutation_importance(rf_fit)
## worst texture 0.023  +/- 0.004
## mean texture 0.013  +/- 0.006
## worst smoothness 0.010  +/- 0.004
## mean concavity 0.010  +/- 0.005
## worst fractal dimension 0.006  +/- 0.003

Gradient Boosting Classifier

gb = GradientBoostingClassifier()
gb_fit = fit_evaluate_model(gb)
##  Accuracy on test dataset 0.9790209790209791
print_permutation_importance(gb_fit)
## worst concave points 0.024  +/- 0.011
## mean concave points 0.021  +/- 0.010

The random forest and gradient boosting classifiers have almost the same accuracy score; however, their most important features are different.

Explaining a single observation

Now let's explain how each model makes a prediction for a single example (observation) from the test dataset.

example = X_test.iloc[1,:]
example_prediction = gb.predict(example.values.reshape(1, -1))
example_prediction_prob = gb.predict_proba(example.values.reshape(1, -1))
# column of predict_proba to explain: the predicted class
prediction_index = 1 if example_prediction[0] > 0.5 else 0
print(f'Prediction {example_prediction}; Probability: {example_prediction_prob}')
## Prediction [1]; Probability: [[0.10952357 0.89047643]]

To obtain CIU scores, we first need to compute the minimum and maximum observed values of each feature in the training dataset.

def min_max_features(X_train):
  # py-ciu expects, for each feature, a list [min, max, is_integer];
  # the final flag marks whether the feature only takes integer values
  min_max = dict()
  for col in X_train.columns:
      min_max[col] = [X_train[col].min(), X_train[col].max(), False]
  return min_max

min_max = min_max_features(X_train)

def explain_ciu(example, model):
  ciu = determine_ciu(
      example.to_dict(),      # the instance to explain
      model.predict_proba,    # the model's prediction function
      min_max,                # feature value ranges
      1000,                   # number of Monte Carlo samples
      prediction_index,       # which output column to explain
  )
  return ciu

dt_ciu = explain_ciu(example, dt_fit)
rf_ciu = explain_ciu(example, rf_fit)
gb_ciu = explain_ciu(example, gb_fit)

Generating Textual Explanations

We can obtain a textual explanation from CIU that indicates which features are important for our test example:

dt_ciu.text_explain()
## ['The feature "mean radius", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean texture", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean perimeter", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean area", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean smoothness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean compactness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean concavity", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean concave points", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean symmetry", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean fractal dimension", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "radius error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "texture error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "perimeter error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "area error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "smoothness error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "compactness error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "concavity error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "concave points error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "symmetry error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "fractal dimension error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst radius", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst texture", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst perimeter", which is highly important (CI=100.0%), is very typical for its class (CU=100.0%).', 'The feature "worst area", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst smoothness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst compactness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst concavity", which is highly important (CI=100.0%), is very typical for its class (CU=100.0%).', 'The feature "worst concave points", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst symmetry", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst fractal dimension", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).']
rf_ciu.text_explain()
## ['The feature "mean radius", which is important (CI=32.26%), is very typical for its class (CU=90.0%).', 'The feature "mean texture", which is important (CI=35.48%), is unlikely for its class (CU=27.27%).', 'The feature "mean perimeter", which is not important (CI=12.9%), is typical for its class (CU=50.0%).', 'The feature "mean area", which is not important (CI=19.35%), is unlikely for its class (CU=33.33%).', 'The feature "mean smoothness", which is not important (CI=12.9%), is typical for its class (CU=50.0%).', 'The feature "mean compactness", which is not important (CI=9.68%), is unlikely for its class (CU=33.33%).', 'The feature "mean concavity", which is not important (CI=16.13%), is not typical for its class (CU=20.0%).', 'The feature "mean concave points", which is not important (CI=19.35%), is not typical for its class (CU=16.67%).', 'The feature "mean symmetry", which is important (CI=38.71%), is very typical for its class (CU=100.0%).', 'The feature "mean fractal dimension", which is not important (CI=6.45%), is not typical for its class (CU=0.1%).', 'The feature "radius error", which is not important (CI=22.58%), is typical for its class (CU=71.43%).', 'The feature "texture error", which is not important (CI=22.58%), is very typical for its class (CU=85.71%).', 'The feature "perimeter error", which is not important (CI=22.58%), is unlikely for its class (CU=42.86%).', 'The feature "area error", which is important (CI=38.71%), is unlikely for its class (CU=33.33%).', 'The feature "smoothness error", which is not important (CI=3.23%), is very typical for its class (CU=100.0%).', 'The feature "compactness error", which is not important (CI=12.9%), is typical for its class (CU=50.0%).', 'The feature "concavity error", which is not important (CI=6.45%), is very typical for its class (CU=100.0%).', 'The feature "concave points error", which is not important (CI=9.68%), is typical for its class (CU=66.67%).', 'The feature "symmetry error", which is not important (CI=6.45%), is typical for its class (CU=50.0%).', 'The feature "fractal dimension error", which is not important (CI=16.13%), is very typical for its class (CU=100.0%).', 'The feature "worst radius", which is very important (CI=51.61%), is very typical for its class (CU=87.5%).', 'The feature "worst texture", which is very important (CI=67.74%), is unlikely for its class (CU=33.33%).', 'The feature "worst perimeter", which is very important (CI=70.97%), is typical for its class (CU=63.64%).', 'The feature "worst area", which is very important (CI=61.29%), is typical for its class (CU=57.89%).', 'The feature "worst smoothness", which is not important (CI=6.45%), is typical for its class (CU=50.0%).', 'The feature "worst compactness", which is not important (CI=9.68%), is unlikely for its class (CU=33.33%).', 'The feature "worst concavity", which is very important (CI=64.52%), is very typical for its class (CU=85.0%).', 'The feature "worst concave points", which is important (CI=38.71%), is not typical for its class (CU=16.67%).', 'The feature "worst symmetry", which is important (CI=25.81%), is typical for its class (CU=50.0%).', 'The feature "worst fractal dimension", which is not important (CI=3.23%), is not typical for its class (CU=0.1%).']
gb_ciu.text_explain()
## ['The feature "mean radius", which is not important (CI=16.49%), is not typical for its class (CU=0.65%).', 'The feature "mean texture", which is highly important (CI=90.14%), is not typical for its class (CU=3.76%).', 'The feature "mean perimeter", which is not important (CI=2.63%), is not typical for its class (CU=0.1%).', 'The feature "mean area", which is not important (CI=3.36%), is very typical for its class (CU=100.0%).', 'The feature "mean smoothness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "mean compactness", which is important (CI=37.26%), is not typical for its class (CU=0.1%).', 'The feature "mean concavity", which is not important (CI=4.0%), is not typical for its class (CU=8.92%).', 'The feature "mean concave points", which is important (CI=38.25%), is not typical for its class (CU=3.57%).', 'The feature "mean symmetry", which is not important (CI=8.91%), is very typical for its class (CU=100.0%).', 'The feature "mean fractal dimension", which is not important (CI=1.54%), is not typical for its class (CU=0.1%).', 'The feature "radius error", which is not important (CI=10.53%), is not typical for its class (CU=0.1%).', 'The feature "texture error", which is not important (CI=6.53%), is very typical for its class (CU=100.0%).', 'The feature "perimeter error", which is not important (CI=1.48%), is not typical for its class (CU=0.1%).', 'The feature "area error", which is very important (CI=57.97%), is not typical for its class (CU=0.1%).', 'The feature "smoothness error", which is not important (CI=16.51%), is not typical for its class (CU=0.1%).', 'The feature "compactness error", which is not important (CI=4.39%), is not typical for its class (CU=0.1%).', 'The feature "concavity error", which is not important (CI=4.03%), is not typical for its class (CU=0.1%).', 'The feature "concave points error", which is not important (CI=5.76%), is very typical for its class (CU=100.0%).', 'The feature "symmetry error", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "fractal dimension error", which is not important (CI=21.47%), is not typical for its class (CU=17.33%).', 'The feature "worst radius", which is not important (CI=1.27%), is very typical for its class (CU=100.0%).', 'The feature "worst texture", which is very important (CI=60.61%), is not typical for its class (CU=13.75%).', 'The feature "worst perimeter", which is important (CI=41.37%), is not typical for its class (CU=23.17%).', 'The feature "worst area", which is not important (CI=19.51%), is typical for its class (CU=67.91%).', 'The feature "worst smoothness", which is not important (CI=18.24%), is unlikely for its class (CU=48.97%).', 'The feature "worst compactness", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).', 'The feature "worst concavity", which is not important (CI=10.79%), is very typical for its class (CU=100.0%).', 'The feature "worst concave points", which is important (CI=42.94%), is not typical for its class (CU=4.32%).', 'The feature "worst symmetry", which is not important (CI=5.86%), is not typical for its class (CU=0.1%).', 'The feature "worst fractal dimension", which is not important (CI=0.0%), is not typical for its class (CU=0.1%).']

Drawbacks

Although CIU is a brilliant and simple technique, I believe it has the following drawbacks:

  1. In regression problems, the range of possible values for the target variable can be unbounded, which makes the absolute range in the CI formula ill-defined. The authors state that they put a limit on the range of values.

  2. Computing the range of values can be misleading, especially when we have outliers in the dataset (see the sketch after this list).

  3. It is not clear how we can get a global explanation for the model using CIU.
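
To make drawback 2 concrete, here is a small, hypothetical illustration: a single outlier stretches the min/max range of a feature, so the Monte Carlo samples that CIU draws explore unrealistic values. Clipping the range at percentiles is one possible mitigation; this is my own suggestion, not something py-ciu does:

import numpy as np

values = np.append(np.random.normal(10, 1, 999), 10_000)  # one extreme outlier
print(values.min(), values.max())   # the range is dominated by the outlier
# a more robust range, clipped at the 1st and 99th percentiles
robust_min, robust_max = np.percentile(values, [1, 99])
print(robust_min, robust_max)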
