XPER: Unveiling the Driving Forces of Predictive Performance

A new method for decomposing your favorite performance metrics

Co-authored with S. Hué, C. Hurlin, and C. Pérignon.

I – From explaining model forecasts to explaining model performance

The trustworthiness and acceptability of sensitive AI systems largely depend on the ability of users to understand the associated models, or at least their forecasts. To lift the veil on opaque AI applications, eXplainable AI (XAI) methods, such as post-hoc interpretability tools (e.g., SHAP, LIME), are now widely used, and the insights they generate are well understood.

Beyond individual forecasts, we show in this article how to identify the drivers of the performance metrics (e.g., AUC, R²) of any classification or regression model using the eXplainable PERformance (XPER) methodology. Being able to identify the driving forces of the statistical or economic performance of a predictive model lies at the very core of modeling and is of great importance for both data scientists and experts basing their decisions on such models. The XPER library outlined below has proven to be an efficient tool for decomposing performance metrics into individual feature contributions.

Although they are grounded in the same mathematical principles, XPER and SHAP are fundamentally different: they have different goals. SHAP pinpoints the features that most influence the model's individual predictions, whereas XPER identifies the features that contribute the most to the performance of the model. The latter analysis can be conducted at the global (model) level or at the local (instance) level. In practice, the feature with the strongest impact on individual forecasts (say, feature A) may not be the one with the strongest impact on performance. Indeed, feature A drives individual decisions both when the model is correct and when it makes an error. Conceptually, if feature A mainly impacts erroneous predictions, it may rank lower with XPER than it does with SHAP.

What is a performance decomposition used for? First, it can enhance any post-hoc interpretability analysis by offering a more comprehensive insight into the model's inner workings. This allows for a deeper understanding of why the model is, or is not, performing effectively. Second, XPER can help identify and address heterogeneity concerns. Indeed, by analyzing individual XPER values, it is possible to pinpoint subsamples in which the features have similar effects on performance. One can then estimate a separate model for each subsample to boost predictive performance. Third, XPER can help us understand the origin of overfitting: it allows us to identify features that contribute more to the performance of the model in the training sample than in the test sample.

II – XPER values

The XPER framework is a theoretically grounded method based on Shapley values (Shapley, 1953), a decomposition method from coalitional game theory. While Shapley values decompose a payoff among players in a game, XPER values decompose a performance metric (e.g., AUC, R²) among the features of a model.
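
Schematically, the XPER value of a feature j applies the standard Shapley weighting scheme to a "payoff" defined as the model's performance. Denoting by F the set of features and by v(S) the performance achieved when only the features in a coalition S ⊆ F carry relevant information (the formal construction of v is given in Hué et al., 2023 [3]), a sketch of the contribution of feature j reads:

𝜙ⱼ = Σ_{S ⊆ F∖{j}} [ |S|! (|F| − |S| − 1)! / |F|! ] × ( v(S ∪ {j}) − v(S) )

with 𝜙₀ = v(∅) playing the role of the benchmark.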

Suppose that we train a classification model using three features and that its predictive performance is measured with an AUC equal to 0.78. An example of XPER decomposition is the following:
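
Schematically, the benchmark and the three feature contributions sum to the AUC. The numbers below are purely illustrative, chosen only to be consistent with the discussion that follows:

AUC = 𝜙₀ + 𝜙₁ + 𝜙₂ + 𝜙₃
0.78 = 0.50 + 0.14 + 0.09 + 0.05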

The first XPER value, 𝜙₀, is referred to as the benchmark and represents the performance the model would achieve if none of the three features provided any relevant information to predict the target variable. When the AUC is used to evaluate predictive performance, the benchmark corresponds to a random classification (an AUC of 0.50). As the AUC of the model is greater than 0.50, at least one feature contains useful information to predict the target variable. The difference between the AUC of the model and the benchmark represents the contribution of the features to the performance of the model, which can be decomposed with XPER values. In this example, the decomposition indicates that the first feature is the main driver of the predictive performance of the model, as it explains half of the difference between the AUC of the model and a random classification (𝜙₁), followed by the second feature (𝜙₂) and the third one (𝜙₃). These results thus measure the global effect of each feature on the predictive performance of the model and allow us to rank the features from the least important (the third feature) to the most important (the first one).

While the XPER framework can be used to conduct a global analysis of the model's predictive performance, it can also provide a local analysis at the instance level. At the local level, the XPER value corresponds to the contribution of a given instance and feature to the predictive performance of the model. The benchmark then represents the contribution of a given observation to the predictive performance if the target variable were independent of the features, and the difference between the individual contribution and the benchmark is explained by individual XPER values. Individual XPER values therefore allow us to understand why some observations contribute more to the predictive performance of a model than others, and they can be used to address heterogeneity issues by identifying groups of individuals for which features have similar effects on performance.
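
In the same spirit as the global decomposition, the contribution of a given observation i can be written, schematically, as its own benchmark plus one individual XPER value per feature:

contribution(i) = 𝜙₀(i) + 𝜙₁(i) + 𝜙₂(i) + 𝜙₃(i)

In the credit-scoring example of Section III, for instance, one observation has a benchmark of 0.33 and contributes 0.46 to the AUC of the model, and this 0.13 gap is split across the individual XPER values of its features.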

It is also important to note that XPER is both model- and metric-agnostic. This implies that XPER values can be used to interpret the predictive performance of any econometric or machine learning model, and to break down any performance metric, such as predictive accuracy measures (AUC, accuracy), statistical loss functions (MSE, MAE), or economic performance measures (profit-and-loss functions).

III – XPER in Python

01 — Download Library ⚙️

The XPER approach is implemented in Python through the XPER library. To compute XPER values, one first needs to install the XPER library as follows:

pip install XPER

02 — Import Library 📦

import XPER
import pandas as pd

03 — Load example dataset 💽

To illustrate how to use XPER values in Python, let us take a concrete example. Consider a classification problem whose main objective is to predict credit default. The dataset can be imported directly from the XPER library as follows:

from XPER.datasets.load_data import loan_status

# Load the loan dataset and keep the first six columns
loan = loan_status().iloc[:, :6]

display(loan.head())
display(loan.shape)

The primary goal of this dataset, given the included variables, appears to be building a predictive model to determine the “Loan_Status” of a potential borrower. In other words, we want to predict whether a loan application will be approved (“1”) or not (“0”) based on the information provided by the applicant.

# Remove 'Loan_Status' column from 'loan' dataframe and assign it to 'X'
X = loan.drop(columns='Loan_Status')

# Create a Series 'Y' containing the 'Loan_Status' target variable from the 'loan' dataframe
Y = pd.Series(loan['Loan_Status'])

04 — Estimate the Model ⚙️

Then, we need to train a predictive model and measure its performance in order to compute the associated XPER values. For illustration purposes, we split the initial dataset into a training set and a test set and fit an XGBoost classifier on the training set:

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
# X: input features
# Y: target variable
# test_size: the proportion of the dataset to include in the testing set (in this case, 15%)
# random_state: the seed value used by the random number generator for reproducible results
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=3)

import xgboost as xgb

# Create an XGBoost classifier object
gridXGBOOST = xgb.XGBClassifier(eval_metric="error")

# Train the XGBoost classifier on the training data
model = gridXGBOOST.fit(X_train, y_train)

05 — Evaluate Performance 🎯

The XPER library offers an intuitive and simple way to compute the predictive performance of a fitted model. Considering that the performance metric of interest is the Area Under the ROC Curve (AUC), it can be measured on the test set as follows:

from XPER.compute.Performance import ModelPerformance

# Define the evaluation metric(s) to be used
XPER = ModelPerformance(X_train.values,
                        y_train.values,
                        X_test.values,
                        y_test.values,
                        model)

# Evaluate the model performance using the specified metric(s)
PM = XPER.evaluate(["AUC"])

# Print the performance metrics
print("Performance Metrics: ", round(PM, 3))

06 — Calculate XPER values ⭐️

Finally, to explain the driving forces of the AUC, the XPER values can be computed as follows:

# Calculate XPER values for the model's performance
XPER_values = XPER.calculate_XPER_values(["AUC"],kernel=False)

The XPER_values object is a tuple containing two elements: the XPER values of the features and their individual (per-observation) XPER values.
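
Here is a minimal sketch of how the two elements can be unpacked and inspected; the ordering assumed below (global values first, then per-observation values) should be checked against the library documentation:

# Assumed ordering of the tuple returned by calculate_XPER_values
phi, phi_i = XPER_values

print(phi)    # global XPER values (feature contributions to the AUC)
print(phi_i)  # individual XPER values, one set per test observation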

For use cases with more than 10 feature variables, it is advised to use the default option kernel=True for computational efficiency ➡️

07 — Visualization 📊

from XPER.viz.Visualisation import visualizationClass as viz

labels = list(loan.drop(columns='Loan_Status').columns)

To analyze the driving forces at the global level, the XPER library proposes a bar plot representation of the XPER values.

viz.bar_plot(XPER_values=XPER_values, X_test=X_test, labels=labels, p=5, percentage=True)

For ease of presentation, feature contributions are expressed as a percentage of the spread between the AUC and its benchmark (0.50 for the AUC) and are ordered from largest to smallest. From this figure, we can see that more than 78% of the model's over-performance relative to a random predictor comes from Credit History, followed by Applicant Income, which contributes around 16% to the performance, and Co-applicant Income and Loan Amount Term, each accounting for less than 6%. On the other hand, the variable Loan Amount barely helps the model to better predict the probability of default, as its contribution is close to 0.
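
To make the link with the bar plot explicit, here is a small sketch of the arithmetic behind those percentages. The numbers are hypothetical, chosen only to mirror the figure, and are not the library's actual output:

import numpy as np

# Hypothetical global decomposition of the AUC: benchmark + one contribution per feature
phi_0 = 0.50                                         # benchmark (random classification)
phi = np.array([0.220, 0.045, 0.008, 0.006, 0.001])  # illustrative feature contributions
auc = phi_0 + phi.sum()                              # the decomposition sums back to the AUC (0.78 here)

# Share of the spread (AUC - benchmark) explained by each feature, as in the bar plot
shares = 100 * phi / (auc - phi_0)
print(np.round(shares, 1))  # -> [78.6 16.1  2.9  2.1  0.4]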

The XPER library also proposes graphical representations to analyze XPER values at the local level. First, a force plot can be used to analyze driving forces of the performance for a given observation:

viz.force_plot(XPER_values=XPER_values, instance=1, X_test=X_test, variable_name=labels, figsize=(16,4))

The preceding code plots the positive (in red) and negative (in blue) XPER values of observation #10, as well as the benchmark (0.33) and the contribution (0.46) of this observation to the AUC of the model. The over-performance of borrower #10 is due to the positive XPER values of Loan Amount Term, Applicant Income, and Credit History. On the other hand, Co-applicant Income and Loan Amount had a negative effect and decreased the contribution of this borrower.

We can see that while Applicant Income and Loan Amount have a positive effect on the AUC at the global level, these variables have a negative effect for borrower #10. Analyzing individual XPER values can thus identify groups of observations for which features have different effects on performance, potentially highlighting a heterogeneity issue.

Second, it is possible to represent the XPER values of each observation and feature on a single plot. For that purpose, one can rely on a beeswarm plot which represents the XPER values for each feature as a function of the feature value.

viz.beeswarn_plot(XPER_values=XPER_values,X_test=X_test,labels=labels)

On this figure, each dot represents an observation. The horizontal axis represents the contribution of each observation to the performance of the model, while the vertical axis represents the magnitude of feature values. Similarly to the bar plot shown previously, features are ordered from those that contribute the most to the performance of the model to those that contribute the least. However, with the beeswarm plot it is also possible to analyze the effect of feature values on XPER values. In this example, we can see that large values of Credit History are associated with relatively small contributions (in absolute value), whereas low values lead to larger contributions (in absolute value).
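
As a rough numeric counterpart to this visual reading, one can correlate each feature's values with its individual XPER values. This is only a sketch: it assumes the individual XPER values can be arranged as an array with one row per test observation and one column per feature, which should be verified against the library's output:

import numpy as np

# Assumed layout: one row per test observation, one column per feature
phi_i = np.asarray(XPER_values[1])

# Correlation between feature values and individual XPER values,
# a crude summary of the pattern the beeswarm plot shows visually
for j, name in enumerate(labels):
    corr = np.corrcoef(X_test.values[:, j], phi_i[:, j])[0, 1]
    print(f"{name}: corr(feature value, individual XPER) = {corr:.2f}")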

All images, unless otherwise stated, are by the author.

IV – Acknowledgements

The contributors to this library are:

V – References

[1] L. Shapley, A Value for n-Person Games (1953), Contributions to the Theory of Games, 2:307–317

[2] S. Lundberg, S. Lee, A unified approach to interpreting model predictions (2017), Advances in Neural Information Processing Systems

[3] S. Hué, C. Hurlin, C. Pérignon, S. Saurin, Measuring the Driving Forces of Predictive Performance: Application to Credit Scoring (2023), HEC Paris Research Paper No. FIN-2022–1463