Causal AI, exploring the integration of causal reasoning into machine learning
What is this series of articles about?
Welcome to my series on Causal AI, where we will explore the integration of causal reasoning into machine learning models. Expect to explore a number of practical applications across different business contexts.
In the last article we covered using Double Machine Learning and Linear Programming to optimise treatment strategies. This time we continue the optimisation theme, turning to non-linear treatment effects in Pricing & Promotions.
If you missed the last article on Double Machine Learning and Linear Programming, check it out here:
Introduction
This article will showcase how we can optimise non-linear treatment effects in pricing (the ideas can also be applied across marketing and other domains).
In this article I will help you understand:
- Why is it common to have non-linear treatment effects in pricing?
- What tools from our Causal AI toolbox are suitable for estimating non-linear treatment effects?
- How can non-linear programming be used to optimise pricing?
- A worked case study in Python showing how we can combine our Causal AI toolbox with non-linear programming to optimise pricing budgets.
The full notebook can be found here:
Why is it common to have non-linear treatment effects in pricing?
Diminishing returns
Let’s take the example of a retailer adjusting the price of a product. Initially lowering the price might lead to a significant increase in sales. However, as they continue to lower the price, the increase in sales may start to plateau. We call this diminishing returns. As illustrated below, the effect of diminishing returns is generally non-linear.
Diminishing returns can be observed across various fields beyond pricing. Some common examples are:
- Marketing — Increasing social media spend can increase customer acquisition, but as time goes on it becomes more difficult to target new, untapped audiences.
- Farming — Adding fertilizer to a field can increase crop yield significantly initially, but this effect will very quickly start to diminish.
- Manufacturing — Adding more workers to a production process will improve efficiencies, but each additional worker may contribute less to the overall output.
This makes me wonder: if diminishing returns are so common, which techniques from our Causal AI toolbox are capable of handling them?
What methods from our Causal AI toolbox are suitable for estimating non-linear treatment effects?
Toolbox
There are two key questions which we will ask to help us identify what methods from our Causal AI toolbox are suitable for our Pricing problem:
- Can it handle continuous treatments?
- Can it capture non-linear treatment effects?
Below we can see a summary of how suitable each method is:
- Propensity score matching (PSM) — Treatment needs to be binary ❌
- Inverse propensity score weighting (IPSW) — Treatment needs to be binary ❌
- T-Learner — Treatment needs to be binary ❌
- Double Machine Learning (DML) — Treatment effect is linear ❌
- Doubly-Robust Learner (DR) — Treatment needs to be binary ❌
- S-Learner — Can handle continuous treatments and non-linear relationships between the treatment and outcome if an appropriate machine learning algorithm (e.g. gradient boosting) is used 💚
S-Learner
The “S” in S-Learner comes from it being a “single model”. An arbitrary machine learning model is used to predict the outcome using the treatment, confounders and other covariates as features. This model is then used to estimate the difference between the potential outcomes under different treatment conditions (which gives us the treatment effect).
There are a number of benefits to the S-Learner:
- It can handle both binary and continuous treatments.
- It can use any machine learning algorithm, giving us the flexibility to capture non-linear relationships between the outcome and both the features and the treatment.
One word of caution: regularisation bias! Modern machine learning algorithms use regularisation to prevent overfitting, but this can be damaging in causal problems. Take the max features hyper-parameter from gradient boosting tree methods: when it is set below 1.0, only a random subset of features is considered at each split, so the treatment is likely to be left out of many splits (and potentially entire trees). This dampens the estimated effect of the treatment.
When using the S-Learner, I recommend thinking carefully about the regularisation parameters e.g. set max features to 1.0 (effectively switching off the feature regularisation).
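As a minimal sketch, assuming LightGBM (which we also use later in the case study), where the analogous feature-subsampling parameter is colsample_bytree:

from lightgbm import LGBMRegressor

# colsample_bytree=1.0 makes every tree consider every feature,
# so the treatment column can never be dropped by feature subsampling
s_learner = LGBMRegressor(colsample_bytree=1.0, random_state=42)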
How can non-linear programming be used to optimise pricing?
Price optimisation
Let’s say we have a number of products and we want to optimise their price given a set promotional budget. For each product we train an S-Learner (using gradient boosting) with the treatment set as discount level and the outcome set as total number of orders. Each S-Learner gives us a complex model that can be used to estimate the effect of different discount levels. But how can we optimise the discount level for each product?
Response Curves
Optimisation techniques such as linear (or even non-linear) programming rely on having a clear functional form of the response. Machine learning techniques like random forests and gradient boosting don’t give us this (unlike, say, linear regression). However, a response curve can translate the outputs of an S-Learner into a comprehensible form, showing how the outcome responds to the treatment.
If you can’t quite picture how we can create a response curve yet, don’t worry, we will cover this in the Python case study!
Michaelis-Menten equation
There are several equations we could use to map the S-Learner outputs to a response curve. One of them is the Michaelis-Menten equation.
The Michaelis-Menten equation is commonly used in enzyme kinetics (the study of the rates at which enzymes catalyse chemical reactions) to describe the rate of enzymatic reactions:
v = Vmax * S / (Km + S)
- v — is the reaction velocity (this is our transformed response, so total number of orders in our pricing example)
- Vmax — is the maximum reaction velocity (we will call this alpha, a parameter we need to learn)
- Km — is the Michaelis constant (we will call this lambda, a parameter we need to learn)
- S — is the substrate concentration (this is our treatment, so discount level in our pricing example)
Its principles can also be applied to other fields, especially when dealing with systems where increasing the input does not proportionally increase the output due to saturation effects. Below we visualise how different values of alpha and lambda affect the curve:
def michaelis_menten(x, alpha, lam):
    # alpha is the maximum effect; lam is the treatment level at which half of alpha is reached
    return alpha * x / (lam + x)
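A minimal plotting sketch, assuming matplotlib (the alpha and lambda values below are illustrative, not the ones learnt later in the case study):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10000, 500)
for alpha, lam in [(1, 2000), (2, 2000), (2, 5000)]:
    plt.plot(x, michaelis_menten(x, alpha, lam), label=f'alpha={alpha}, lam={lam}')
plt.xlabel('Treatment')
plt.ylabel('Response')
plt.legend()
plt.show()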
Once we have our response curves, we can think about optimisation. The Michaelis-Menten equation gives us a non-linear function, so non-linear programming is an appropriate choice.
Non-linear programming
We covered linear programming in my last article. Non-linear programming is similar, but the objective function and/or constraints are non-linear in nature.
Sequential Least Squares Programming (SLSQP) is an algorithm used for solving non-linear programming problems. It allows for both equality and inequality constraints making it a sensible choice for our use case.
- Equality constraints e.g. Total promotional budget is equal to £100k
- Inequality constraints e.g. Discount on each product between £1 and £10
SciPy has an easy-to-use implementation of SLSQP via scipy.optimize.minimize:
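To give a feel for the interface, here is a minimal toy example (the objective and constraint below are purely illustrative, not our pricing problem):

from scipy.optimize import minimize
import numpy as np

# Toy problem: minimise x^2 + y^2 subject to x + y = 1, with 0 <= x, y <= 1
result = minimize(
    lambda x: np.sum(x ** 2),
    x0=[0.9, 0.1],
    method="SLSQP",
    bounds=[(0, 1), (0, 1)],
    constraints={"type": "eq", "fun": lambda x: np.sum(x) - 1},
)
print(result.x)  # approximately [0.5, 0.5]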
Next we will illustrate how powerful the combination of the S-Learner, the Michaelis-Menten equation and non-linear programming can be!
Case study
Background
Historically, the promotions team has used their expert judgement to set the discount for their top 3 products. Given the current economic conditions, they are being forced to reduce their overall promotional budget by 20%. They turn to the Data Science team to advise on how they can do this whilst minimising the loss in orders placed.
Data generating process
We set up a data generating process with the following characteristics:
- 4 features with a complex relationship with the number of orders
- A treatment effect which follows the Michaelis-Menten equation
import numpy as np

def data_generator(n, tau_weight, alpha, lam):
    # Set number of features
    p = 4
    # Create features
    X = np.random.uniform(size=n * p).reshape((n, -1))
    # Nuisance parameters
    b = (
        np.sin(np.pi * X[:, 0])
        + 2 * (X[:, 1] - 0.5) ** 2
        + X[:, 2] * X[:, 3]
    )
    # Create treatment and treatment effect
    T = np.linspace(200, 10000, n)
    T_mm = michaelis_menten(T, alpha, lam) * tau_weight
    tau = T_mm / T
    # Calculate outcome
    y = b + T * tau + np.random.normal(size=n) * 0.5
    y_train = y
    X_train = np.hstack((X, T.reshape(-1, 1)))
    return y_train, X_train, T_mm, tau
The X features are confounding variables:
We use the data generator to create samples for 3 products, each with a different treatment effect:
np.random.seed(1234)
n = 100000
y_train_1, X_train_1, T_mm_1, tau_1 = data_generator(n, 1.00, 2, 5000)
y_train_2, X_train_2, T_mm_2, tau_2 = data_generator(n, 0.25, 2, 5000)
y_train_3, X_train_3, T_mm_3, tau_3 = data_generator(n, 2.00, 2, 5000)
S-Learner
We can train an S-Learner by using any machine learning algorithm and including the treatment and covariates as features:
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error, r2_score

def train_slearner(X_train, y_train):
    # Single model with the treatment included as a feature
    model = LGBMRegressor(random_state=42)
    model.fit(X_train, y_train)
    yhat_train = model.predict(X_train)
    mse_train = mean_squared_error(y_train, yhat_train)
    r2_train = r2_score(y_train, yhat_train)
    print(f'MSE on train set is {round(mse_train)}')
    print(f'R2 on train set is {round(r2_train, 2)}')
    return model, yhat_train
We train an S-Learner for each product:
np.random.seed(1234)
model_1, yhat_train_1 = train_slearner(X_train_1, y_train_1)
model_2, yhat_train_2 = train_slearner(X_train_2, y_train_2)
model_3, yhat_train_3 = train_slearner(X_train_3, y_train_3)
At the moment this is just a prediction model. Below we visualise how well it does at this job:
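A minimal sketch of such a check, assuming matplotlib (shown for product 1; the styling choices are illustrative):

import matplotlib.pyplot as plt

# Actual vs predicted orders on the training set
plt.scatter(y_train_1, yhat_train_1, s=1, alpha=0.3)
plt.xlabel('Actual orders')
plt.ylabel('Predicted orders')
plt.title('Product 1')
plt.show()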
Extracting the treatment effects
Next we will use our S-Learner to extract the treatment effect for the full range of treatment values (discount amounts), whilst holding the other features at their mean values.
We start by extracting the expected outcome (number of orders) for the full range of treatment values:
import pandas as pd

def extract_treated_effect(n, X_train, model):
    # Set features to their mean value
    X_mean_mapping = {'X1': [X_train[:, 0].mean()] * n,
                      'X2': [X_train[:, 1].mean()] * n,
                      'X3': [X_train[:, 2].mean()] * n,
                      'X4': [X_train[:, 3].mean()] * n}
    # Create DataFrame
    df_scoring = pd.DataFrame(X_mean_mapping)
    # Add the full range of treatment values
    df_scoring['T'] = X_train[:, 4]
    # Calculate outcome prediction for the treated
    treated = model.predict(df_scoring)
    return treated, df_scoring
We do this for each product:
treated_1, df_scoring_1 = extract_treated_effect(n, X_train_1, model_1)
treated_2, df_scoring_2 = extract_treated_effect(n, X_train_2, model_2)
treated_3, df_scoring_3 = extract_treated_effect(n, X_train_3, model_3)
We then extract the expected outcome (number of orders) when the treatment is set to 0:
def extract_untreated_effect(n, X_train, model):
    # Set features to their mean value and the treatment to 0
    X_mean_mapping = {'X1': [X_train[:, 0].mean()] * n,
                      'X2': [X_train[:, 1].mean()] * n,
                      'X3': [X_train[:, 2].mean()] * n,
                      'X4': [X_train[:, 3].mean()] * n,
                      'T': [0] * n}
    # Create DataFrame
    df_scoring = pd.DataFrame(X_mean_mapping)
    # Calculate outcome prediction for the untreated
    untreated = model.predict(df_scoring)
    return untreated
Again, we do this for each product:
untreated_1 = extract_untreated_effect(n, X_train_1, model_1)
untreated_2 = extract_untreated_effect(n, X_train_2, model_2)
untreated_3 = extract_untreated_effect(n, X_train_3, model_3)
We can now calculate the treatment effect for the full range of treatment values:
treatment_effect_1 = treated_1 - untreated_1
treatment_effect_2 = treated_2 - untreated_2
treatment_effect_3 = treated_3 - untreated_3
When we compare this to the actual treatment effect saved from our data generator, we can see that the S-Learner is very effective at estimating the treatment effect across the full range of treatment values:
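A minimal sketch of this comparison for product 1, assuming matplotlib (T_mm_1 is the true treatment effect saved from the data generator):

import matplotlib.pyplot as plt

# True effect from the data generating process vs the S-Learner estimate
plt.plot(df_scoring_1['T'], T_mm_1, label='Actual treatment effect')
plt.plot(df_scoring_1['T'], treatment_effect_1, label='S-Learner estimate', alpha=0.7)
plt.xlabel('Discount level (T)')
plt.ylabel('Treatment effect on orders')
plt.legend()
plt.show()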
Now we have this treatment effect data, we can use it to build response curves for each product.
Michaelis-Menten
To build the response curves, we need a curve-fitting tool. SciPy’s curve_fit is a great implementation which we will use:
We start by setting up the function that we want to learn:
def michaelis_menten(x, alpha, lam):
return alpha * x / (lam + x)
We can then use curve_fit to learn the alpha and lambda parameters:
from scipy.optimize import curve_fit

def response_curves(treatment_effect, df_scoring):
    # Initial estimates for the parameters to be learnt
    maxfev = 100000
    lam_initial_estimate = 0.001
    alpha_initial_estimate = max(treatment_effect)
    initial_guess = [alpha_initial_estimate, lam_initial_estimate]
    # Fit the Michaelis-Menten equation to the S-Learner treatment effects
    popt, pcov = curve_fit(michaelis_menten, df_scoring['T'], treatment_effect, p0=initial_guess, maxfev=maxfev)
    return popt, pcov
We do this for each product:
popt_1, pcov_1 = response_curves(treatment_effect_1, df_scoring_1)
popt_2, pcov_2 = response_curves(treatment_effect_2, df_scoring_2)
popt_3, pcov_3 = response_curves(treatment_effect_3, df_scoring_3)
We can now feed the learnt parameters into the Michaelis-Menten function to help us visualise how well the curve fitting did:
treatment_effect_curve_1 = michaelis_menten(df_scoring_1['T'], popt_1[0], popt_1[1])
treatment_effect_curve_2 = michaelis_menten(df_scoring_2['T'], popt_2[0], popt_2[1])
treatment_effect_curve_3 = michaelis_menten(df_scoring_3['T'], popt_3[0], popt_3[1])
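A compact way to sketch this overlay for product 1, again assuming matplotlib:

import matplotlib.pyplot as plt

# Compare the S-Learner's estimates against the fitted Michaelis-Menten curve
plt.plot(df_scoring_1['T'], treatment_effect_1, label='S-Learner estimate', alpha=0.5)
plt.plot(df_scoring_1['T'], treatment_effect_curve_1, label='Fitted response curve')
plt.xlabel('Discount level (T)')
plt.ylabel('Treatment effect on orders')
plt.legend()
plt.show()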
We can see that the curve fitting did a great job!
Now we have the alpha and lambda parameters for each product, we can start thinking about the non-linear optimisation…
Non-linear programming
We start by collating all the required information for the optimisation:
- A list of all the products
- The total promotional budget
- The budget ranges for each product
- The alpha and lambda parameters for each product from the Michaelis-Menten response curves
# List of products
products = ["product_1", "product_2", "product_3"]

# Set total budget to be the sum of the mean of each product reduced by 20%
total_budget = (df_scoring_1['T'].mean() + df_scoring_2['T'].mean() + df_scoring_3['T'].mean()) * 0.80

# Dictionary with min and max bounds for each product - set as -20% of the min and +20% of the max discount
budget_ranges = {"product_1": [df_scoring_1['T'].min() * 0.80, df_scoring_1['T'].max() * 1.2],
                 "product_2": [df_scoring_2['T'].min() * 0.80, df_scoring_2['T'].max() * 1.2],
                 "product_3": [df_scoring_3['T'].min() * 0.80, df_scoring_3['T'].max() * 1.2]}
# Dictionary with response curve parameters
parameters = {"product_1": [popt_1[0], popt_1[1]],
"product_2": [popt_2[0], popt_2[1]],
"product_3": [popt_3[0], popt_3[1]]}
Next we set up the objective function. We want to maximise orders, but as we are going to use a minimisation method, we return the negative of the expected sum of orders.
def objective_function(x, products, parameters):
    sum_orders = 0.0
    # Unpack the learnt parameters for each product and calculate expected orders
    for product, budget in zip(products, x, strict=False):
        alpha, lam = parameters[product]
        sum_orders += michaelis_menten(budget, alpha, lam)
    # Return the negative as we will be minimising
    return -1 * sum_orders
Finally we can run our optimisation to determine the optimal budget to allocate to each product:
from scipy.optimize import minimize

# Set initial guess by equally sharing out the total budget
initial_guess = [total_budget // len(products)] * len(products)

# Set the lower and upper bounds for each product
bounds = [budget_ranges[product] for product in products]
# Set the equality constraint - constraining the total budget
constraints = {"type": "eq", "fun": lambda x: np.sum(x) - total_budget}
# Run optimisation
result = minimize(
lambda x: objective_function(x, products, parameters),
initial_guess,
method="SLSQP",
bounds=bounds,
constraints=constraints,
options={'disp': True, 'maxiter': 1000, 'ftol': 1e-9},
)
# Extract results
optimal_treatment = {product: budget for product, budget in zip(products, result.x, strict=False)}
print(f'Optimal promo budget allocations: {optimal_treatment}')
print(f'Optimal orders: {round(result.fun * -1, 2)}')
The output shows us what the optimal promotional budget is for each product:
If you closely inspect the response curves, you will see that the optimisation results are intuitive:
- Small decrease in the budget for product 1
- Decrease the budget for product 2 significantly
- Increase the budget for product 3 significantly
Closing thoughts
Today we covered the powerful combination of the S-Learner, the Michaelis-Menten equation and non-linear programming! Here are some closing thoughts:
- As mentioned earlier, when using the S-Learner beware of regularisation bias!
- I chose the Michaelis-Menten equation to build my response curves. However, it may not fit your problem and can be replaced by other transformations which are more suitable.
- Using SLSQP to solve non-linear programming problems gives you the flexibility to use both equality and inequality constraints.
- I’ve chosen to focus on Pricing & Promotions, but this framework can be extended to Marketing budgets.
Follow me if you want to continue this journey into Causal AI. In the next article we will explore how combining Causal Graphs and Shapley values is the key to explainable AI.