Improving Business Performance with Machine Learning

Whether you are a data scientist, a data analyst, or a business analyst, your goal is to deliver projects that improve business performance.


It might be tempting to focus on the latest machine learning developments or tackle the big problems. However, you can often deliver great value by picking low-hanging fruit with simple machine learning algorithms.

Benchmarking is one of those low-hanging fruits. It is the process of measuring business KPIs against similar organizations. It allows businesses to learn from the best and continuously improve performance.

There are two types of benchmarking:

1. Internal: measure KPIs against other units or products in the same company
2. External: measure KPIs against competitors

In my daily work in the hotel industry, we often rely on third-party companies that collect hotel data for external benchmarking. However, the data we get from them is limited. On the other hand, we manage over 500 hotels and are sitting on vast amounts of data for potential benchmarking.

This is the low-hanging fruit we set out to solve recently.

No matter which type of benchmarking exercise you are conducting, the first step is to select a set of hotels similar to the subject hotel. In the hotel industry, we usually rely on location indicators, brand tier, number of rooms, price range, and market demand. We typically do this manually when working with one or two hotels, but doing it by hand for more than 500 hotels is not feasible.

Once you have a problem to solve, the next step is to select which tool to use. Machine learning offers many tools; however, this problem can be solved with a simple family of algorithms: Nearest Neighbors.

The Nearest Neighbors algorithm family

The nearest neighbors algorithm family is a form of optimization problem that aims to find the points in a given data set that are the closest or most similar to a given point.

These algorithms have been very successful in tackling many classification and regression problems. As such, scikit-learn has a fantastic Nearest Neighbors module.

Choosing the right algorithm

Most people are familiar with K-Nearest Neighbors (KNN); however, scikit-learn offers a wide variety of Nearest Neighbors algorithms, covering both supervised and unsupervised tasks.

For our problem, we don’t have any labels. Therefore, we are looking for an unsupervised algorithm.

If you look through the scikit-learn documentation, you will find NearestNeighbors, an algorithm that performs unsupervised learning for implementing neighbor searches.

This seems to cover what we need to solve our problem. Let’s start by getting the data ready and running a baseline model.

Baseline Model

1. Loading the data

A hotel’s performance usually depends on location, brand, and size. For our analysis, we use two data sets:

Hotel data: The hotel data used below has been generated artificially based on the original dataset used for this analysis.

  • BRAND: defines the service level of the hotel (Luxury, Upscale, Economy)
  • Room_count: number of rooms available for sale
  • Market: name of the city in which the hotel is located
  • Country: name of the country
  • Latitude: hotel’s latitude
  • Longitude: hotel’s longitude
  • Airport Code: 3-letter code of the nearest international airport
  • Market Tier: defines the market development level
  • HCLASS: indicates whether the hotel is a city hotel or a resort
  • Demand: indicates the hotel’s yearly occupancy
  • Price range: indicates the hotel’s average price

We also know that hotel performance can be affected by accessibility. To capture accessibility, we measure how far the hotel is from the main international airport. The airport data is from the World Bank: https://datacatalog.worldbank.org/search/dataset/0038117

  • Orig: 3-letter airport code
  • Name: airport name
  • TotalSeats: annual passenger volume
  • Country Name: airport country name
  • Airport1Latitude: airport latitude
  • Airport1Longitude: airport longitude

*Global Airports dataset is licensed under Creative Commons Attribution 4.0

Let’s import the data.

import pandas as pd
import numpy as np

data = pd.read_excel("mock_data.xlsx")
airport_data = pd.read_csv("airport_volume_airport_locations.csv")

Sample of Hotel data. Image by author
Sample Airport data. Image by author

As mentioned before, hotel performance is highly dependent on location. In our data set, we have several measures of location, such as Market and Country; however, these are not always ideal, as the definitions are quite broad. To narrow down similar locations, we need to create an accessibility measure, defined as the distance to the closest international airport.

To calculate the distance from a hotel to the airport, we use the haversine formula. The haversine formula calculates the distance between two points on a sphere, given their latitude and longitude.

# The code below is adapted from GeeksforGeeks
from math import radians, cos, sin, asin, sqrt

def distance_to_airport(lat, airport_lat, lon, airport_lon):

    # Convert latitude and longitude values from decimal degrees to radians
    lon = radians(lon)
    airport_lon = radians(airport_lon)
    lat = radians(lat)
    airport_lat = radians(airport_lat)

    # Haversine formula
    dlon = airport_lon - lon
    dlat = airport_lat - lat
    a = sin(dlat / 2)**2 + cos(lat) * cos(airport_lat) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))

    # Radius of the earth in kilometers
    r = 6371

    # Return the distance in km
    return c * r

# Apply the distance_to_airport function to each hotel
data["distance_to_airport"] = data.apply(
    lambda row: distance_to_airport(row["Latitude"], row["Airport1Latitude"],
                                    row["Longitude"], row["Airport1Longitude"]),
    axis=1)
data.head()

Resulting data frame with distance to airport feature. Image by author

The next step is removing any column we won’t need for our model.

# Drop the columns that we don't need
# For the purpose of benchmarking, we keep the hotel features and the distance to the airport
col_to_drop = ["Latitude", "Longitude", "Airport Code", "Orig", "Name", "TotalSeats",
               "Country Name", "Airport1Latitude", "Airport1Longitude"]

data_clean = data.drop(col_to_drop, axis=1)
data_clean.head()

Next, we encode all non-numerical variables so that we can pass them into our model. At this point, it is important to keep in mind that we will need the original labels to present our suggested groupings to the team and for ease of validation. To do so, we will store the encoding information in a dictionary.

from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object for each object column
brand_encoder = LabelEncoder()
market_encoder = LabelEncoder()
country_encoder = LabelEncoder()
market_tier_encoder = LabelEncoder()
hclass_encoder = LabelEncoder()

# Fit each LabelEncoder on the unique values of the corresponding column
data_clean['BRAND'] = brand_encoder.fit_transform(data_clean['BRAND'])
data_clean['Market'] = market_encoder.fit_transform(data_clean['Market'])
data_clean['Country'] = country_encoder.fit_transform(data_clean['Country'])
data_clean['Market Tier'] = market_tier_encoder.fit_transform(data_clean['Market Tier'])
data_clean['HCLASS']= hclass_encoder.fit_transform(data_clean['HCLASS'])

# Create a dictionary with all the encoders for reverse encoding
encoders = {"BRAND": brand_encoder,
            "Market": market_encoder,
            "Country": country_encoder,
            "Market Tier": market_tier_encoder,
            "HCLASS": hclass_encoder}

data_clean.head()

Encoded data. Image by author

Our data is now numerical, but as you can see, the values in each column have very different ranges. To prevent any one feature’s range from disproportionately affecting our model, we need to rescale the data.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_clean)
data_scaled
Scaled data. Image by author

At this point, we are ready to generate a baseline model.

from sklearn.neighbors import NearestNeighbors

nns = NearestNeighbors()
nns.fit(data_scaled)
nns_results_model_0 = nns.kneighbors(data_scaled)[1]

nns_results_model_0

Model output. Image by author

The output of the model is an array of indices, where the first index in each row is the subject hotel and the remaining indices represent its nearest hotels.
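
As a quick sanity check, note that kneighbors returns a (distances, indices) tuple (above we kept only the indices), and each row of indices can be mapped straight back to rows of our data frame. A minimal sketch using the objects we already built:

# kneighbors returns a (distances, indices) tuple; we kept only the indices above
distances, indices = nns.kneighbors(data_scaled)

# The benchmark set for the first hotel: the hotel itself plus its nearest peers
print(indices[0])
data_clean.iloc[indices[0]]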

To validate the model, we can visually inspect the results. We can create a function that takes in the list of indexes and decodes the values.

def clean_results(nns_results: np.ndarray,
                  encoders: dict,
                  data: pd.DataFrame):
    """
    Return a DataFrame with a list of labels for each Nearest Neighbors group.
    """
    result = pd.DataFrame()

    # 1. Get a list of nearest hotels based on our model
    for i in range(len(nns_results)):

        results = {}  # empty dictionary to collect each row's values

        # Each row in nns_results contains the indices of the selected nearest neighbors
        # We use those indices to get the hotel names in our main data set
        results["Hotels"] = list(data.iloc[nns_results[i]].index)

        # 2. Get the values of each feature for all Nearest Neighbors groups
        for item in data.columns:
            results[item] = list(data.iloc[nns_results[i]][item])

        # 3. Create a row for each Nearest Neighbors group and append it to the main DataFrame
        df = pd.DataFrame([results])
        result = pd.concat([result, df], axis=0)

    # 4. Decode the labels of the encoded columns
    for key, val in encoders.items():
        result[key] = result[key].apply(lambda x: list(val.inverse_transform(x)))

    result.reset_index(drop=True, inplace=True)  # Reset the index for clarity
    return result

results_model_0 = clean_results(nns_results=nns_results_model_0,
                                encoders=encoders,
                                data=data_clean)
results_model_0.head()

Initial benchmark groups. Image by author

Because we are using an unsupervised learning algorithm, there is no widely available measure of accuracy. However, we can use domain knowledge to validate our groups.

Visually inspecting the groups, we can see that some benchmark sets mix Economy and Luxury hotels, which doesn’t make business sense, as demand for these hotels is fundamentally different.

We can scroll through the data and note some of those differences, but can we come up with our own accuracy measure?

We want to create a function that measures the consistency of the recommended benchmark sets across each feature. One way of doing this is by calculating the variance of each feature within each set. For each cluster, we can compute the average of the feature variances, and we can then average the cluster variances to get a total model score.

From our domain knowledge, we know that in order to set up a comparable benchmark set, we need to prioritize hotels of the same brand, possibly in the same market and the same country, and if we use different markets or countries, then the market tier should be the same.

With that in mind, we want our measure to have a higher penalty for variance in those features. To do so, we will use a weighted average to calculate each benchmark set variance. We will also print the variance of the key features and secondary features separately.

To sum up, to create our accuracy measure, we need to:

  1. Calculate variance for categorical variables: One common approach is to use an “entropy-based” measure, where higher diversity in categories indicates higher entropy (variance).
  2. Calculate variance for numerical variables: we can compute the standard deviation or the range (difference between maximum and minimum values). This measures the spread of numerical data within each cluster.
  3. Normalize the data: normalize the variance scores for each category before applying weights to ensure that no single feature dominates the weighted average due to scale differences alone.
  4. Apply weights for different metrics: Weight each type of variance based on its importance to the clustering logic.
  5. Calculate weighted averages: Compute the weighted average of these variance scores for each cluster.
  6. Aggregate scores across clusters: The total score is the average of these weighted variance scores across all clusters. A lower average score indicates that our model effectively groups similar hotels together, minimizing intra-cluster variance.

from scipy.stats import entropy
from sklearn.preprocessing import MinMaxScaler
from collections import Counter

def categorical_variance(data):
    """
    Calculate the entropy of a categorical variable from a list.
    A higher entropy value indicates a more diverse set of classes.
    A lower entropy value indicates a more homogeneous subset of data.
    """
    # Count the frequency of each unique value
    value_counts = Counter(data)
    total_count = sum(value_counts.values())
    probabilities = [count / total_count for count in value_counts.values()]
    return entropy(probabilities)

# Set the scoring weights, giving higher weights to the most important features
scoring_weights = {"BRAND": 0.3,
                   "Room_count": 0.025,
                   "Market": 0.25,
                   "Country": 0.15,
                   "Market Tier": 0.15,
                   "HCLASS": 0.05,
                   "Demand": 0.025,
                   "Price range": 0.025,
                   "distance_to_airport": 0.025}

def calculate_weighted_variance(df, weights):
    """
    Calculate the weighted variance score for the clusters in the dataset.
    """
    # Initialize a DataFrame to store the variances
    variance_df = pd.DataFrame()

    # 1. Calculate variances for numerical features
    numerical_features = ['Room_count', 'Demand', 'Price range', 'distance_to_airport']
    for feature in numerical_features:
        variance_df[feature] = df[feature].apply(np.var)

    # 2. Calculate entropy for categorical features
    categorical_features = ['BRAND', 'Market', 'Country', 'Market Tier', 'HCLASS']
    for feature in categorical_features:
        variance_df[feature] = df[feature].apply(categorical_variance)

    # 3. Normalize the variance and entropy values
    scaler = MinMaxScaler()
    normalized_variances = pd.DataFrame(scaler.fit_transform(variance_df),
                                        columns=variance_df.columns,
                                        index=variance_df.index)

    # 4. Compute the weighted averages
    cat_weights = {feature: weights[feature] for feature in categorical_features}
    num_weights = {feature: weights[feature] for feature in numerical_features}

    cat_weighted_scores = normalized_variances[categorical_features].mul(cat_weights)
    df['cat_weighted_variance_score'] = cat_weighted_scores.sum(axis=1)

    num_weighted_scores = normalized_variances[numerical_features].mul(num_weights)
    df['num_weighted_variance_score'] = num_weighted_scores.sum(axis=1)

    return df['cat_weighted_variance_score'].mean(), df['num_weighted_variance_score'].mean()

To keep our code clean and track our experiments, let’s also define a function to store the results of our experiments.

# Define a function to store the results of our experiments
def model_score(data: pd.DataFrame,
                weights: dict = scoring_weights,
                model_name: str = "model_0"):
    cat_score, num_score = calculate_weighted_variance(data, weights)
    results = {"Model": model_name,
               "Primary features score": cat_score,
               "Secondary features score": num_score}
    return results

model_0_score = model_score(results_model_0, scoring_weights)
model_0_score

Baseline model results.

Now that we have a baseline, let’s see if we can improve our model.

Improving our Model Through Experimentation

Up until now, we did not have to know what was going on under the hood when we ran this code:

nns = NearestNeighbors()
nns.fit(data_scaled)
nns_results_model_0 = nns.kneighbors(data_scaled)[1]

To improve our model, we will need to understand the model parameters and how we can interact with them to get better benchmark sets.

Let’s start by looking at the scikit-learn documentation and source code:

# The below is taken directly from the scikit-learn source

from sklearn.neighbors._base import KNeighborsMixin, NeighborsBase, RadiusNeighborsMixin

class NearestNeighbors_(KNeighborsMixin, RadiusNeighborsMixin, NeighborsBase):
    """Unsupervised learner for implementing neighbor searches.

    Parameters
    ----------
    n_neighbors : int, default=5
        Number of neighbors to use by default for :meth:`kneighbors` queries.

    radius : float, default=1.0
        Range of parameter space to use by default for :meth:`radius_neighbors`
        queries.

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
        Algorithm used to compute the nearest neighbors:

        - 'ball_tree' will use :class:`BallTree`
        - 'kd_tree' will use :class:`KDTree`
        - 'brute' will use a brute-force search.
        - 'auto' will attempt to decide the most appropriate algorithm
          based on the values passed to :meth:`fit` method.

        Note: fitting on sparse input will override the setting of
        this parameter, using brute force.

    leaf_size : int, default=30
        Leaf size passed to BallTree or KDTree. This can affect the
        speed of the construction and query, as well as the memory
        required to store the tree. The optimal value depends on the
        nature of the problem.

    metric : str or callable, default='minkowski'
        Metric to use for distance computation. Default is "minkowski", which
        results in the standard Euclidean distance when p = 2. See the
        documentation of `scipy.spatial.distance
        <https://docs.scipy.org/doc/scipy/reference/spatial.distance.html>`_ and
        the metrics listed in
        :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric
        values.

    p : float (positive), default=2
        Parameter for the Minkowski metric from
        sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
        equivalent to using manhattan_distance (l1), and euclidean_distance
        (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

    metric_params : dict, default=None
        Additional keyword arguments for the metric function.
    """

    def __init__(
        self,
        *,
        n_neighbors=5,
        radius=1.0,
        algorithm="auto",
        leaf_size=30,
        metric="minkowski",
        p=2,
        metric_params=None,
        n_jobs=None,
    ):
        super().__init__(
            n_neighbors=n_neighbors,
            radius=radius,
            algorithm=algorithm,
            leaf_size=leaf_size,
            metric=metric,
            p=p,
            metric_params=metric_params,
            n_jobs=n_jobs,
        )

There are quite a few things going on here.

The NearestNeighbors class inherits from NeighborsBase, which is the base class for nearest neighbor estimators. This class handles the common functionality required for nearest-neighbor searches, such as:

  • n_neighbors (the number of neighbors to use)
  • radius (the radius for radius-based neighbor searches)
  • algorithm (the algorithm used to compute the nearest neighbors, such as ‘ball_tree’, ‘kd_tree’, or ‘brute’)
  • metric (the distance metric to use)
  • metric_params (additional keyword arguments for the metric function)

The NearestNeighbors class also inherits from the KNeighborsMixin and RadiusNeighborsMixin classes. These mixin classes add specific neighbor-search functionality to NearestNeighbors:

  • KNeighborsMixin provides the functionality to find a fixed number k of nearest neighbors to a point. It does that by finding the distances to the neighbors and their indices, and by constructing a graph of connections between points based on the k nearest neighbors of each point.
  • RadiusNeighborsMixin is based on the radius neighbors algorithm, which finds all neighbors within a given radius of a point. This method is useful in scenarios where the focus is on capturing all points within a meaningful distance threshold rather than a fixed number of points.
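
To make the distinction concrete, here is a minimal sketch of the two query styles these mixins expose on a fitted NearestNeighbors estimator. The points are toy values chosen purely for illustration, not our hotel data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D points, purely for illustration
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])

nn = NearestNeighbors(n_neighbors=2, radius=1.5).fit(X)

# KNeighborsMixin: a fixed number of neighbors for each query point
distances, indices = nn.kneighbors(X)

# RadiusNeighborsMixin: every neighbor within a distance threshold
r_distances, r_indices = nn.radius_neighbors(X)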

Based on our scenario, KNeighborsMixin provides the functionality we need.

We need to understand one key parameter before we can improve our model: the distance metric.

A quick introduction to distance

The documentation mentions that the NearestNeighbors algorithm uses the “Minkowski” distance by default and points us to the SciPy API.

In scipy.spatial.distance, we can see two mathematical representations of “Minkowski” distance:

∥u − v∥_p = ( Σ_i |u_i − v_i|^p )^(1/p)

This formula calculates the p-th root of the sum of powered differences across all elements.

The second mathematical representation of “Minkowski” distance is:

∥u − v∥_p = ( Σ_i w_i |u_i − v_i|^p )^(1/p)

This is very similar to the first one, but it introduces weights w_i for the differences, emphasizing or de-emphasizing specific dimensions. This is useful when certain features are more relevant than others. By default, the setting is None, which gives all features the same weight of 1.0.

This is a great option for improving our model as it allows us to pass domain knowledge to our model and emphasize similarities that are most relevant to users.
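
To see what the weights do in practice, here is a small sketch using SciPy’s minkowski function; its w argument corresponds to the w_i in the second formula. The vectors and weights below are made-up values for illustration only:

import numpy as np
from scipy.spatial.distance import minkowski

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 3.0])

# Unweighted Minkowski distance with p=2 (standard Euclidean): sqrt(1 + 4 + 0) ≈ 2.236
print(minkowski(u, v, p=2))

# Weighted version: the first dimension counts four times as much: sqrt(4*1 + 4 + 0) ≈ 2.828
print(minkowski(u, v, p=2, w=[4.0, 1.0, 1.0]))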

If we look at the formulas, we see the parameter p. This parameter affects the “path” the algorithm takes to calculate the distance. By default, p = 2, which represents the Euclidean distance.

You can think of the Euclidean distance as the distance obtained by drawing a straight line between two points. This is usually the shortest distance; however, it is not always the most desirable way of calculating the distance, especially in higher-dimensional spaces. For more information on why this is the case, see this great paper: https://bib.dbvis.de/uploadedFiles/155.pdf

Another common value for p is 1, which represents the Manhattan distance. You can think of it as the distance between two points measured along a grid-like path.

On the other hand, if we increase p towards infinity, we end up with the Chebyshev distance, defined as the maximum absolute difference between any corresponding elements of the vectors. It essentially measures the worst-case difference, making it useful in scenarios where you want to ensure that no single feature varies too much.
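
To make the difference between these metrics tangible, here is a quick sketch comparing the three distances on the same pair of vectors, using SciPy’s euclidean, cityblock (Manhattan), and chebyshev functions. The points are arbitrary toy values, not taken from our hotel data:

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, chebyshev

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

print(euclidean(u, v))  # p = 2: straight-line distance, 5.0
print(cityblock(u, v))  # p = 1: Manhattan / grid path, 3 + 4 = 7.0
print(chebyshev(u, v))  # p -> infinity: largest single coordinate difference, 4.0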

By reading and getting familiar with the documentation, we have uncovered a few possible options to improve our model.

Experiment 1: Baseline model with n_neighbors = 4

By default, n_neighbors is 5; however, for our benchmark set, we want to compare each hotel to the 3 most similar hotels. To do so, we need to set n_neighbors = 4 (the subject hotel + 3 peers).

nns_1 = NearestNeighbors(n_neighbors=4)
nns_1.fit(data_scaled)
nns_1_results_model_1 = nns_1.kneighbors(data_scaled)[1]

results_model_1 = clean_results(nns_results=nns_1_results_model_1,
                                encoders=encoders,
                                data=data_clean)
model_1_score = model_score(results_model_1, scoring_weights, model_name="baseline_k_4")
model_1_score
Slight improvement in our primary features. Image by author

Experiment 2: adding weights

Based on the documentation, we can pass weights to the distance calculation to emphasize the relationship across some features. Based on our domain knowledge, we have identified the features we want to emphasize, in this case, Brand, Market, Country, and Market Tier.

# Set up the weights for the distance calculation
weights_dict = {"BRAND": 5,
                "Room_count": 2,
                "Market": 4,
                "Country": 3,
                "Market Tier": 3,
                "HCLASS": 1.5,
                "Demand": 1,
                "Price range": 1,
                "distance_to_airport": 1}

# Transform the weights dictionary into a list, keeping the scaled data column order
weights = [weights_dict[idx] for idx in list(scaler.get_feature_names_out())]

nns_2 = NearestNeighbors(n_neighbors=4, metric_params={'w': weights})
nns_2.fit(data_scaled)
nns_2_results_model_2 = nns_2.kneighbors(data_scaled)[1]

results_model_2 = clean_results(nns_results=nns_2_results_model_2,
                                encoders=encoders,
                                data=data_clean)
model_2_score = model_score(results_model_2, scoring_weights, model_name="baseline_with_weights")
model_2_score

Primary features score keeps improving. Image by author

Passing domain knowledge to the model via weights improved the score significantly. Next, let’s test the impact of the distance measure.

Experiment 3: use Manhattan distance

So far, we have been using the Euclidean distance. Let’s see what happens if we use the Manhattan distance instead.

nns_3 = NearestNeighbors(n_neighbors=4, p=1, metric_params={'w': weights})
nns_3.fit(data_scaled)
nns_3_results_model_3 = nns_3.kneighbors(data_scaled)[1]

results_model_3 = clean_results(nns_results=nns_3_results_model_3,
                                encoders=encoders,
                                data=data_clean)
model_3_score = model_score(results_model_3, scoring_weights, model_name="Manhattan_with_weights")
model_3_score
Significant decrease in primary score. Image by author

Experiment 4: use Chebyshev distance

Decreasing p to 1 resulted in some good improvements. Let’s see what happens as p approximates infinity.

To use the Chebyshev distance, we will change the metric parameter to Chebyshev. The default sklearn Chebyshev metric doesn’t have a weight parameter. To get around this, we will define a custom weighted_chebyshev metric.

# Define the custom weighted Chebyshev distance function
def weighted_chebyshev(u, v, w):
    """Calculate the weighted Chebyshev distance between two points."""
    return np.max(w * np.abs(u - v))

nns_4 = NearestNeighbors(n_neighbors=4, metric=weighted_chebyshev, metric_params={'w': weights})
nns_4.fit(data_scaled)
nns_4_results_model_4 = nns_4.kneighbors(data_scaled)[1]

results_model_4 = clean_results(nns_results=nns_4_results_model_4,
                                encoders=encoders,
                                data=data_clean)
model_4_score = model_score(results_model_4, scoring_weights, model_name="Chebyshev_with_weights")
model_4_score

Better than the baseline but higher than the previous experiment. Image by author

We managed to decrease the primary feature variance scores through experimentation.

Let’s visualize the results.

results_df = pd.DataFrame([model_0_score,model_1_score,model_2_score,model_3_score,model_4_score]).set_index("Model")
results_df.plot(kind='barh')
Experimentation results. Image by author

Using Manhattan distance with weights seems to give the most accurate benchmark sets according to our needs.

The last step before implementing the benchmark sets would be to examine the sets with the highest Primary features scores and identify what steps to take with them.

# Histogram of Primary features score
results_model_3["cat_weighted_variance_score"].plot(kind="hist")
Score distribution. Image by author
exceptions = results_model_3[results_model_3["cat_weighted_variance_score"] >= 0.4]

print(f"There are {exceptions.shape[0]} benchmark sets with significant variance across the primary features")

Image by author

These 18 cases will need to be reviewed to ensure the benchmark sets are relevant.

As you can see, with a few lines of code and some understanding of nearest neighbor search, we managed to set up internal benchmark sets. We can now distribute the sets and start measuring hotels’ KPIs against their benchmark sets.

You don’t always have to focus on the most cutting-edge machine learning methods to deliver value. Very often, simple machine learning can deliver great value.

What are some low-hanging fruits in your business that you could easily tackle with machine learning?

REFERENCES

World Bank. “World Development Indicators.” Retrieved June 11, 2024, from https://datacatalog.worldbank.org/search/dataset/0038117

Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (n.d.). On the Surprising Behavior of Distance Metrics in High Dimensional Space. IBM T. J. Watson Research Center and Institute of Computer Science, University of Halle. Retrieved from https://bib.dbvis.de/uploadedFiles/155.pdf

SciPy v1.10.1 Manual. scipy.spatial.distance.minkowski. Retrieved June 11, 2024, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.minkowski.html

GeeksforGeeks. Haversine formula to find distance between two points on a sphere. Retrieved June 11, 2024, from https://www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/

scikit-learn. Neighbors Module. Retrieved June 11, 2024, from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors