Algorithm-Agnostic Model Building with MLflow

A beginner-friendly step-by-step guide to creating generic ML pipelines using mlflow.pyfunc

One common challenge in MLOps is the hassle of migrating between various algorithms or frameworks. This beginner-friendly article helps you tackle the challenge by leveraging algorithm-agnostic model building using mlflow.pyfunc.

Why Algorithm-Agnostic Model Building?

Consider this scenario: we have an sklearn model currently deployed in production for a particular use case. Later on, we find that a deep learning model performs even better. If the sklearn model was deployed in its native format, transitioning to the deep learning model could be a hassle 🤪 because the two model artifacts are very different.

Image generated by prompting Gemini

To address such a challenge, the mlflow.pyfunc model flavor provides a versatile and generic approach to building and deploying machine learning models in Python. 😎

1. Generic Model Building: The pyfunc model flavor offers a uniform way to build models, regardless of the framework or library used for the build.

2. Encapsulation of the ML Pipeline: pyfunc allows us to encapsulate the model with its pre- and post-processing steps or other custom logic desirable during model consumption.

3. Unified Model Representation: We can deploy a model, a machine learning pipeline, or any Python function using pyfunc without worrying about the model’s underlying format. Such a unified representation simplifies model deployment, redeployment, and downstream scoring.

Sounds interesting? If yes, this article is here to get you started with mlflow.pyfunc. 🥂

  • First, let’s go through a simple toy example of creating an mlflow.pyfunc model.
  • Then, we will define an mlflow.pyfunc class that encapsulates a machine learning pipeline (an estimator plus some preprocessing logic, as an example). We will also train, log, and load this ML pipeline for inference.
  • Lastly, we will take a deep dive into the encapsulated mlflow.pyfunc object, explore the rich metadata and artifacts automatically tracked for us by mlflow, and get a better grasp of the full power that mlflow.pyfunc offers.

🔗 All code and config are available on GitHub. 🧰

{pyfunc} Simple Toy Model

First, let’s create a simple toy mlflow.pyfunc model and then use it with the mlflow workflow.

  • Step 1: Create the model
  • Step 2: Log the model
  • Step 3: Load the logged model to perform the inference
import mlflow.pyfunc

# Step 1: Create an mlflow.pyfunc model
class ToyModel(mlflow.pyfunc.PythonModel):
    """
    ToyModel is a simple example implementation of an MLflow Python model.
    """

    def predict(self, context, model_input):
        """
        A basic predict function that takes a model_input list and returns a new list
        where each element is increased by one.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (list of int or float): A list of numerical values that the model will use for prediction.

        Returns:
        - list of int or float: A list with each element in model_input increased by one.
        """
        return [x + 1 for x in model_input]

As you can see from the example above, you can create an mlflow.pyfunc model to implement any custom Python function you see fit for your ML solution, which doesn’t have to be an off-the-shelf machine learning algorithm.

You can then log this model and load it later to perform the inference.

# Step 2: log this model as an mlflow run
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ToyModel()
    )
    run_id = mlflow.active_run().info.run_id

# Step 3: load the logged model to perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")

# dummy new data
x_new = [1, 2, 3]

# model inference for the new data
print(model.predict(x_new))
[2, 3, 4]

{pyfunc} Encapsulated XGBoost Pipeline

Now, let’s create an ML pipeline encapsulating an estimator with additional custom logic.

In the example below, the XGB_PIPELINE class is a wrapper that integrates the estimator with preprocessing steps, which can be desirable for some MLOps implementations. Leveraging mlflow.pyfunc, this wrapper is estimator-agnostic and offers a uniform model representation. Specifically,

  • fit(): Instead of exposing XGBoost’s native training API (xgboost.train()) directly, this class wraps it in a .fit() method that adheres to sklearn conventions, enabling straightforward integration into sklearn pipelines and ensuring consistency across different estimators.
  • DMatrix(): DMatrix is a core data structure in XGBoost that optimizes data for training and prediction. In this class, the step of transforming a pandas DataFrame into a DMatrix is wrapped within the class, enabling seamless integration with pandas DataFrames just like any other sklearn estimator.
  • predict(): This is the mlflow.pyfunc model’s universal inference API. It stays consistent for this ML pipeline, for the toy model above, and for any machine learning algorithm or custom logic we wrap in an mlflow.pyfunc model.
import xgboost as xgb
import mlflow.pyfunc
from typing import Any, Dict, Union
import pandas as pd


class XGB_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    XGB_PIPELINE is an example implementation of an MLflow Python model with XGBoost.
    """

    def __init__(self, params: Dict[str, Union[str, int, float]]):
        """
        Initialize the model with given parameters.

        Parameters:
        - params (Dict[str, Union[str, int, float]]): Parameters for the XGBoost model.
        """
        self.params = params
        self.xgb_model = None
        self.config = None

    def preprocess_input(self, model_input: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocess the input data.

        Parameters:
        - model_input (pd.DataFrame): The input data to preprocess.

        Returns:
        - pd.DataFrame: The preprocessed input data.
        """
        processed_input = model_input.copy()
        # put any desired preprocessing logic here
        processed_input.drop(processed_input.columns[0], axis=1, inplace=True)

        return processed_input

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        """
        Train the XGBoost model.

        Parameters:
        - X_train (pd.DataFrame): The training input data.
        - y_train (pd.Series): The target values.
        """
        processed_model_input = self.preprocess_input(X_train.copy())
        dtrain = xgb.DMatrix(processed_model_input, label=y_train)
        self.xgb_model = xgb.train(self.params, dtrain)

    def predict(self, context: Any, model_input: pd.DataFrame) -> Any:
        """
        Predict using the trained XGBoost model.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (pd.DataFrame): The input data for making predictions.

        Returns:
        - Any: The prediction results.
        """
        processed_model_input = self.preprocess_input(model_input.copy())
        dmatrix = xgb.DMatrix(processed_model_input)
        return self.xgb_model.predict(dmatrix)

Now, let’s train and log this model.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import pandas as pd

# Generate synthetic datasets for demo
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train and log the model
with mlflow.start_run(run_name='xgb_demo') as run:

    # Create an instance of XGB_PIPELINE
    params = {
        'objective': 'reg:squarederror',
        'max_depth': 3,
        'learning_rate': 0.1,
    }
    model = XGB_PIPELINE(params)

    # Fit the model
    model.fit(X_train=pd.DataFrame(X_train), y_train=y_train)

    # Log the model
    model_info = mlflow.pyfunc.log_model(
        artifact_path='model',
        python_model=model,
    )

    run_id = mlflow.active_run().info.run_id

The model has been logged successfully. ✌️ Now, let’s load it to make inferences.

loaded_model = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)
loaded_model.predict(pd.DataFrame(X_test))
array([ 4.11692047e+00,  7.30551958e+00, -2.36042137e+01, -1.31888123e+02,
...
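As a quick sanity check (not part of the original workflow above), we can compare these predictions against the held-out targets. A minimal sketch, assuming the X_test and y_test split created earlier:

from sklearn.metrics import mean_squared_error

# compare the pyfunc model's predictions against the held-out targets
preds = loaded_model.predict(pd.DataFrame(X_test))
rmse = mean_squared_error(y_test, preds) ** 0.5
print(f"RMSE on the test set: {rmse:.3f}")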

Deep Dive into the mlflow.pyfunc Object

The above process is pretty smooth, isn’t it? This represents the basic functionality of the mlflow.pyfunc object. Now, let’s dive deeper to explore the full power that mlflow.pyfunc has to offer.

1. model_info

In the example above, the model_info object returned by mlflow.pyfunc.log_model() is an instance of the mlflow.models.model.ModelInfo class. It contains metadata and information about the logged model. For example:

Some attributes of the model_info object

Feel free to run dir(model_info) to explore further or check out the source code for all the attributes defined. The attribute I use the most is model_uri, which indicates where the logged model can be found within the mlflow tracking system.
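For instance, a few attributes that are handy to inspect right after logging (a minimal sketch, continuing the XGBoost run above):

# a few commonly used attributes of the returned ModelInfo object
print(model_info.model_uri)      # where the logged model lives, e.g. 'runs:/<run_id>/model'
print(model_info.run_id)         # the run this model was logged under
print(model_info.artifact_path)  # the artifact path passed to log_model(), 'model' here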

2. loaded_model

It is worthwhile clarifying that the loaded_model is not an instance of the XGB_PIPELINE class, but rather a wrapper object provided by mlflow.pyfunc for algorithm-agnostic inference making. As shown below, an error will be returned if you attempt to retrieve attributes of the XGB_PIPELINE class from the loaded_model.

print(loaded_model.params)
AttributeError: 'PyFuncModel' object has no attribute 'params'
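You can verify this yourself by checking the type of the loaded object; it reports mlflow’s generic PyFuncModel wrapper rather than our XGB_PIPELINE class.

# the loaded object is mlflow's generic pyfunc wrapper, not the custom class we defined
print(type(loaded_model))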

3. unwrapped_model

All right, you may ask, then where is the trained instance of XGB_PIPELINE? Is it logged and retrievable through mlflow, too?

Don’t worry; it is kept safe for you to unwrap easily, as shown below.

unwrapped_model = loaded_model.unwrap_python_model()
print(unwrapped_model.params)
{'objective': 'reg:squarederror', 'max_depth': 3, 'learning_rate': 0.1}

That’s how it is done. 😎 With the unwrapped_model, you can access any properties or methods of your custom ML pipeline, just like that (see the short sketch after the list below)! I sometimes add handy methods such as explain_model or post_processing to the custom pipeline, or include helpful attributes to trace the model training process and offer diagnostics 🤩… Well, I’d better stop here and leave those for the following articles. Suffice it to say, you can freely customize your ML pipeline for your use case, knowing that

  1. You will have access to all these tailor-made methods and attributes for downstream use and
  2. This tailor-made custom model will be wrapped within the uniform mlflow.pyfunc inference API and hence enjoy a smooth migration to other estimators if necessary.
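As a small illustration of the first point, here is a sketch that reaches into the unwrapped pipeline and inspects the underlying XGBoost booster; the xgb_model attribute comes from our XGB_PIPELINE definition above, and get_score() is XGBoost’s standard feature-importance API.

# access a custom attribute of our pipeline: the trained native XGBoost booster
booster = unwrapped_model.xgb_model

# for example, inspect gain-based feature importance straight from XGBoost
print(booster.get_score(importance_type='gain'))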

4. Context

You may have noticed that there is a context parameter in the predict methods of both mlflow.pyfunc classes defined above. But interestingly, this parameter is not required when we make predictions with the loaded model. Why❓

loaded_model = mlflow.pyfunc.load_model(model_uri)
# the context parameter is not needed when calling `predict`
loaded_model.predict(model_input)

This is because the loaded_model above is a wrapper object provided by mlflow. If we use the unwrapped model instead, we need to define the context explicitly, as shown below; otherwise, the code will return an error.

unwrapped_model = loaded_model.unwrap_python_model()
# need to provide the context manually
unwrapped_model.predict(context=None, model_input=model_input)

So, what is this context? And what role does it play in the predict method?

The context is a PythonModelContext object that contains artifacts the pyfunc model can use when performing inference. It is created implicitly and automatically by the log_model() method.

Navigate to the mlruns subfolder in your project repo, which is automatically created by mlflow when you log an mlflow model. Find the folder named after the model’s run_id. Inside, you’ll find the model artifacts automatically logged for you, as shown below.

# get run_id of a loaded model
print(loaded_model.metadata.run_id)
38a617d0f30645e8ae95eea4642a03c2
artifacts folder in a logged `mlflow.pyfunc` model
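If you prefer to inspect these artifacts programmatically rather than browsing the mlruns folder, here is a minimal sketch using MlflowClient, assuming the same run and the artifact path 'model' used above:

from mlflow.tracking import MlflowClient

client = MlflowClient()
# list the files mlflow logged under the 'model' artifact path for this run
for artifact in client.list_artifacts(loaded_model.metadata.run_id, "model"):
    print(artifact.path)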

Pretty neat, isn’t it? 😁 Feel free to explore these artifacts at your leisure; below are screenshots of the requirements.txt and MLmodel files from the folder, for your reference.

The requirements.txt file below specifies the versions of the dependencies required to recreate the environment for running the model.

The `requirements.txt` file in the artifacts folder

The MLmodel file below defines, in YAML format, the metadata and configuration necessary to load and serve the model.

The `MLmodel` file in the artifacts folder
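You don’t have to open the YAML file directly, either; the loaded model exposes the same metadata in Python. A small sketch (the exact contents depend on your mlflow version):

# the MLmodel metadata is also accessible on the loaded model object
print(loaded_model.metadata.flavors)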

Conclusion

There you have it: the mlflow.pyfunc approach to model building. That is a lot of information, so let’s recap:

  1. mlflow.pyfunc offers a unified model representation unaffected by the underlying framework or libraries used to build the model.
  2. We can even encapsulate rich custom logic into an mlflow.pyfunc model to tailor each use case while keeping the inference API consistent and unified.
  3. The underlying model can be unwrapped from the loaded mlflow.pyfunc model, allowing us to leverage more custom methods/attributes tailored for each use case.
  4. An mlflow.pyfunc model object is logged with rich metadata and artifacts that are automatically tracked by mlflow.
  5. This unified mlflow.pyfunc model representation can streamline the process of experimenting and migrating between different algorithms to achieve optimal performance (more on this in the following articles; see below).

Next Steps

Now that we have the basics sorted, in the following articles we can continue to discuss more advanced usage of mlflow.pyfunc. 😎 Below are some topics off the top of my head; feel free to leave a comment and let me know what you would like to see. 🥰

  1. Leverage the uniform API to experiment with various algorithms and identify the optimal solution for a use case.
  2. Hyperparameter tuning with mlflow.pyfunc custom models.
  3. Encapsulating custom logic into an mlflow.pyfunc ML pipeline to tailor model consumption and diagnostics.

If you enjoyed reading this article, follow me on Medium. 😁

💼LinkedIn | 😺GitHub | 🕊️Twitter/X

Unless otherwise noted, all images are by the author.