Deploying Your Llama Model with vLLM on a SageMaker Endpoint

Leveraging AWS’s MLOps platform to serve your LLMs

Instances in an MLOps workflow that require an inference endpoint (created by author).

In any machine learning project, the goal is to train a model that others can use to make good predictions. To do that, the model needs to be served for inference. Several stages of this workflow require an inference endpoint: model evaluation, and then the development, staging, and finally production environments where end users consume it.

In this article, I will demonstrate how to deploy one of the latest LLMs, Llama, with one of the latest serving technologies, vLLM, using AWS’s SageMaker endpoint and its DJL image. What are these components, and how do they come together to form an inference endpoint?

How these components work together to serve the model in AWS: the SageMaker endpoint is the GPU instance, DJL provides the template Docker image, and vLLM is the model server (created by author).

SageMaker is an AWS service comprising a large suite of tools and services for managing the machine learning lifecycle. Its inference service is the SageMaker endpoint, which under the hood is essentially a virtual machine fully managed by AWS.
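To make the deployment flow concrete, here is a minimal sketch of standing up such an endpoint with the sagemaker Python SDK. The image URI, S3 model path, IAM role, instance type, endpoint name, and request payload format below are placeholder assumptions for illustration, not values taken from this article.

```python
# Minimal sketch: deploy a serving container image to a SageMaker endpoint
# using the sagemaker Python SDK. All identifiers are placeholders.
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"  # hypothetical IAM role

model = Model(
    image_uri="<inference-container-image-uri>",        # e.g. a DJL-based serving image (region-specific)
    model_data="s3://my-bucket/llama/model.tar.gz",      # hypothetical packaged model artifacts/config
    role=role,
    sagemaker_session=session,
    predictor_cls=Predictor,                             # so deploy() returns a Predictor we can call
)

# deploy() provisions the AWS-managed GPU instance and creates the endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",                       # example GPU instance type
    endpoint_name="llama-vllm-endpoint",                 # hypothetical endpoint name
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Invoke the endpoint with a simple JSON prompt payload (format depends on the container's handler).
print(predictor.predict({"inputs": "What is vLLM?", "parameters": {"max_new_tokens": 64}}))
```

Once deployed, the endpoint can also be invoked from any client through boto3’s sagemaker-runtime API, and it should be deleted when no longer needed to avoid paying for an idle GPU instance.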

DJL (Deep Java Library) is an open-source library developed by AWS that is used to build LLM inference Docker images bundling serving backends such as vLLM [2]. This image is used in…