Leveraging AWS’s MLOps platform to serve your LLMs
In any machine learning project, the goal is to train a model that others can use to obtain good predictions. To do that, the model needs to be served for inference. Several stages of the workflow require such an inference endpoint: model evaluation, followed by releases to the development, staging, and finally production environments for end users to consume.
In this article, I will demonstrate how to deploy some of the latest LLM and serving technologies, namely Llama and vLLM, using AWS’s SageMaker endpoint and its DJL image. What are these components, and how do they come together to form an inference endpoint?
SageMaker is an AWS service comprising a large suite of tools and services for managing the machine learning lifecycle. Its inference service is known as the SageMaker endpoint. Under the hood, it is essentially a virtual machine that AWS manages on your behalf.
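To make this concrete, below is a minimal sketch of deploying such an endpoint with the SageMaker Python SDK. The image tag, environment variables, model identifier, and instance type are illustrative assumptions rather than the exact configuration used in this article; consult the AWS DJL-LMI documentation for the values that match your region and versions.

```python
import sagemaker
from sagemaker import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Placeholder DJL-LMI (Large Model Inference) image URI -- look up the exact
# tag for your region and DJL version in the AWS documentation.
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:<lmi-tag>"

model = Model(
    image_uri=image_uri,
    role=role,
    sagemaker_session=session,
    env={
        # Hugging Face model to download at container startup (assumed identifier).
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
        # Ask the DJL container to use vLLM as its rolling-batch engine.
        "OPTION_ROLLING_BATCH": "vllm",
        # Shard the model across all GPUs available on the instance.
        "TENSOR_PARALLEL_DEGREE": "max",
    },
)

# Provision the AWS-managed virtual machine behind the endpoint.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # a GPU instance is required for vLLM
    endpoint_name="llama-vllm-demo",
)

# Invoke the endpoint with a simple JSON payload.
predictor = Predictor(
    endpoint_name="llama-vllm-demo",
    sagemaker_session=session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
print(predictor.predict({"inputs": "What is vLLM?",
                         "parameters": {"max_new_tokens": 64}}))
```

Remember to delete the endpoint when you are done (`predictor.delete_endpoint()`), since an idle GPU instance continues to accrue charges.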
DJL (Deep Java Library) is an open-source library developed by AWS that is used to build LLM inference Docker images, including one that ships with vLLM [2]. This image is used in…