As tech companies race to deliver on-device AI, we are seeing a growing body of research and techniques for creating small language models (SLMs) that can run on resource-constrained devices.
The latest example comes from a research team at Nvidia, which used recent advances in pruning and distillation to create Llama-3.1-Minitron 4B, a compressed version of the Llama 3.1 8B model. This model rivals the performance of both larger models and equally sized SLMs while being significantly more efficient to train and deploy.
The power of pruning and distillation
Pruning and distillation are two key techniques for creating smaller, more efficient language models. Pruning involves removing less important components of a model. “Depth pruning” removes complete layers while “width pruning” drops specific elements such as neurons and attention heads.
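To make the difference between the two pruning styles concrete, here is a minimal sketch on a made-up toy model. The weights, the function names, and the use of a weight norm as the importance score are all assumptions for illustration; the article does not say how importance is measured, and the toy ignores how layers connect to one another.

```python
import math

# Toy "model": a list of layers; each layer is a list of neurons;
# each neuron is a list of weights. All values are invented.
toy_model = [
    [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4]],   # layer 0
    [[0.1, 0.0], [0.7, -0.6], [0.03, 0.01]],   # layer 1
    [[0.2, 0.3], [0.0, 0.05], [-0.9, 0.9]],    # layer 2
    [[0.6, 0.1], [0.4, -0.4], [0.02, 0.0]],    # layer 3
]

def neuron_importance(neuron):
    """Assumed importance proxy: L2 norm of the neuron's weights."""
    return math.sqrt(sum(w * w for w in neuron))

def depth_prune(model, keep_layers):
    """Depth pruning: drop whole layers, keeping only the given indices."""
    return [model[i] for i in keep_layers]

def width_prune(model, keep_per_layer):
    """Width pruning: keep every layer, but only its top-k neurons."""
    pruned = []
    for layer in model:
        ranked = sorted(range(len(layer)),
                        key=lambda i: neuron_importance(layer[i]),
                        reverse=True)
        kept = sorted(ranked[:keep_per_layer])  # preserve original neuron order
        pruned.append([layer[i] for i in kept])
    return pruned

def count_weights(model):
    return sum(len(neuron) for layer in model for neuron in layer)

shallow = depth_prune(toy_model, keep_layers=[0, 1])
narrow = width_prune(toy_model, keep_per_layer=2)
print(count_weights(toy_model), count_weights(shallow), count_weights(narrow))
```

The depth-pruned model has fewer layers of the original width, while the width-pruned model keeps all four layers but makes each one thinner. In a real network, removing a neuron would also remove the matching input connections in the next layer.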
Model distillation is a technique that transfers knowledge and capabilities from a large model, often called the “teacher model,” to a smaller, simpler “student model.” There are two main ways to do distillation. The first is “SDG fine-tuning,” short for synthetic data generation, where the student model is trained on the inputs and responses of the teacher. The second is “classical knowledge distillation,” where in addition to the results, the student is trained on the inner activations of the teacher model.
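The gap between learning only from the teacher's answers and learning from richer teacher signals can be sketched with two loss functions. This is a simplified illustration, not the researchers' training code: it compares the teacher's single top token against its full output distribution, leaves intermediate activations out entirely, and the temperature value and example logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, teacher_logits):
    """Cross-entropy against the teacher's single top token (output-only signal)."""
    target = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return -math.log(softmax(student_logits)[target])

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the full, temperature-softened distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 3.5, 0.5, -1.0]      # teacher is torn between tokens 0 and 1
student_a = [4.0, 0.0, 0.0, 0.0]     # right top token, ignores the runner-up
student_b = [4.0, 3.5, 0.5, -1.0]    # matches the teacher everywhere

for name, s in [("student_a", student_a), ("student_b", student_b)]:
    print(name, round(hard_label_loss(s, teacher), 4),
          round(soft_label_loss(s, teacher), 4))
```

The hard-label loss scores `student_a` better because it is more confident about the top token, even though it has lost the teacher's uncertainty. The soft-label loss is zero only for `student_b`, which reproduces the teacher's whole distribution.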
In a previous study, Nvidia researchers demonstrated the effectiveness of combining pruning with classical knowledge distillation. They started with the Nemotron 15B model and progressively pruned and distilled it down to an 8-billion-parameter model. They then performed a light retraining procedure using model distillation, with the original model as the teacher and the pruned model as the student. Finally, they repeated the process with the 8B model as the starting point to create a smaller 4B model.
This approach resulted in a 16% improvement in performance on the popular MMLU benchmark compared to training a 4-billion-parameter model from scratch. Impressively, the entire process required 40X fewer tokens than training the model from scratch. The model’s performance was comparable to Mistral 7B, Gemma 7B, and Llama-3 8B, which were trained on trillions of tokens.
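The prune-then-distill loop described above can be illustrated at toy scale. The sketch below is a stand-in under heavy assumptions: the "teacher" is a hypothetical four-weight linear model, pruning keeps the two largest-magnitude weights, and distillation is plain gradient descent on the gap between student and teacher outputs. None of these choices come from the Nvidia study.

```python
import random

rng = random.Random(0)

# Hypothetical "teacher": a linear model over 4 features.
teacher_w = [2.0, -1.5, 0.5, 0.4]
all_idx = list(range(len(teacher_w)))

def make_input():
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    # Features 2 and 3 are noisy copies of features 0 and 1, so a smaller
    # model can absorb most of their contribution once it is retrained.
    return [a, b, a + 0.1 * rng.gauss(0, 1), b + 0.1 * rng.gauss(0, 1)]

data = [make_input() for _ in range(200)]

def predict(weights, x, idx):
    return sum(w * x[i] for w, i in zip(weights, idx))

def mse_vs_teacher(weights, idx):
    return sum((predict(weights, x, idx) - predict(teacher_w, x, all_idx)) ** 2
               for x in data) / len(data)

# Step 1: prune. Keep the 2 largest-magnitude weights (assumed importance rule).
ranked = sorted(all_idx, key=lambda i: abs(teacher_w[i]), reverse=True)
kept = sorted(ranked[:2])
student_w = [teacher_w[i] for i in kept]
pruned_mse = mse_vs_teacher(student_w, kept)

# Step 2: distill. Full-batch gradient descent toward the teacher's outputs.
lr = 0.1
for _ in range(200):
    grads = [0.0] * len(student_w)
    for x in data:
        err = predict(student_w, x, kept) - predict(teacher_w, x, all_idx)
        for j, i in enumerate(kept):
            grads[j] += 2 * err * x[i] / len(data)
    student_w = [w - lr * g for w, g in zip(student_w, grads)]

distilled_mse = mse_vs_teacher(student_w, kept)
print(f"after pruning: {pruned_mse:.4f}, after distillation: {distilled_mse:.4f}")
```

Pruning alone leaves the student with a noticeable error against the teacher. The short retraining pass lets the surviving weights compensate for the removed ones, which is the intuition behind following each pruning step with distillation.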
Distilling Llama 3.1
Building on their previous work, the Nvidia team decided to apply the same techniques to the Llama 3.1 8B model. Their goal was to create a 4-billion-parameter version of the model that could match the performance of larger models while being more efficient to train.
The first step was to fine-tune the unpruned 8B model on a 94-billion-token dataset to correct for the distribution shift between the original model’s training data and their distillation dataset.
“Experiments showed that, without correcting for the distribution shift, the teacher provides suboptimal guidance on the dataset when being distilled,” the researchers wrote in a blog post.
Next, the researchers applied two types of pruning: depth-only pruning, where they removed 50% of the layers, and width-only pruning, where they removed 50% of the neurons from some of the dense layers in the transformer blocks. This resulted in two different versions of the Llama-3.1-Minitron 4B model.
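A rough parameter count shows why the two pruning strategies shrink a model differently. The sketch below assumes a common decoder-only layout (grouped-query attention, a gated MLP, untied embeddings), and the configuration values are illustrative numbers for a model of roughly 8 billion parameters. They are not taken from the article, and the width-pruned variant here simply halves the MLP size rather than reproducing the researchers' actual recipe.

```python
def transformer_params(layers, hidden, mlp_dim, vocab, q_heads, kv_heads, head_dim):
    """Rough parameter count for a decoder-only transformer (assumed layout)."""
    attention = hidden * q_heads * head_dim * 2     # query and output projections
    attention += hidden * kv_heads * head_dim * 2   # key and value projections
    mlp = 3 * hidden * mlp_dim                      # gate, up, and down projections
    norms = 2 * hidden                              # two norm vectors per block
    embeddings = 2 * vocab * hidden                 # token embeddings + output head
    return layers * (attention + mlp + norms) + embeddings + hidden  # + final norm

# Assumed, illustrative configuration; not stated in the article.
base = dict(layers=32, hidden=4096, mlp_dim=14336, vocab=128256,
            q_heads=32, kv_heads=8, head_dim=128)

full = transformer_params(**base)
depth_pruned = transformer_params(**{**base, "layers": 16})     # drop half the layers
width_pruned = transformer_params(**{**base, "mlp_dim": 7168})  # halve the MLP neurons

for name, n in [("full", full), ("depth-pruned", depth_pruned),
                ("width-pruned", width_pruned)]:
    print(f"{name}: {n / 1e9:.2f}B parameters")
```

Under these assumptions, removing half the layers leaves more than half the parameters, because the embedding tables are untouched. Halving only the MLP neurons removes fewer parameters still, since attention and embeddings keep their original size.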
Finally, the researchers fine-tuned the pruned models using NeMo-Aligner, a toolkit that supports various alignment algorithms such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO) and Nvidia’s own SteerLM.
The researchers evaluated the Llama-3.1-Minitron 4B models on abilities in instruction following, roleplay, retrieval-augmented generation (RAG), and function-calling.
The results showed that despite its small training corpus, Llama-3.1-Minitron 4B performs close to other SLMs, including Phi-2 2.7B, Gemma2 2.6B, and Qwen2 1.5B. While Llama-3.1-Minitron 4B is at least 50% larger than those models, it has been trained on a fraction of the training data. This offers a new way to balance the costs of training and inference.
The team has released the width-pruned version of the model on Hugging Face under the Nvidia Open Model License, which allows for commercial use. This makes it accessible to a wider range of users and developers who can benefit from its efficiency and performance.
“Pruning and classical knowledge distillation is a highly cost-effective method to progressively obtain LLMs [large language models] of smaller size, achieving superior accuracy compared to training from scratch across all domains,” the researchers wrote. “It serves as a more effective and data-efficient approach compared to either synthetic-data-style fine-tuning or pretraining from scratch.”
This work is a reminder of the value and importance of the open-source community to the progress of AI. Pruning and distillation are part of a wider body of research that is enabling companies to optimize and customize LLMs at a fraction of the normal cost. Other notable works in the field include Sakana AI’s evolutionary model-merging algorithm, which makes it possible to assemble parts of different models to combine their strengths without the need for expensive training resources.