Benjamin Marie

AI

Torch Compile: 2x Faster Llama 3.2 with Low Effort

But it will depend on your GPU

Benjamin Marie · Published in Towards Data Science · 5 min read

Torch Compile (torch.compile) was first introduced with PyTorch 2.0, but it took several updates and optimizations before it could reliably support most large language models (LLMs). When it comes to inference, torch.compile can genuinely speed up decoding with only a small increase in memory usage…
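
As a rough illustration, here is a minimal sketch of this recipe with Hugging Face transformers, assuming a recent transformers/PyTorch release and a hypothetical Llama 3.2 checkpoint ID:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # assumed checkpoint; any decoder-only LLM works

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# A static KV cache gives torch.compile fixed tensor shapes to specialize on.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
# The first generations are slow while kernels compile; later calls are faster.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```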

Read More »
AI

Fixing Faulty Gradient Accumulation: Understanding the Issue and Its Resolution

Years of suboptimal model training?

Benjamin Marie · Published in Towards Data Science · 10 min read

When fine-tuning large language models (LLMs) locally, using large batch sizes is often impractical due to their substantial GPU memory consumption. To overcome this limitation, a technique called gradient accumulation is commonly used to simulate larger batch sizes. Instead of updating the model weights after processing each batch…
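
To make the mechanics concrete, here is a self-contained sketch of standard gradient accumulation on a toy model; the loss scaling on the marked line is exactly the step that goes wrong for token-level LLM losses when micro-batches contain different numbers of tokens:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup so the loop below runs end to end.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=4)

accumulation_steps = 8  # micro-batches per optimizer update (simulated batch size 32)

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Dividing by accumulation_steps makes the accumulated gradient match one
    # large batch; with variable-length sequences this simple division is biased.
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```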

Read More »
AI

Run and Serve Faster VLMs Like Pixtral and Phi-3.5 Vision with vLLM

Understanding how much memory you need to serve a VLM

Benjamin Marie · Published in Towards Data Science · 10 min read

vLLM is currently one of the fastest inference engines for large language models (LLMs). It supports a wide range of model architectures and quantization methods. vLLM also supports vision-language models (VLMs) with multimodal inputs containing both images…
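
For orientation, here is a minimal sketch of vLLM's multimodal input API, assuming a recent vLLM release and Phi-3.5-vision as the checkpoint (Pixtral follows the same pattern with its own prompt format):

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},  # caps images per prompt to bound memory
)

# Phi-3.5-vision's chat template references images as <|image_1|>, <|image_2|>, ...
prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n"
image = Image.open("photo.jpg")  # hypothetical local file

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```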

Read More »
AI

GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU

Fast and accurate GGUF models for your CPU

Benjamin Marie · Published in Towards Data Science · 7 min read

GGUF is a binary file format designed for efficient storage and fast large language model (LLM) loading with GGML, a C-based tensor library for machine learning. GGUF encapsulates all necessary components for inference, including the tokenizer and code, within a single file. It supports the conversion…
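
Once a model has been converted and quantized (for example to a K-quant type such as Q4_K_M, optionally calibrated with an importance matrix), loading it for CPU inference is short. A minimal sketch with the llama-cpp-python bindings, assuming a hypothetical local GGUF file:

```python
from llama_cpp import Llama

# Hypothetical file produced by llama.cpp's conversion and quantization tools.
llm = Llama(model_path="llama-3.1-8b-Q4_K_M.gguf", n_ctx=2048, n_threads=8)

out = llm("Q: What is the GGUF format?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```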

Read More »
AI

Serve Multiple LoRA Adapters with vLLM

Without any increase in latency

Benjamin Marie · Published in Towards Data Science · 6 min read

With a LoRA adapter, we can specialize a large language model (LLM) for a task or a domain. The adapter must be loaded on top of the LLM to be used for inference. For some applications, it might be useful to serve users with multiple adapters. For instance, one…
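
For illustration, here is a minimal sketch of per-request adapter routing with vLLM's LoRA support; the adapter names and paths are hypothetical:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora lets vLLM apply adapters per request, without reloading the base model.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True, max_loras=2)
params = SamplingParams(max_tokens=64)

# Each request can name a different adapter: (name, unique int ID, local path).
sql_out = llm.generate(
    "Translate to SQL: list all users.",
    params,
    lora_request=LoRARequest("sql_adapter", 1, "/adapters/sql"),
)
chat_out = llm.generate(
    "Hello! How are you?",
    params,
    lora_request=LoRARequest("chat_adapter", 2, "/adapters/chat"),
)
```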

Read More »
AI

AutoRound: Accurate Low-bit Quantization for LLMs

Between quantization-aware training and post-training quantization

Benjamin Marie · Published in Towards Data Science · 7 min read

There are many quantization methods to reduce the size of large language models (LLMs). Recently, better low-bit quantization methods have been proposed. For instance, AQLM achieves 2-bit quantization while preserving most of the model’s accuracy.
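
For context, a minimal sketch of Intel's auto-round package, which tunes the rounding of each weight on a small calibration set (via sign-gradient descent) instead of running full quantization-aware training; the checkpoint ID is assumed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed base model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Learns rounding offsets on calibration data: cheaper than QAT,
# more accurate than round-to-nearest post-training quantization.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./llama-3-8b-autoround-4bit")
```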

Read More »
AI

Quantize Llama 3 8B with Bitsandbytes to Preserve Its Accuracy

Llama 2 vs. Llama 3 vs. Mistral 7B, quantized with GPTQ and Bitsandbytes

Benjamin Marie · Published in Towards Data Science · 6 min read

With quantization, we can reduce the size of large language models (LLMs). Quantized LLMs are easier to run on GPUs with smaller memory, effectively serving as a compression method for LLMs.
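
A minimal sketch of 4-bit loading with bitsandbytes through transformers; unlike GPTQ, no calibration data is needed since quantization happens on the fly at load time:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # gated model; assumes access is granted

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # also quantizes the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```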

Read More »
AI

Turn Llama 3 into an Embedding Model with LLM2Vec

RAG with Llama 3 for the generation and the retrieval

Benjamin Marie · Published in Towards Data Science · 7 min read

Embedding models are a critical component of retrieval-augmented generation (RAG) for large language models (LLMs). They encode the knowledge base and the query written by the user. Using an embedding model trained or fine-tuned for the same domain as the LLM can…
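
As a sketch of retrieval with such an encoder, assuming the llm2vec package and the published McGill-NLP Llama 3 checkpoints:

```python
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# Encode documents and a query, then rank documents by cosine similarity.
docs = l2v.encode([
    "LLM2Vec turns decoder-only LLMs into text encoders.",
    "Paris is the capital of France.",
])
query = l2v.encode(["What is LLM2Vec?"])
print(torch.nn.functional.cosine_similarity(query, docs))
```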

Read More »
AI

ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step

A much cheaper alignment method performing as well as DPO

Benjamin Marie · Published in Towards Data Science · 7 min read

There are now many methods to align large language models (LLMs) with human preferences. Reinforcement learning with human feedback (RLHF) was one of the first and brought us ChatGPT, but RLHF is very costly. DPO, IPO, and KTO are notably…
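
For a rough idea of the training setup, a sketch with TRL's ORPOTrainer; the dataset and model IDs are assumptions, and any preference dataset with chosen/rejected pairs would do:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed base model; no separate SFT stage
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# beta weights the odds-ratio penalty that replaces DPO's frozen reference model.
config = ORPOConfig(output_dir="llama-3-orpo", beta=0.1, max_length=1024)
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named "tokenizer" in older TRL releases
)
trainer.train()
```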

Read More »
AI

Marlin: Nearly Ideal Inference Speed for 4-bit Large Language Models

Up to 4x faster than inference with fp16 parameters

Benjamin Marie · Published in Towards Data Science · 6 min read

Large language models (LLMs) are often too large to be directly used on consumer hardware. To reduce their size, various techniques have been proposed to quantize LLMs and lower their memory consumption. While recent algorithms for 4-bit quantization are often released along with their own…
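
For reference, loading a 4-bit GPTQ checkpoint with the Marlin kernel in vLLM is mostly a matter of selecting the right quantization backend; the checkpoint name below is hypothetical, and Marlin requires symmetric 4-bit quantization on an Ampere or newer GPU:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/Meta-Llama-3-8B-Instruct-GPTQ-4bit",  # hypothetical repo
    quantization="gptq_marlin",  # vLLM also auto-selects Marlin when compatible
)

outputs = llm.generate(
    ["Explain what the Marlin kernel optimizes."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```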

Read More »