Benjamin Marie

AI

Torch Compile: 2x Faster Llama 3.2 with Low Effort

But it will depend on your GPU

Benjamin Marie · Published in Towards Data Science · 5 min read

Torch Compile (torch.compile) was first introduced with PyTorch 2.0, but it took several updates and optimizations before it could reliably support most large language models (LLMs). When it comes to inference, torch.compile can genuinely speed up decoding with only a small increase in memory usage…
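
As a rough illustration, here is a minimal sketch of this recipe with Hugging Face transformers, assuming a recent transformers/PyTorch release and a hypothetical Llama 3.2 checkpoint ID:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # assumed checkpoint; any decoder-only LLM works

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# A static KV cache gives torch.compile fixed tensor shapes to specialize on.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
# The first generations are slow while kernels compile; later calls are faster.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```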

Read More »
AI

Fixing Faulty Gradient Accumulation: Understanding the Issue and Its Resolution

Years of suboptimal model training?

Benjamin Marie · Published in Towards Data Science · 10 min read

When fine-tuning large language models (LLMs) locally, using large batch sizes is often impractical due to their substantial GPU memory consumption. To overcome this limitation, a technique called gradient accumulation is commonly used to simulate larger batch sizes. Instead of updating the model weights after processing each batch…
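
To make the mechanics concrete, here is a self-contained sketch of standard gradient accumulation on a toy model; the loss scaling on the marked line is exactly the step that goes wrong for token-level LLM losses when micro-batches contain different numbers of tokens:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup so the loop below runs end to end.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=4)

accumulation_steps = 8  # micro-batches per optimizer update (simulated batch size 32)

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Dividing by accumulation_steps makes the accumulated gradient match one
    # large batch; with variable-length sequences this simple division is biased.
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```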

Read More »
AI

Run and Serve Faster VLMs Like Pixtral and Phi-3.5 Vision with vLLM

Understanding how much memory you need to serve a VLM

Benjamin Marie · Published in Towards Data Science · 10 min read

vLLM is currently one of the fastest inference engines for large language models (LLMs). It supports a wide range of model architectures and quantization methods. vLLM also supports vision-language models (VLMs) with multimodal inputs containing both images…
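
For orientation, here is a minimal sketch of vLLM's multimodal input API, assuming a recent vLLM release and Phi-3.5-vision as the checkpoint (Pixtral follows the same pattern with its own prompt format):

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},  # caps images per prompt to bound memory
)

# Phi-3.5-vision's chat template references images as <|image_1|>, <|image_2|>, ...
prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n"
image = Image.open("photo.jpg")  # hypothetical local file

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```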

Read More »
AI

GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU

Fast and accurate GGUF models for your CPU

Benjamin Marie · Published in Towards Data Science · 7 min read

GGUF is a binary file format designed for efficient storage and fast large language model (LLM) loading with GGML, a C-based tensor library for machine learning. GGUF encapsulates all necessary components for inference, including the tokenizer and code, within a single file. It supports the conversion…
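
Once a model has been converted and quantized (for example to a K-quant type such as Q4_K_M, optionally calibrated with an importance matrix), loading it for CPU inference is short. A minimal sketch with the llama-cpp-python bindings, assuming a hypothetical local GGUF file:

```python
from llama_cpp import Llama

# Hypothetical file produced by llama.cpp's conversion and quantization tools.
llm = Llama(model_path="llama-3.1-8b-Q4_K_M.gguf", n_ctx=2048, n_threads=8)

out = llm("Q: What is the GGUF format?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```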

Read More »
AI

Serve Multiple LoRA Adapters with vLLM

Without any increase in latency

Benjamin Marie · Published in Towards Data Science · 6 min read

With a LoRA adapter, we can specialize a large language model (LLM) for a task or a domain. The adapter must be loaded on top of the LLM to be used for inference. For some applications, it might be useful to serve users with multiple adapters. For instance, one…
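
For illustration, here is a minimal sketch of per-request adapter routing with vLLM's LoRA support; the adapter names and paths are hypothetical:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora lets vLLM apply adapters per request, without reloading the base model.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True, max_loras=2)
params = SamplingParams(max_tokens=64)

# Each request can name a different adapter: (name, unique int ID, local path).
sql_out = llm.generate(
    "Translate to SQL: list all users.",
    params,
    lora_request=LoRARequest("sql_adapter", 1, "/adapters/sql"),
)
chat_out = llm.generate(
    "Hello! How are you?",
    params,
    lora_request=LoRARequest("chat_adapter", 2, "/adapters/chat"),
)
```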

Read More »
AI

AutoRound: Accurate Low-bit Quantization for LLMs

Between quantization-aware training and post-training quantization

Benjamin Marie · Published in Towards Data Science · 7 min read

There are many quantization methods to reduce the size of large language models (LLMs). Recently, better low-bit quantization methods have been proposed. For instance, AQLM achieves 2-bit quantization while preserving most of the model’s accuracy.
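
For context, a minimal sketch of Intel's auto-round package, which tunes the rounding of each weight on a small calibration set (via sign-gradient descent) instead of running full quantization-aware training; the checkpoint ID is assumed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed base model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Learns rounding offsets on calibration data: cheaper than QAT,
# more accurate than round-to-nearest post-training quantization.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./llama-3-8b-autoround-4bit")
```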

Read More »
AI

Quantize Llama 3 8B with Bitsandbytes to Preserve Its Accuracy

Llama 2 vs. Llama 3 vs. Mistral 7B, quantized with GPTQ and Bitsandbytes

Benjamin Marie · Published in Towards Data Science · 6 min read

With quantization, we can reduce the size of large language models (LLMs). Quantized LLMs are easier to run on GPUs with smaller memory, effectively serving as a compression method for LLMs.
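
A minimal sketch of 4-bit loading with bitsandbytes through transformers; unlike GPTQ, no calibration data is needed since quantization happens on the fly at load time:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # gated model; assumes access is granted

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # also quantizes the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```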

Read More »
AI

Turn Llama 3 into an Embedding Model with LLM2Vec

RAG with Llama 3 for the generation and the retrieval

Benjamin Marie · Published in Towards Data Science · 7 min read

Embedding models are a critical component of retrieval-augmented generation (RAG) for large language models (LLMs). They encode the knowledge base and the query written by the user. Using an embedding model trained or fine-tuned for the same domain as the LLM can…
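
As a sketch of retrieval with such an encoder, assuming the llm2vec package and the published McGill-NLP Llama 3 checkpoints:

```python
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# Encode documents and a query, then rank documents by cosine similarity.
docs = l2v.encode([
    "LLM2Vec turns decoder-only LLMs into text encoders.",
    "Paris is the capital of France.",
])
query = l2v.encode(["What is LLM2Vec?"])
print(torch.nn.functional.cosine_similarity(query, docs))
```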

Read More »
AI

ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step

A much cheaper alignment method performing as well as DPO

Benjamin Marie · Published in Towards Data Science · 7 min read

There are now many methods to align large language models (LLMs) with human preferences. Reinforcement learning with human feedback (RLHF) was one of the first and brought us ChatGPT, but RLHF is very costly. DPO, IPO, and KTO are notably…
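
For a rough idea of the training setup, a sketch with TRL's ORPOTrainer; the dataset and model IDs are assumptions, and any preference dataset with chosen/rejected pairs would do:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed base model; no separate SFT stage
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# beta weights the odds-ratio penalty that replaces DPO's frozen reference model.
config = ORPOConfig(output_dir="llama-3-orpo", beta=0.1, max_length=1024)
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named "tokenizer" in older TRL releases
)
trainer.train()
```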

Read More »
AI

Marlin: Nearly Ideal Inference Speed for 4-bit Large Language Models

Up to 4x faster than inference with fp16 parameters

Benjamin Marie · Published in Towards Data Science · 6 min read

Large language models (LLMs) are often too large to be directly used on consumer hardware. To reduce their size, various techniques have been proposed to quantize LLMs and lower their memory consumption. While recent algorithms for 4-bit quantization are often released along with their own…
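
For reference, loading a 4-bit GPTQ checkpoint with the Marlin kernel in vLLM is mostly a matter of selecting the right quantization backend; the checkpoint name below is hypothetical, and Marlin requires symmetric 4-bit quantization on an Ampere or newer GPU:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/Meta-Llama-3-8B-Instruct-GPTQ-4bit",  # hypothetical repo
    quantization="gptq_marlin",  # vLLM also auto-selects Marlin when compatible
)

outputs = llm.generate(
    ["Explain what the Marlin kernel optimizes."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```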

Read More »