The Ultimate Handbook for LLM Quantization

A deep dive into LLM quantization and techniques


LLMs on a CPU? Yes, you read that right. From holding conversations to generating images, AI has come a long way since its beginnings, but that progress came with a bottleneck: as models grew, so did their computational demands. To meet those demands we turned to GPUs, and the rest is history.

Many devices lack powerful GPUs and therefore miss out on these AI capabilities. To run a model on hardware with limited compute, such as a mobile phone or a CPU-only computer, it was necessary to shrink both the size and the compute requirements of these models. Early efforts relied on techniques like pruning and distillation, but these approaches alone do not scale well to LLMs and their very large architectures.

The recent AI revolution around LLMs has been built largely on cloud servers for training, deployment, and inference. However, major players are now extending LLM capabilities to edge devices; Microsoft’s Copilot+ PCs are a notable example. As we move toward edge deployment, shrinking LLMs without compromising performance or quality becomes crucial, and quantization is one of the most effective ways to achieve this.

In this article, we will deeply explore quantization and some state-of-the-art quantization methods. We will also see how to use them.

Table of Contents

· Quantization: What & Why
Linear/Scale Quantization
Affine Quantization
Post-Training Quantization (PTQ)
Quantization-Aware Training
Why Quantize?
· Latest SOTA Quantization Methods
LLM.int8() (Aug 2022)
GPTQ (Oct 2022)
QLoRA (May 2023)
AWQ (Jun 2023)
QuIP# (Jul 2023)
GGUF (Aug 2023)
HQQ (Nov 2023)
AQLM (Feb 2024)
· Conclusion
· References

Quantization: What & Why ❓

The weights of a neural network can be stored in various datatypes, depending on the precision requirements and the computational resources available. Quantization is the procedure that maps weights from a high-precision representation such as FP32, whose range is determined by the [min, max] of the datatype, to a lower-precision one such as FP16 or even INT8 (8-bit integer).

Image By Author

Consider a 400M-parameter LLM whose weights are stored in FP32 (32 bits, i.e., 4 bytes per weight). Its memory footprint is:

4×10⁸ params × 4 bytes = 1.6 gigabytes

Quantizing the model reduces its size significantly. Converting from FP32 to INT8 (1 byte per weight), the memory footprint becomes:

4×10⁸ params × 1 byte = 0.4 gigabytes

That is a quarter of the original size. The model occupies less memory and inference becomes faster, although accuracy may drop slightly. Many of these lightweight models can also be handled comfortably by a CPU.
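This back-of-the-envelope arithmetic is easy to script. The sketch below only counts weight storage and ignores activations, the KV cache, and quantization metadata such as scales and zero-points, so real memory usage will be somewhat higher.

# Rough memory footprint of model weights for common datatypes
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"400M params in {dtype}: {weight_memory_gb(4e8, dtype):.2f} GB")
# fp32 -> 1.60 GB, int8 -> 0.40 GB, matching the numbers above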

Quantization & Dequantization (Image By Author)

The range mapping of weights during quantization is typically accomplished using two methods.

Linear/Scale Quantization

Here quantization amounts to rescaling: the real range is mapped onto the quantized range with a single scale factor, so that Rmin goes to Qmin and Rmax to Qmax. In this scale-only (symmetric) scheme, the real value 0 maps directly to the quantized value 0, so no separate zero-point is needed; asymmetric ranges are handled by affine quantization, described next.

Image by Author

Affine Quantization

This method can represent asymmetric ranges. It is parameterized by a scale s and a zero-point z:

s = (Rmax - Rmin) / (Qmax - Qmin)
z = round(Qmin - Rmin / s)

and a real value r is mapped to the quantized value q = round(r / s + z). For the INT8 datatype, Qmin = -128 and Qmax = 127.

After this transformation, some values can still fall outside the target range, so an additional clip operation brings them back in:

q = clip(round(r / s + z), Qmin, Qmax)
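Here is a minimal NumPy sketch of the affine scheme above. It uses a single scale and zero-point for the whole tensor; production libraries typically compute them per channel or per group of weights.

import numpy as np

def affine_quantize(r, qmin=-128, qmax=127):
    """Quantize a float array to INT8 using a scale and a zero-point."""
    rmin, rmax = r.min(), r.max()
    scale = (rmax - rmin) / (qmax - qmin)        # step size of one integer level
    zero_point = round(qmin - rmin / scale)      # integer that represents real 0
    q = np.clip(np.round(r / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    """Map the integers back to (approximate) real values."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)
q, s, z = affine_quantize(weights)
print(np.abs(weights - affine_dequantize(q, s, z)).max())  # quantization error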

When it comes to quantizing large language models (LLMs), there are two primary types of quantization techniques:

Post-Training Quantization (PTQ)

As the name suggests, the LLM is quantized after the training phase. The weights are converted from a higher precision to a lower precision data type. It can be applied to both weights and activations. Although speed, memory, and power usage are highly optimized, there is an accuracy trade-off.
During quantization, rounding or truncation occurs, introducing quantization error. This error affects the model’s ability to represent fine-grained differences between weights.

Quantization-Aware Training

This technique was developed to mitigate the potential loss of model accuracy in the case of PTQ. In contrast to PTQ, the quantization process is integrated with the training itself, hence making the process “Quantization Aware”.

In QAT, the model is set up to maintain both full-precision and quantized versions of its elements (weights and activations), creating a dual storage system. During the forward pass of training, a simulated or “fake” quantization is applied, so the model experiences the effects of quantization while gradients are still computed in full precision. This makes the model more robust to quantization.

Source: Paper
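As a toy illustration of what “fake” quantization means, the PyTorch sketch below quantizes and immediately dequantizes the weights in the forward pass while letting gradients flow to the full-precision copy via a straight-through estimator. It is a simplified sketch, not the QAT machinery of any particular framework.

import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization in the forward pass only."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward treats it as identity
    return w + (w_q - w).detach()

w = torch.randn(16, 16, requires_grad=True)
x = torch.randn(4, 16)
out = x @ fake_quantize(w).t()
out.sum().backward()          # gradients reach the full-precision weights
print(w.grad.shape)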

🎯Why Quantize?

  1. Reduced Memory Footprint
    Quantization reduces the memory requirements of an LLM enough that it can be deployed on lower-end machines and edge devices, many of which only support integer datatypes for storage.
  2. Faster Inference
    Lower precision computations (integer) are inherently faster than higher precision (float) computations. Therefore, by using quantized weights, the mathematical operations during inference are expedited. Also, many modern CPUs and GPUs have specialized instructions for lower-precision computations, which can be leveraged when quantizing the model. This hardware acceleration can significantly boost the speed of inference.
  3. Reduced Energy Consumption
    Many modern hardware accelerators are optimized for lower-precision computations. These accelerators can perform more operations per watt of energy when the model is quantized.

Latest SOTA Quantization Methods

LLM.int8() (Aug 2022)

It involves converting the weights from FP16 to INT8, effectively halving the size of the LLM. The method claims to efficiently reduce the size of LLMs up to 175B parameters without performance degradation.

Before going to the details of the paper [1], it’s important to understand that LLMs have emergent features — patterns that arise from the training data and are crucial for the model’s performance. Some of these features can have large magnitudes and can exert a strong influence over the model’s overall performance.

Source: Paper

Steps involved:

  1. The LLM.int8() method starts with vector-wise quantization. This means that each vector (a row in the matrix) is quantized separately, using its own normalization constant. The relative significance of each feature is thus preserved.
  2. For each vector, a normalization constant is calculated that is used to scale the vectors so that they can be represented as 8-bit integers. By using the normalization constants, most of the features in the LLM are quantized.
  3. For emergent outliers — features with unusually large magnitudes — a mixed-precision decomposition scheme is used. This isolates these outlier features into a separate 16-bit matrix multiplication, ensuring they are handled accurately while still allowing more than 99.9% of the values to be multiplied in 8-bit.
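To make these steps concrete, here is a rough NumPy sketch of the idea on a toy matrix multiplication. The outlier threshold and the exact scaling scheme are simplified compared to the actual bitsandbytes implementation.

import numpy as np

def int8_matmul_with_outliers(X, W, threshold=6.0):
    """Toy version of LLM.int8(): 8-bit matmul with an FP16 outlier decomposition.
    X: (n_tokens, d_in) activations, W: (d_in, d_out) weights."""
    # 1. Find outlier feature dimensions (columns of X with unusually large values)
    outlier = np.abs(X).max(axis=0) > threshold
    regular = ~outlier

    # 2. Vector-wise absmax quantization: per-row constants for X, per-column for W
    cx = np.abs(X[:, regular]).max(axis=1, keepdims=True) / 127.0
    cw = np.abs(W[regular]).max(axis=0, keepdims=True) / 127.0
    Xq = np.round(X[:, regular] / cx).astype(np.int8)
    Wq = np.round(W[regular] / cw).astype(np.int8)

    # 3. INT8 matmul for the bulk of the features, FP16 matmul for the outlier slice
    out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (cx * cw)
    out_fp16 = X[:, outlier].astype(np.float16) @ W[outlier].astype(np.float16)
    return out_int8 + out_fp16

X = np.random.randn(8, 64)
X[:, 3] *= 10                                    # plant one outlier feature dimension
W = np.random.randn(64, 32)
print(np.abs(int8_matmul_with_outliers(X, W) - X @ W).max())   # small quantization error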

Pros
LLMs can be quantized and used immediately for inference without performance degradation.

Cons
The method focuses only on the INT8 datatype and models of up to 175B parameters (especially OPT-175B / BLOOM).

Code Implementation

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model in 8-bit with bitsandbytes (LLM.int8())
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

GPTQ (Oct 2022)

GPTQ was an early one-shot PTQ technique that enabled efficient deployment of large language models. It achieves this mainly through two features proposed in the paper [4]:

  1. Layerwise Quantization
    Quantization is performed layer by layer in the LLM. For each layer, the goal is to find quantized weights that still produce outputs as close as possible to the original layer’s outputs, i.e., with the lowest mean squared error on a small set of calibration inputs (a small sketch of this objective is shown after the figure below).
  2. Optimal Brain Quantization
    This algorithm is intended to reduce the error introduced by quantization: as each weight is quantized, the remaining, not-yet-quantized weights are adjusted to compensate for the error.
Source: Paper
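As a reference point for the layer-wise objective, the sketch below measures the mean squared error between the original and quantized layer outputs for a naive round-to-nearest baseline; GPTQ’s contribution is an efficient procedure that finds quantized weights with a much smaller error than this baseline.

import numpy as np

def layer_output_error(W, W_q, X):
    """Mean squared error between the original and quantized layer outputs.
    W, W_q: (d_out, d_in) weight matrices; X: (d_in, n_samples) calibration inputs."""
    return np.mean((W @ X - W_q @ X) ** 2)

def round_to_nearest(W, num_bits=4):
    """Naive per-row symmetric quantization, used here only as a baseline."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.round(W / scale).clip(-qmax - 1, qmax) * scale

W = np.random.randn(128, 256)                 # one layer's weights
X = np.random.randn(256, 512)                 # stand-in for calibration activations
print(layer_output_error(W, round_to_nearest(W), X))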

Pros
GPTQ allows quantization down to 2 bits, providing a range of trade-offs between model size and performance.

Cons
At very low bit-widths (2-3 bits), quantization with this method can introduce noticeable performance degradation.

Code Implementation

Install the required libraries.

pip install auto-gptq optimum transformers accelerate

Load the model and quantize it with the autogptq library.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# "c4" is used as the calibration dataset for measuring layer outputs
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=quant_config
)

QLoRA (May 2023)

Before diving into QLoRA, here is a brief introduction to LoRA. LoRA (Low-Rank Adaptation of Large Language Models) is a parameter-efficient fine-tuning method used to specialize LLMs for particular tasks. It achieves this by integrating trainable matrices based on rank decomposition into every transformer layer. Moreover, it minimizes the number of parameters that need to be trained for the targeted task, all the while maintaining the original pre-trained model weights unchanged. Read more about it here.

Source: Paper

QLoRA is an enhanced version of LoRA. Here are the highlights in this method as described in the paper [2]:

1. 4-bit NormalFloat Quantization:
The 4-bit NormalFloat (NF4) datatype is built by computing the 2ᵏ+1 quantiles (where k is the bit count) of a standard normal N(0, 1) distribution and normalizing the resulting values into the [-1, 1] interval. Neural network weights are likewise rescaled to [-1, 1] (via their absolute maximum) and then mapped to the nearest of these levels.

2. Double Quantization:
This involves quantizing the quantization constants used by the 4-bit NF quantization themselves, saving roughly 0.37 bits per parameter on average. This matters because QLoRA uses block-wise k-bit quantization, which produces one quantization constant per block.

3. Paged Optimizers:
QLoRA allocates optimizer states in NVIDIA’s unified memory so that pages can be transferred automatically between GPU and CPU. This prevents GPU out-of-memory spikes and keeps training running without interruption.
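To make the 4-bit NormalFloat idea in point 1 more concrete, here is a simplified sketch that builds such a datatype from normal quantiles. The probability offsets and the guarantee of an exact zero level are handled differently in the real bitsandbytes NF4 implementation, so treat this purely as an illustration.

import numpy as np
from scipy.stats import norm

def normal_float_levels(num_bits=4):
    """Simplified NF-k construction: quantile-based levels rescaled to [-1, 1].
    The real bitsandbytes NF4 also guarantees an exact zero level."""
    k = 2 ** num_bits
    # 2^k + 1 evenly spaced probabilities, trimmed away from 0 and 1 where the
    # quantiles are infinite (the exact trimming offset is a simplification here)
    p = np.linspace(0.005, 0.995, k + 1)
    quantile_edges = norm.ppf(p)                              # quantiles of N(0, 1)
    levels = (quantile_edges[:-1] + quantile_edges[1:]) / 2   # one value per bin
    return levels / np.abs(levels).max()                      # normalize into [-1, 1]

levels = normal_float_levels()
print(levels.round(3))         # 16 levels, denser near 0 where weights concentrate

# Quantizing a block of weights: absmax-normalize, then snap to the nearest level
w = np.random.randn(6)
w_scaled = w / np.abs(w).max()
codes = np.abs(w_scaled[:, None] - levels[None, :]).argmin(axis=1)
print(codes)                   # 4-bit indices into the codebook of levels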

Pros
QLoRA’s lower GPU memory usage allows longer maximum sequence lengths and larger batch sizes.

Cons
Tuning can be slower than full-precision LoRA, since the extra quantize and dequantize steps add compute overhead, but the memory savings usually outweigh this cost.

Code Implementation

Install the required libraries

pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
pip install -q datasets bitsandbytes

Load the model and tokenizer. Configure the LoRA parameters.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False

from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    task_type="CAUSAL_LM",
)

Set up the trainer using SFTTrainer from the TRL library, which wraps the transformers Trainer to make it easy to fine-tune models on instruction-style datasets with PEFT adapters. You will also need a training dataset; below, the Guanaco instruction dataset is loaded purely as an example.

from datasets import load_dataset
from transformers import TrainingArguments

# Any instruction dataset with a "text" column works; this one is just an example.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

output_dir = "./models"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 100
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

trainer.train()

AWQ (Jun 2023)

AWQ (Activation-aware Weight Quantization) is a post-training quantization method. Rather than looking only at weight values, it uses the model’s activations to decide which weights matter most. Quoting the paper [3]:

Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights.

Source: Paper
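The core trick can be sketched in a few lines of NumPy: channels with large average activations get their weights scaled up before quantization, and the inverse scaling is folded into the activations, so the product is mathematically unchanged while the salient weights are quantized more precisely. The scaling exponent and grouping below are simplified compared to the real AWQ search.

import numpy as np

def rtn_quantize(W, num_bits=4):
    """Plain round-to-nearest quantization with one scale per output channel."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.round(W / scale).clip(-qmax - 1, qmax) * scale

def awq_style_matmul(W, X, alpha=0.5, num_bits=4):
    """Toy activation-aware scaling. W: (d_out, d_in), X: (d_in, n_samples)."""
    act_mag = np.abs(X).mean(axis=1)              # average magnitude per input channel
    s = (act_mag + 1e-8) ** alpha                 # salient channels get larger scales
    s = s / s.mean()
    W_q = rtn_quantize(W * s[None, :], num_bits)  # scale up salient columns, then quantize
    return W_q @ (X / s[:, None])                 # fold the inverse scale into the activations

W = np.random.randn(64, 128)
X = np.random.randn(128, 32)
X[5] *= 20                                        # make one input channel highly salient
err_plain = np.abs(rtn_quantize(W) @ X - W @ X).mean()
err_awq = np.abs(awq_style_matmul(W, X) - W @ X).mean()
print(err_plain, err_awq)                         # the scaled version typically has lower error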

Pros
AWQ tends to be more accurate than comparable methods because the weights most critical to LLM performance are preserved. It is also efficient and fast, since it involves no backpropagation or reconstruction, and it performs well on edge devices.

Cons
While maintaining 0.1% of weights in FP16 can enhance the performance of quantization without significantly increasing the model size, this mixed-precision data type complicates system implementation.

Code Implementation

Install required libraries.

pip install autoawq transformers accelerate

Load the model and quantize it with the autoawq library.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = 'meta-llama/Llama-2-7b-hf'
quant_path = 'Llama2-7b-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

QuIP# (Jul 2023)

In simple terms, QuIP (Quantization with Incoherence Processing) is based on the idea that quantization works better when the model’s weights are evenly distributed (incoherent) and when the directions that matter for rounding are not aligned with the coordinate axes. It consists of two steps:

  1. LDLQ Adaptive rounding procedure: Adjust the weights of the model in a way that minimizes a certain measure of error (the ‘quadratic proxy objective’) [8].
  2. Pre- and post-processing: Multiply the weight and Hessian matrices by random orthogonal matrices. This ensures that the weights and Hessians are incoherent, which is beneficial for the quantization process.
Source: Paper

QuIP# [5] advances on QuIP using some improvements in processing.

  1. Improved Incoherence Processing: It uses a faster and better method called the randomized Hadamard transform.
  2. Vector Quantization: QuIP# uses vector quantization to leverage the ball-shaped sub-Gaussian distribution that incoherent weights possess. Specifically, it introduces a set of hardware-efficient codebooks based on the highly symmetric E8 lattice. The E8 lattice achieves the optimal 8-dimension unit ball packing, which means it can represent the weights more efficiently.
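A toy NumPy sketch of incoherence processing is shown below. Random orthogonal rotations (a stand-in for the randomized Hadamard transform and the lattice codebooks that QuIP# actually uses) spread any extreme weight values evenly across the matrix before rounding, and are undone afterwards.

import numpy as np

def random_orthogonal(n, rng):
    # Random orthogonal matrix via QR; QuIP# uses a randomized Hadamard transform instead
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
W[0, 0] = 25.0                                   # plant one extreme weight
U, V = random_orthogonal(256, rng), random_orthogonal(256, rng)

W_rot = U @ W @ V.T                              # incoherence processing
print(np.abs(W).max(), np.abs(W_rot).max())      # the outlier is spread out after rotation

# Quantize the rotated weights (naive 2-bit round-to-nearest here), then rotate back
s = np.abs(W_rot).max() / 2                      # step size for 2-bit levels {-2,-1,0,1}
q = np.clip(np.round(W_rot / s), -2, 1)
W_hat = U.T @ (q * s) @ V                        # dequantize and undo the rotations
print(np.mean((W - W_hat) ** 2))                 # reconstruction error of the round trip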

Pros
Compared to other methods, QuIP# offers significantly higher throughput (>40%) at the same or better quantization quality. That is not bad for a 2-bit quantization.

Cons
Few limitations are explicitly mentioned, but implementation complexity and hardware compatibility are practical concerns.

Code Implementation

Clone the official repo and install the required libraries.

git clone https://github.com/Cornell-RelaxML/quip-sharp.git
pip install -r requirements.txt
cd quiptools && python setup.py install && cd ../

The repo contains quantization scripts for various models; for Llama models, run quantize_finetune_llama.py.

Also check out the companion repo for QuIP quantization; the code for quantizing models is shown below.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from quantizer import QuipQuantizer

model_name = "meta-llama/Llama-2-70b-hf"
quant_dir = "llama-70b_2bit_quip"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

quant = QuipQuantizer(codebook="E8P12", dataset="redpajama")
quant.quantize_model(model, tokenizer, quant_dir)

GGUF (Aug 2023)

GGUF (GPT-Generated Unified Format) was a highly anticipated release from Georgi Gerganov and the llama.cpp team. Its main highlight is that LLMs can now run easily on consumer CPUs. The format was earlier called GGML and was later superseded by GGUF.
A notable capability is that, even while the LLM runs on the CPU, selected layers can be offloaded to a GPU if one is available. This addresses the common problem of insufficient VRAM that many developers face.

Pros
If you plan to run LLMs on a CPU or on Apple devices (the M-series chips), GGUF is the go-to format for many LLMs such as Llama and Mistral, and it is now well supported by llama.cpp and Hugging Face. Its k-quant variants (e.g., Q4_K_M) also tend to preserve quality well for their size.

Cons
GGUF focuses on CPU and Apple Silicon inference, which can be a limitation if your workflow is GPU-centric.

Code Implementation

Install the ctransformers library.

pip install ctransformers[cuda]

Prequantized GGUF models are available in TheBloke’s repositories on Hugging Face.

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", use_fast=True
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
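
If you prefer llama.cpp’s own Python bindings, the same GGUF file can also be loaded with the llama-cpp-python package once it has been downloaded locally; the model path and parameter values below are placeholders you would adjust to your setup.

from llama_cpp import Llama

# Path to a GGUF file downloaded locally, e.g. the zephyr Q4_K_M file from above
llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",
    n_gpu_layers=50,   # layers to offload to the GPU; set to 0 for pure CPU inference
    n_ctx=2048,        # context window
)
output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])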


HQQ (Nov 2023)

According to the HQQ authors [6], weight quantization approaches fall into data-free techniques (such as bitsandbytes) and calibration-based techniques (such as GPTQ and AWQ). Data-free methods are faster, while calibration-based methods suffer from calibration-data bias and longer quantization times.

HQQ (Half-Quadratic Quantization) carries out quantization on the fly using a fast and robust optimization procedure. It eliminates the need for calibration data and is versatile enough to quantize any given model, achieving the speed of calibration-free methods while avoiding the data-bias issues of calibration-based ones. Thanks to optimization techniques such as half-quadratic splitting, quantization time drops to just a few minutes even for large models. For more details on the math behind the method, see the official blog post [6].

Pros
HQQ achieves surprisingly low quantization times compared to other methods (up to 50x faster than GPTQ!), and eliminating the calibration data requirement makes it easier to use.

Cons
Few limitations are reported in the literature; like other methods, it may still show some quality degradation at very low bit-widths.

Code Implementation

Install the transformers and hqq libraries (pip install hqq) and use the built-in HQQ integration straight away.

import torch
from transformers import AutoModelForCausalLM, HqqConfig

# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64, quant_zero=False, quant_scale=False, axis=1)

model_id = "meta-llama/Llama-2-7b-hf"

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)

AQLM (Feb 2024)

AQLM (Additive Quantization of Language Models) is a weight-only PTQ method that sets a new benchmark in the 2-bit-per-parameter range. It outperforms popular algorithms like GPTQ as well as QuIP and QuIP#.

It applies a new method called Multi-Codebook Quantization (MCQ), which splits each weight vector into sub-vectors and approximates each sub-vector as a sum of codewords. Codewords are learned vectors stored in codebooks [7]. AQLM quantizes the rows of the model’s weight matrices in this way.

Source: Paper
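To illustrate what multi-codebook (additive) quantization means, here is a toy sketch that encodes one 8-dimensional group of weights as the sum of one codeword from each of two codebooks. Real AQLM learns the codebooks and codes jointly with beam search and fine-tuning, so this greedy version only shows the representation itself.

import numpy as np

def greedy_additive_encode(v, codebooks):
    """Approximate vector v as a sum of one codeword per codebook (greedy residual fit)."""
    approx = np.zeros_like(v)
    codes = []
    for C in codebooks:                                       # C: (num_codewords, dim)
        residual = v - approx
        idx = np.argmin(((C - residual) ** 2).sum(axis=1))    # nearest codeword to residual
        codes.append(idx)
        approx = approx + C[idx]
    return codes, approx

rng = np.random.default_rng(0)
dim, M, K = 8, 2, 256                        # 8-dim groups, 2 codebooks of 256 codewords
codebooks = [rng.standard_normal((K, dim)) * 0.7 for _ in range(M)]
v = rng.standard_normal(dim)
codes, approx = greedy_additive_encode(v, codebooks)
print(codes)                                 # two 8-bit codes for 8 weights -> 2 bits per weight
print(np.linalg.norm(v - approx) / np.linalg.norm(v))   # relative approximation error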

Pros
AQLM offers rapid implementations for token generation on both GPU and CPU, allowing it to surpass the speed of optimized FP16 implementations, all while operating within a significantly reduced memory footprint.

Cons
Only a few limitations are reported in the literature; like other methods, it may still show some quality degradation.

Code Implementation

Instructions for quantizing models yourself, along with the corresponding code, can be found in the official repo. To run an already-quantized AQLM model, install the aqlm package (pip install aqlm[gpu,cpu]) and load the model with transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")

Conclusion

Quantization methods have opened up a world of possibilities, enabling advanced language processing capabilities even in our pockets. In this article, we covered the fundamentals of LLM quantization, explored several state-of-the-art methods in detail, weighed the pros and cons of each approach, and saw how to use them. Along the way, we also picked up pointers on selecting the most suitable approach for specific requirements, whether you are running on a CPU or a GPU.

If you liked the content, make sure to follow me on Medium. Feel free to connect on LinkedIn. Stay tuned for more content on LLMs! 💖

References

[1] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339

[2] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314

[3] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W., Wang, W., Xiao, G., Dang, X., Gan, C., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978

[4] Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323

[5] Tseng, A., Chee, J., Sun, Q., Kuleshov, V., & De Sa, C. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. arXiv:2402.04396

[6] Badri, H., & Shaji, A. (2023). Half-Quadratic Quantization of Large Machine Learning Models. https://mobiusml.github.io/hqq_blog/

[7] Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv:2401.06118

[8] Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., & Blankevoort, T. (2020). Up or Down? Adaptive Rounding for Post-Training Quantization. arXiv:2004.10568

[9] Hugging Face (2023). Transformers Documentation.

[10] OpenGenus IQ (2024). Basics of Quantization in Machine Learning (ML) for Beginners.

[11] Abonia Sojasingarayar (2023). LLM Series — Quantization Overview. Medium.

Images

If not otherwise stated, all images are created by the author.