AutoRound: Accurate Low-bit Quantization for LLMs

Between quantization-aware training and post-training quantization


There are many quantization methods to reduce the size of large language models (LLMs). Recently, more accurate low-bit quantization methods have been proposed. For instance, AQLM achieves 2-bit quantization while preserving most of the model’s accuracy.