Researchers at the University of California, Santa Cruz have made a breakthrough by creating a large language model (LLM) that runs on custom hardware while drawing just 13 watts, about as much as a modern LED light bulb. The researchers say the model is 50 times more efficient than LLMs running on traditional hardware, yet is capable of competing with established models such as Meta’s Llama.
Modern neural networks rely heavily on matrix multiplication: words are represented as numbers, stored in matrices, and those matrices are multiplied together to produce language. The process consumes a lot of energy because data has to be stored and then shuttled between GPUs or other accelerators for the multiplication to take place. The team zeroed in on this part of LLMs for their research.
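To make the cost concrete, here is a minimal sketch (not from the paper, with toy sizes chosen for illustration) of why a dense layer is expensive: every output element requires a full row of multiply-accumulate operations.

```python
# Illustrative sketch: a dense layer is a matrix multiplication between an
# activation vector and a weight matrix, so each of the 16 outputs costs
# 8 multiplications plus additions.
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(8)        # token representation (8 features, toy size)
W = rng.standard_normal((8, 16))  # dense weight matrix: 8 inputs -> 16 outputs

y = x @ W                         # 8 * 16 = 128 multiplications, plus additions
print(y.shape)                    # (16,)
```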
To move away from matrix multiplication, the researchers restricted the matrices to ternary numbers, which allows the multiplications to be replaced with simple summation. The approach builds on work by Microsoft, which showed that the method is possible, although the company didn’t open source its models. Lead author Jason Eshraghian says that “from a circuit designer standpoint, you don’t need the overhead of multiplication, which carries a whole heap of cost.”
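A rough sketch of the ternary idea follows; the names and sizes are illustrative assumptions, not details of the UCSC implementation. With weights restricted to -1, 0, and +1, each output reduces to "add the inputs where the weight is +1, subtract the ones where it is -1, skip the zeros," so no multiplications are needed.

```python
# Hedged sketch of multiplication-free evaluation with ternary weights.
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(8)             # activations
W = rng.integers(-1, 2, size=(8, 16))  # ternary weight matrix in {-1, 0, +1}

# Selective addition and subtraction instead of multiply-accumulate.
y_addonly = np.array([x[W[:, j] == 1].sum() - x[W[:, j] == -1].sum()
                      for j in range(W.shape[1])])

# Matches the ordinary matrix product, but computed without any multiplies.
assert np.allclose(y_addonly, x @ W.astype(float))
```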
The other efficiency gains come from running the LLM on custom hardware built around field-programmable gate arrays (FPGAs). The researchers believe they can squeeze out even more efficiency as they continue to optimize these technologies.
It’s always exciting seeing breakthroughs such as this one, especially as demand for AI continues to grow. Hopefully, big players in the space take a look at this LLM and glean information that can improve efficiency in the long term.