Matrix multiplications (MatMul) are the most computationally expensive operations in large language models (LLMs) built on the Transformer architecture. As LLMs grow larger, the cost of MatMul rises sharply, driving up memory usage and latency during both training and inference.
Researchers from the University of California, Santa Cruz; Soochow University; and the University of California, Davis have introduced a novel architecture that eliminates matrix multiplications from language models while maintaining strong performance. Their paper presents MatMul-free language models that achieve results comparable to state-of-the-art Transformers while requiring far less memory during inference.
Matrix multiplication is essential in deep learning, combining data and weights in neural networks to make predictions. GPUs, with their parallel architecture, are designed to handle many MatMul operations simultaneously, making them crucial for training and running complex neural networks efficiently. However, as LLMs scale to hundreds of billions of parameters, MatMul operations become a bottleneck, necessitating large GPU clusters during both training and inference. Replacing MatMul with simpler operations could lead to significant savings in memory and computation, but previous attempts have yielded mixed results.
In their new paper, the researchers propose replacing the traditional 16-bit floating-point weights used in Transformers with ternary weights restricted to three values: -1, 0, or +1. They also replace MatMul with additive operations that achieve equally good results at much lower computational cost. The language models are composed of “BitLinear layers” that use these ternary weights. By limiting weights to {-1, 0, +1} and applying additional quantization techniques, MatMul operations are replaced with addition and negation operations.
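To make this concrete, the sketch below shows how a weight matrix limited to {-1, 0, +1} turns a matrix-vector product into pure addition and subtraction: inputs paired with +1 weights are summed, inputs paired with -1 weights are subtracted, and zero weights are skipped. This is a minimal NumPy illustration under assumed names and shapes (the function ternary_matvec is hypothetical), not the authors' optimized implementation.

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Matrix-vector product with weights restricted to {-1, 0, +1}.

    Because every weight is -1, 0, or +1, each output element is a sum of
    selected inputs minus another sum: no multiplications are needed.
    """
    out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Illustrative check: the result matches an ordinary MatMul.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))           # ternary weight matrix
x = rng.standard_normal(8).astype(np.float32)  # input activations
assert np.allclose(ternary_matvec(W, x), W @ x, atol=1e-5)
```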
The Transformer architecture’s token mixer, which integrates information across different tokens using self-attention mechanisms, typically relies on MatMul operations. In contrast, the new MatMul-free architecture employs a MatMul-free Linear Gated Recurrent Unit (MLGRU) for the token mixer. The MLGRU processes sequences by updating hidden states through simple ternary operations, without the need for expensive matrix multiplications. The channel mixer, which integrates information across different feature channels within a single token’s representation, is implemented using a Gated Linear Unit (GLU). The researchers modified the GLU to work with ternary weights, reducing computational complexity and memory usage while maintaining effectiveness.
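The sketch below gives the flavor of an MLGRU-style recurrent step: gates and a candidate state are computed from the current token with ternary projections, and the hidden state is updated with purely element-wise operations. The gating structure, function names, and shapes are assumptions for illustration; the paper's exact equations and quantization details differ.

```python
import numpy as np

def ternary_matvec(W, x):
    # Weights in {-1, 0, +1}: the product reduces to additions and negations.
    return np.array([x[r == 1].sum() - x[r == -1].sum() for r in W])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlgru_step(x_t, h_prev, W_f, W_c, W_g):
    """One MLGRU-style recurrent step (illustrative sketch, not the paper's exact formulation)."""
    f_t = sigmoid(ternary_matvec(W_f, x_t))   # forget gate from the current token
    c_t = np.tanh(ternary_matvec(W_c, x_t))   # candidate hidden state
    h_t = f_t * h_prev + (1.0 - f_t) * c_t    # element-wise interpolation over time
    g_t = sigmoid(ternary_matvec(W_g, x_t))   # output gate
    return h_t, g_t * h_t                     # new hidden state, gated output

# Illustrative usage over a short sequence of 8-dimensional token vectors.
rng = np.random.default_rng(0)
d = 8
W_f, W_c, W_g = (rng.integers(-1, 2, size=(d, d)) for _ in range(3))
h = np.zeros(d)
for x_t in rng.standard_normal((5, d)):
    h, o = mlgru_step(x_t, h, W_f, W_c, W_g)
```

The channel mixer's GLU follows the same element-wise pattern: one ternary projection of a token's features gates another, so cross-channel mixing likewise avoids full-precision MatMul.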
The researchers evaluated their MatMul-free language models against the advanced Transformer++ architecture used in Llama-2, across multiple model sizes. They found that the MatMul-free models made more efficient use of additional compute to improve performance than Transformer++. The 2.7B MatMul-free LM outperformed its Transformer++ counterpart on benchmarks such as ARC-Challenge and OpenBookQA, while maintaining similar performance on other tasks. The study highlights that MatMul-free architectures can achieve strong zero-shot performance on a variety of language tasks, such as question answering and commonsense reasoning.
Furthermore, the MatMul-free LM demonstrated lower memory usage and latency compared to Transformer++, with the memory and latency advantages becoming more noticeable as the model size increased. For example, the 13B model used only 4.19 GB of GPU memory at a latency of 695.48 ms, compared to Transformer++’s 48.50 GB of memory at a latency of 3183.10 ms.
The researchers also developed optimized GPU and FPGA implementations of the MatMul-free language models. Their optimized GPU implementation of the ternary dense layers accelerated training by 25.6% and reduced memory consumption by up to 61.0% compared to an unoptimized baseline. This work demonstrates how scalable yet lightweight language models can reduce computational demands and energy use in practical applications.
Although they couldn’t test the MatMul-free architecture on very large models with over 100 billion parameters due to computational constraints, the researchers hope their work will encourage institutions and organizations with the necessary resources to invest in developing lightweight models. The ultimate goal is to make language models less dependent on high-end GPUs like those from Nvidia and enable powerful models to run on more affordable and accessible processors. The researchers have made their code and models available for the research community to build upon.
By prioritizing the development of MatMul-free architectures, the future of LLMs can become more accessible, efficient, and sustainable.