Researchers at ETH Zurich have developed a new technique that can significantly speed up neural networks. By changing how these networks run inference, they managed to drastically cut the amount of computation required.
In experiments using BERT, a popular model for various language-related tasks, they were able to reduce computations by more than 99%. This new technique can also be applied to other transformer models, such as those used in large language models like GPT-3, which could lead to faster and more efficient language processing.
Transformers, the foundation of large language models, consist of stacked layers, including attention layers and feedforward layers. The feedforward layers are particularly demanding because every input is multiplied against the weights of every neuron in the layer, a dense matrix multiplication. However, the researchers found that not all of these neurons need to be active during inference for every input. They introduced “fast feedforward” layers to replace the traditional ones.
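For intuition, here is a minimal PyTorch sketch of a conventional feedforward block of the kind BERT uses. The class and its name are purely illustrative; the dimensions (768 model size, 3,072 feedforward neurons) match BERT-base.

```python
import torch
import torch.nn as nn

class DenseFeedForward(nn.Module):
    """Conventional transformer feedforward block (illustrative sketch).
    Every token is multiplied against the weights of all hidden neurons."""
    def __init__(self, d_model=768, d_hidden=3072):  # BERT-base sizes
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # dense matrix multiplication
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)  # dense matrix multiplication

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```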
Fast feedforward layers use a technique called conditional matrix multiplication, which is far more efficient than the dense matrix multiplication used in conventional feedforward networks. Instead of multiplying the input by the weights of every neuron, only the neurons selected for that particular input take part in the computation, greatly reducing the computational load.
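The toy snippet below is not the authors’ code; it simply contrasts the two ideas for a single input vector. Dense multiplication touches every neuron’s weights, while conditional multiplication touches only the rows selected for that input (the selected indices here are made up, and how the selection happens is the subject of the next paragraph).

```python
import torch

d_model, n_neurons = 768, 3072
x = torch.randn(d_model)
W = torch.randn(n_neurons, d_model)      # one weight row per hidden neuron

# Dense matrix multiplication: every neuron participates for every input.
dense_out = W @ x                        # 3,072 dot products

# Conditional matrix multiplication (conceptual): a routing rule picks a
# handful of neurons for this particular input, and only their rows of W
# are ever touched. The chosen indices below are purely illustrative.
chosen = torch.tensor([5, 117, 2048])
conditional_out = W[chosen] @ x          # 3 dot products instead of 3,072
```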
To test their approach, the researchers created FastBERT, a modified version of Google’s BERT model. FastBERT replaces the standard feedforward layers with fast feedforward layers, organizing their neurons into a balanced binary tree and executing only one branch of the tree for each input.
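A rough inference-time sketch of this idea follows. It is a simplification under several assumptions (a single-token input, one routing neuron per tree node, a single-neuron “leaf” at the end of each path) and is not the researchers’ implementation, but it shows why only a handful of neurons is ever evaluated per input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastFeedForwardSketch(nn.Module):
    """Simplified fast feedforward layer at inference time: neurons sit in a
    balanced binary tree, and each input descends a single root-to-leaf path,
    so only `depth` routing neurons plus one leaf neuron are evaluated."""
    def __init__(self, d_model=768, depth=11):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** depth - 1          # internal routing neurons
        n_leaves = 2 ** depth             # leaf neurons
        self.node_w = nn.Parameter(torch.randn(n_nodes, d_model) * 0.02)
        self.leaf_in = nn.Parameter(torch.randn(n_leaves, d_model) * 0.02)
        self.leaf_out = nn.Parameter(torch.randn(n_leaves, d_model) * 0.02)

    def forward(self, x):                             # x: (d_model,), one token
        node = 0
        for _ in range(self.depth):
            score = self.node_w[node] @ x             # one neuron per tree level
            node = 2 * node + (1 if score > 0 else 2) # descend left or right
        leaf = node - (2 ** self.depth - 1)           # position among the leaves
        h = F.gelu(self.leaf_in[leaf] @ x)            # the one active branch
        return h * self.leaf_out[leaf]                # project back to d_model
```

In this sketch a token evaluates only depth + 1 neurons no matter how wide the tree is, which is where the savings come from.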
When tested on several tasks from the General Language Understanding Evaluation (GLUE) benchmark, FastBERT performed almost as well as BERT models of comparable size and training regimen. Some versions of FastBERT, trained for just one day on a single A6000 GPU, retained at least 96% of BERT’s performance while using only 0.3% of its feedforward neurons during inference.
The researchers believe that integrating fast feedforward networks into large language models like GPT-3 has huge potential. For instance, the feedforward network in each of GPT-3’s transformer layers contains 49,152 neurons. If such a replacement could be trained, it could be swapped for a fast feedforward network that uses just 16 neurons for inference, a mere 0.03% of that layer’s feedforward neurons.
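As a quick sanity check on that figure, using only the numbers quoted above:

```python
>>> neurons_per_ffn = 49_152   # GPT-3 feedforward neurons per transformer layer
>>> used_for_inference = 16    # neurons a fast feedforward network would touch
>>> round(100 * used_for_inference / neurons_per_ffn, 2)
0.03
```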
Despite significant hardware and software advancements for dense matrix multiplication—the current standard in feedforward neural networks—there isn’t yet an efficient, native implementation of conditional matrix multiplication. No popular deep learning framework currently provides an interface to implement this technique beyond high-level simulations.
The researchers wrote their own implementation of these operations and achieved a 78-fold speedup during inference. They believe that with better hardware and an optimized low-level implementation, the speedup could exceed 300-fold, which would significantly increase the number of tokens generated per second.
This research aims to address the memory and compute bottlenecks of large language models, paving the way for more efficient and powerful AI systems.