Revolutionary Transformer Design Enhances Speed and Efficiency of Language Models
Large language models like ChatGPT and Llama-2 are known for their high memory and computational requirements, making them expensive to operate. Reducing even a small part of their size can result in significant cost savings.
To tackle this, researchers at ETH Zurich introduced an updated version of the transformer, the deep learning architecture that powers language models. This new design cuts down the transformer’s size while maintaining accuracy and boosting inference speed, offering a more efficient alternative for language models.
Language models depend on transformer blocks, standardized units that excel in interpreting sequential data like text. In each classic transformer block, there are two essential components: the “attention mechanism” and the multi-layer perceptron (MLP). The attention mechanism selectively focuses on different parts of the input data to understand their context and importance to one another. It helps the model grasp the relationships between words in a sentence, even if they’re far apart. Once the attention mechanism has highlighted the crucial information, the MLP refines and processes this data into a more complex representation that captures intricate relationships.
In addition to these core elements, transformer blocks feature “residual connections” and “normalization layers” to speed up learning and address common issues in deep neural networks. When these blocks are stacked to form a language model, their ability to identify complex relationships in training data improves, enabling advanced functionalities in modern language models. Despite the significant advancements these models have brought about, the basic design of the transformer block has stayed almost the same.
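To make the standard design concrete, here is a minimal sketch of such a block in PyTorch, assuming a common pre-norm layout; the class and parameter names (TransformerBlock, d_model, d_ff) are illustrative choices, not code from any particular model.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-norm transformer block: attention and an MLP,
    each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-block with a residual (skip) connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MLP sub-block with a residual (skip) connection
        x = x + self.mlp(self.norm2(x))
        return x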
Addressing the high cost of training and operating large transformer models, the ETH Zurich researchers sought efficiency gains by simplifying the transformer block. By removing non-essential components, they not only reduced the parameter count but also increased model throughput.
Their tests showed that slimming down the transformer block compromised neither training speed nor performance on downstream tasks. Traditional transformer models have multiple attention heads, each with query (Q), key (K), and value (V) parameters that map interactions among input tokens. The researchers found that they could remove the value parameters and the subsequent projection layer without losing effectiveness. They also discarded the skip (residual) connections, which typically help avoid vanishing gradients in deep learning models.
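To illustrate what removing the value parameters and the projection layer amounts to, the sketch below treats both as identity in a single attention head, leaving only the query and key projections as learned weights. This is a simplified, hypothetical rendering of the idea, not the researchers' actual implementation.

import math
import torch
import torch.nn as nn

class SimplifiedAttention(nn.Module):
    # Single attention head in which the value (V) projection and the
    # output projection are removed (treated as identity), so only the
    # query and key projections carry learned parameters.
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(x)
        k = self.k_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = scores.softmax(dim=-1)
        return weights @ x  # the unprojected inputs serve as the values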
The team restructured the transformer block to handle attention heads and the MLP in parallel rather than sequentially, a shift from the conventional approach. They also adjusted non-learnable parameters, tweaked the training methodology, and made architectural changes to ensure the model retained its learning abilities despite being more streamlined.
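The sketch below shows that parallel arrangement in the same illustrative PyTorch style: attention and the MLP both read a single normalized copy of the input, and their outputs are summed, rather than the MLP processing the attention sub-block's output. The researchers' actual block differs in its finer details (for example, in how normalization and initialization are handled), so treat this as a rough picture of the structure only.

import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    # Illustrative block that runs attention and the MLP side by side
    # on the same normalized input and sums their outputs.
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)  # one shared normalization for both branches
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return attn_out + self.mlp(h)  # parallel branches, summed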
When tested on language models of various sizes, the new compact transformer block delivered impressive results. It reduced the standard transformer’s size by about 16% without sacrificing accuracy and achieved faster inference times. For a large model like GPT-3 with 175 billion parameters, this new architecture could save around 50 GB of memory.
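As a rough sanity check on that figure, assuming the parameters are stored in 16-bit precision (our assumption, not a detail from the paper), a 16% reduction on a 175-billion-parameter model lands in the same ballpark as the quoted savings:

params_saved = 175e9 * 0.16           # about 28 billion parameters removed
bytes_saved = params_saved * 2        # fp16 stores 2 bytes per parameter
print(f"{bytes_saved / 1e9:.0f} GB")  # roughly 56 GB, close to the ~50 GB cited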
The researchers believe their simplified models can train faster and better utilize the capacity offered by more depth. While their approach has been successful on smaller scales, its effectiveness on larger models is yet to be verified. Future improvements, such as customizing AI processors for this streamlined architecture, could further enhance its benefits.
The researchers are optimistic that their work can lead to simpler architectures in practical use, bridging the gap between theory and practice in deep learning and lowering the cost of large transformer models.