AI startup Gradient recently teamed up with cloud computing platform Crusoe to extend the context window of Llama 3 models to 1 million tokens. The context window determines how many input and output tokens a large language model (LLM) can handle at once. Major tech companies and frontier AI labs are racing to push the boundaries of these windows: in just a few months, models have gone from handling a few thousand tokens to more than a million. However, LLMs with very large context windows have mostly been proprietary, such as Anthropic's Claude (200k tokens), OpenAI's GPT-4 (128k tokens), and Google's Gemini (1 million tokens).
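For a concrete sense of what a token budget means in practice, here is a minimal sketch that counts tokens with the Hugging Face transformers tokenizer for Llama 3 and checks whether a prompt fits a given window. The checkpoint name is real but gated on Hugging Face, and the 512-token output allowance is an arbitrary assumption.

```python
from transformers import AutoTokenizer

# Llama 3's tokenizer; the checkpoint is gated and requires access approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

CONTEXT_WINDOW = 8_192  # Llama 3's original context length in tokens

def fits_in_context(prompt: str, max_new_tokens: int = 512) -> bool:
    """Return True if the prompt plus the expected output fits the window."""
    n_input = len(tokenizer.encode(prompt))
    return n_input + max_new_tokens <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following design document: ..."))
```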
Creating open-source models with extended context windows could revolutionize the LLM market, unlocking applications that proprietary models cannot yet serve.
The Need for Open-Source Long-Context LLMs
Gradient works with businesses that want to integrate LLMs into their operations. Even before Llama 3's release, Gradient was running into context limitations in its customer projects. For example, language models that assist with coding tasks, often called "coding copilots," have become critical tools in many companies. These copilots typically generate small chunks of code, such as functions, but there is growing interest in expanding their capabilities to produce entire code modules.
To achieve this, the language model must be able to reference an entire codebase or multiple GitHub repositories. Feeding the codebase to the LLM in small segments is inefficient and error-prone because the model never sees the whole codebase at once. Placing the entire codebase in the model's context sidesteps these problems, letting the model reason over all of the available data and produce more accurate, efficient outputs.
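The article doesn't describe how a codebase is actually packed into a prompt. As a rough illustration of the idea, the sketch below walks a repository and concatenates source files into one context, with an invented characters-per-token heuristic standing in for a real tokenizer.

```python
from pathlib import Path

MAX_CONTEXT_TOKENS = 1_000_000  # illustrative 1M-token budget
CHARS_PER_TOKEN = 4             # rough heuristic; use a real tokenizer in practice

def pack_codebase(repo_root: str, extensions=(".py", ".js", ".go")) -> str:
    """Concatenate source files into a single prompt, stopping at the budget."""
    parts, used = [], 0
    for path in sorted(Path(repo_root).rglob("*")):
        if not path.is_file() or path.suffix not in extensions:
            continue
        text = path.read_text(errors="ignore")
        cost = len(text) // CHARS_PER_TOKEN
        if used + cost > MAX_CONTEXT_TOKENS:
            break
        parts.append(f"### File: {path}\n{text}")
        used += cost
    return "\n\n".join(parts)

prompt = pack_codebase("./my-repo") + "\n\nGenerate a module that ..."
```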
Companies often face restrictions on the data they can share with third parties, preventing them from using models like Gemini or Claude. This challenge led the Gradient team to develop their own million-token open model.
Open Research Contributions
The commercialization of large language models has made AI labs less willing to share their findings, but the open research community continues to publish insights that advance the field. Gradient drew on numerous papers and open research projects from universities and institutes worldwide. It based its models on the 8-billion- and 70-billion-parameter versions of Meta's Llama 3, which originally had a context window of 8,192 tokens.
Using techniques on distributed attention from Berkeley AI Research (BAIR), the team increased the context length without incurring prohibitive memory and computation costs. It also relied on an open-source project from a Singapore research institute for the initial code implementation, and on mathematical formulas from a Shanghai AI research lab. Nvidia's benchmarks helped the team evaluate its models' performance against other long-context LLMs such as Gemini.
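The article doesn't spell out those formulas. One widely published recipe for this kind of context extension, offered here as an illustration rather than Gradient's confirmed method, is raising the base frequency of the rotary position embeddings (RoPE) so that positions far beyond the original window stay distinguishable. A minimal NumPy sketch of that adjustment:

```python
import numpy as np

def rope_frequencies(dim: int, base: float) -> np.ndarray:
    """Per-dimension rotation frequencies for rotary position embeddings."""
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def extended_frequencies(dim: int, base: float, scale: float) -> np.ndarray:
    """Raise the RoPE base so the same dimensions cover scale-times longer text."""
    new_base = base * scale ** (dim / (dim - 2))  # NTK-aware adjustment
    return rope_frequencies(dim, new_base)

# Stretching an 8K window toward 1M tokens is a scale factor of roughly 128.
# The head dimension and base below match Llama 3, but treat the numbers
# as illustrative, not Gradient's actual configuration.
freqs = extended_frequencies(dim=128, base=500_000.0, scale=128.0)
```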
Overcoming Compute Challenges
One significant challenge in LLM research is access to compute resources. Gradient collaborated with Crusoe, which is building a purpose-built AI cloud to help partners develop and explore models cost-effectively. Crusoe's new Nvidia L40S cluster played a crucial role in the collaboration: while these chips are typically used for inference, Crusoe demonstrated that they can handle large-scale training as well.
Big tech companies are vying for high-end GPUs such as the A100, H100, and the forthcoming B100, each of which costs tens of thousands of dollars, with server clusters running into the millions of dollars. Crusoe offers high-end GPUs, including AMD's MI300X and various Nvidia models, and worked with Gradient to customize the L40S cluster, significantly reducing the cost of training the models.
Model Evaluation Techniques
Evaluating long context windows involves tests such as "needle in a haystack," in which a specific piece of information is hidden within a long sequence of text and the model is queried about it. Gradient's models perform near-perfectly on this test up to around 2 million tokens, comparable to Gemini 1.5 Pro. However, this test alone doesn't fully gauge a model's performance, so Gradient also considered harder variants such as multiple needles in a haystack and "adversarial needles."
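As an illustration of how such a test can be constructed (this is a generic harness, not Gradient's), the sketch below buries a "needle" fact at a chosen depth in filler text and builds the retrieval query:

```python
import random

FILLER = "The quick brown fox jumps over the lazy dog. "  # repeated haystack text
NEEDLE = "The secret passcode is 48915."

def build_haystack(n_chars: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    haystack = (FILLER * (n_chars // len(FILLER) + 1))[:n_chars]
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + NEEDLE + " " + haystack[pos:]

prompt = (
    build_haystack(n_chars=400_000, depth=random.random())
    + "\n\nWhat is the secret passcode?"
)
# Score the model's answer against "48915" across many lengths and depths.
```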
The models were also tested on RULER, an Nvidia benchmark designed for long-context language models that includes 13 tasks with varying sequence lengths and complexities. Gradient is also refining the models for many-shot in-context learning, which lets them adapt to new tasks when hundreds or thousands of examples are included in the prompt.
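Many-shot in-context learning simply means packing far more labeled examples into the prompt than the usual handful. A minimal sketch, with an invented sentiment task for illustration:

```python
def many_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a prompt from hundreds or thousands of input/label pairs."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{shots}\nInput: {query}\nLabel:"

# With a million-token window, thousands of examples fit in one call.
examples = [("great product", "positive"), ("arrived broken", "negative")] * 500
prompt = many_shot_prompt(examples, "works as advertised")
```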
Enterprise Applications
Long-context open models could dramatically simplify the development of LLM-based applications for companies and developers. Longer contexts enable systems to handle more information per call, reducing the need for multiple queries and complex data processing pipelines. For example, style transfer tasks could be streamlined by feeding all relevant documents directly into the model.
These models could also reduce the need for retrieval-augmented generation (RAG), in which an external retriever selects the relevant documents and inserts them into the context for every query. With a virtually unlimited context, developers could place all pertinent documents in the prompt and let the model pick out the parts that matter for each query.
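In code terms, the contrast is between a retriever that selects a top-k subset per query and a prompt that simply ships every document with every call. The sketch below is schematic, and the `retriever` object and its `search` method are hypothetical:

```python
def rag_prompt(retriever, docs: list[str], query: str, k: int = 5) -> str:
    """Classic RAG: an external retriever picks the k most relevant documents."""
    top_docs = retriever.search(docs, query, k)  # hypothetical retriever API
    return "\n\n".join(top_docs) + f"\n\nQuestion: {query}"

def long_context_prompt(docs: list[str], query: str) -> str:
    """Long-context alternative: send everything and let the model select."""
    return "\n\n".join(docs) + f"\n\nQuestion: {query}"

docs = ["<contents of contract_a.txt>", "<contents of contract_b.txt>"]
print(long_context_prompt(docs, "Which contract covers renewals?"))
```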
Long-context models also lower barriers for creating prototypes or proofs of concept, helping product teams better understand the capabilities of language models. Demonstrating possibilities with prototypes can ease the first step of figuring out what an AI solution can achieve for a business.