Microsoft Introduces Orca 2: Compact Language Models Surpassing Larger Rivals

Despite the power struggle and mass resignation threats at OpenAI, Microsoft’s AI efforts are pressing ahead. Today, Microsoft Research, the research arm of the Satya Nadella-led company, introduced Orca 2: a pair of small language models that match or even outperform models five to ten times their size, such as Meta’s Llama-2-Chat-70B, particularly on complex reasoning tasks evaluated in zero-shot settings, where no prior examples are given.

The Orca 2 models come in two versions: one with 7 billion parameters and the other with 13 billion. These build on the previous 13B Orca model, which showed strong reasoning skills by mimicking the step-by-step reasoning of larger models.

According to a joint blog post from Microsoft researchers, Orca 2 demonstrates how improved training techniques can give smaller language models reasoning abilities that are usually found only in much larger models. Microsoft has open-sourced both models to foster further research into small models that perform as effectively as their larger counterparts. This could benefit enterprises, especially those with limited resources, by offering advanced AI capabilities without heavy investment in computing infrastructure.
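For teams that want to experiment with the open-sourced checkpoints, a minimal loading sketch using the Hugging Face transformers library is shown below. It assumes the weights are published under the repository name microsoft/Orca-2-13b and that the model expects a ChatML-style prompt, as described on the model card at release; treat both as assumptions rather than official usage guidance.

```python
# Minimal sketch: loading an Orca 2 checkpoint with Hugging Face transformers.
# Assumes the weights are published as "microsoft/Orca-2-13b" (swap in the 7B
# repo or a local path as needed) and that a GPU with enough memory is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Orca-2-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# ChatML-style prompt format, per the model card at release (assumption).
system = "You are Orca, an AI assistant that gives helpful, step-by-step answers."
user = "A train travels 60 miles in 90 minutes. What is its average speed in mph?"
prompt = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Print only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Swapping in microsoft/Orca-2-7b gives the smaller variant; both need enough GPU memory to hold the weights in half precision.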

Enhancing reasoning in small models

While large language models like GPT-4 impress with their reasoning and ability to tackle complex questions, smaller models have struggled in this area. Microsoft Research aimed to close this gap by fine-tuning Llama 2 base models using a highly specialized synthetic dataset. Instead of having smaller models simply imitate the behavior of larger models (a method known as imitation learning), the researchers trained them to use different strategies for different tasks. They recognized that what works for a large model might not be effective for a smaller one. For instance, GPT-4 might answer complex questions directly, whereas a smaller model might need to break the task into smaller steps.

In the Orca 2 project, the focus was on teaching the model a range of reasoning techniques, such as step-by-step reasoning and the “recall-then-generate” method, as well as when to apply each of them. The training data came from a more capable teacher model, so that the student learned both the strategies themselves and the judgment of when to use each one; a rough sketch of how such training data might be assembled appears below.
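Purely as an illustration of the idea described above, the following sketch shows how strategy-specific teacher demonstrations could be converted into student fine-tuning records: the detailed system instruction that told the teacher which reasoning strategy to use is swapped for a generic one before training, so the student has to internalize the strategy rather than be handed it at inference time. The record format, file name, and helper function here are hypothetical and not taken from Microsoft’s actual pipeline.

```python
# Illustrative sketch (not Microsoft's actual pipeline): turning teacher
# demonstrations into student fine-tuning records. The strategy-specific
# instruction the teacher saw is replaced with a generic system prompt,
# so the student must learn the strategy itself, not just the answers.
import json

GENERIC_SYSTEM = "You are a helpful assistant. Answer the question carefully."

# Hypothetical teacher outputs: each record pairs a question with the strategy
# prompt the teacher was given and the full response it produced.
teacher_records = [
    {
        "strategy_prompt": "Solve the problem step by step, showing all work.",
        "question": "If a pen costs $2 and a notebook costs 3 times as much, "
                    "what do 2 pens and 1 notebook cost?",
        "teacher_answer": "A notebook costs 3 * $2 = $6. Two pens cost $4. "
                          "Total: $4 + $6 = $10.",
    },
    {
        "strategy_prompt": "First recall the relevant facts, then generate the answer.",
        "question": "Which planet in our solar system has the most known moons?",
        "teacher_answer": "Recall: Saturn has more confirmed moons than Jupiter. "
                          "Answer: Saturn.",
    },
]

def to_student_example(record: dict) -> dict:
    """Drop the strategy-specific instruction so the student only ever sees
    the generic system prompt plus the question and the teacher's reasoning."""
    return {
        "system": GENERIC_SYSTEM,  # the strategy prompt is deliberately omitted
        "user": record["question"],
        "assistant": record["teacher_answer"],
    }

with open("orca2_style_finetune.jsonl", "w") as f:
    for record in teacher_records:
        f.write(json.dumps(to_student_example(record)) + "\n")
```

The article later notes that the synthetic post-training data was carefully filtered; any such filtering step is omitted from this sketch for brevity.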

Orca 2’s impressive performance

On 15 diverse benchmarks covering areas such as language understanding, common-sense reasoning, multi-step reasoning, math problem solving, reading comprehension, summarization, and truthfulness, Orca 2 largely matched or outperformed models that are significantly larger.

The benchmarks show that both the 7B and 13B Orca 2 models outperformed Llama-2-Chat-13B and -70B, as well as WizardLM-13B and -70B. However, on the GSM8K benchmark, which consists of 8.5K high-quality grade-school math problems, WizardLM-70B performed better than both the Orca and Llama models.

While this is promising for businesses that need cost-effective, high-performing models, it’s important to note that the Orca 2 models can still inherit the limitations common to other language models, as well as those of the base model they were fine-tuned from. Microsoft pointed out that the technique used to create the Orca 2 models can be applied to other base models as well.

Despite these limitations, Orca 2 shows significant potential for future improvements in the reasoning, specialization, control, and safety of smaller models, with carefully filtered synthetic data for post-training serving as a key strategy in these enhancements. As larger models continue to advance, Orca 2 marks an important step in broadening the diversity and applications of smaller language models.

Expect more small, high-performing models

The release of the Orca 2 models and ongoing research in the space suggest that more capable small language models will emerge soon. Recently, the Chinese unicorn 01.AI, founded by AI expert Kai-Fu Lee, released a 34-billion-parameter model that supports both Chinese and English and outperforms the 70-billion-parameter Llama 2 and 180-billion-parameter Falcon models. The startup also offers a smaller, 6-billion-parameter model that performs well on widely used AI/ML benchmarks.

Similarly, Mistral AI, a young Paris-based startup, offers a 7-billion-parameter model that outperforms larger models, including Meta’s Llama 2 13B.