Oh, Google. Will you ever get an AI product release right on the first try?
Less than a month after Google unveiled its long-awaited ChatGPT competitor, Gemini, it faced criticism for what turned out to be staged interactions in its demo video. Now, new research shows that Gemini Pro, the most powerful version available to consumers, lags behind OpenAI’s GPT-3.5 Turbo in most tasks.
Yes, you read that correctly: Google’s newest language model, in development for months, is outperformed by OpenAI’s older, free model. Meanwhile, ChatGPT Plus and Enterprise subscribers have had access to GPT-4 and GPT-4V for most of the year.
This finding comes from a team of researchers at Carnegie Mellon University and BerriAI. Their paper, “An In-depth Look at Gemini’s Language Abilities,” was recently published on arXiv.org. It plainly states that as of December 19, 2023, Gemini Pro achieves slightly inferior accuracy compared to OpenAI’s GPT-3.5 Turbo across all tasks.
For Google’s researchers and leadership, who have put extensive hours into Gemini, these results must be disappointing. A Google spokesperson responded that the company’s own research shows Gemini Pro outperforming GPT-3.5, and that an upcoming version, Gemini Ultra, will surpass GPT-4, but those claims have yet to be borne out independently.
“In our technical paper,” the spokesperson said, “we compare Gemini Pro and Ultra against other external language models and our previous best model, PaLM 2, across various benchmarks.” They added that Gemini Ultra achieves 90.04% accuracy on the MMLU benchmark, surpassing all existing models.
However, the researchers note that benchmark evaluation can be tricky and may be affected by data contamination. They conducted an extensive analysis to ensure scientific soundness, but still encountered enough minor issues that they chose not to report results for certain benchmarks.
Despite these limitations, the researchers believe that Gemini models indicate potential for real-world tasks, particularly in education. For instance, Gemini Ultra’s competencies in reasoning and STEM could pave the way for advancements in personalized learning and intelligent tutoring systems.
The CMU and BerriAI researchers tested four language models: Google’s Gemini Pro, OpenAI’s GPT-3.5 Turbo and GPT-4 Turbo, and Mixtral 8x7B from the French startup Mistral. Querying the models through LiteLLM, a library that provides a unified interface to different providers’ APIs, they ran their tests between December 11 and 15, 2023, including multiple-choice knowledge questions spanning 57 subjects (the MMLU benchmark).
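For readers who want to try a similar comparison themselves, here is a minimal sketch (not the authors’ actual evaluation harness) of querying several models through LiteLLM’s unified completion() call. The model identifiers and the API-key environment variables are assumptions that depend on your LiteLLM version and provider configuration.

```python
# Minimal sketch, not the paper's harness: ask a few models the same
# question through LiteLLM's unified completion() interface and print
# each answer. Model identifiers and required API-key environment
# variables (e.g. OPENAI_API_KEY, GEMINI_API_KEY) are assumptions and
# may differ across LiteLLM versions and provider setups.
from litellm import completion

MODELS = ["gpt-3.5-turbo", "gpt-4-1106-preview", "gemini/gemini-pro"]  # assumed identifiers

# A toy multiple-choice question in the spirit of MMLU-style prompts.
QUESTION = (
    "Which of the following numbers is prime?\n"
    "A. 21\nB. 27\nC. 31\nD. 33\n"
    "Answer with a single letter."
)

for model in MODELS:
    response = completion(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
        temperature=0.0,  # keep decoding as deterministic as possible
    )
    answer = response.choices[0].message.content.strip()
    print(f"{model}: {answer}")
```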
In these tests, Gemini Pro achieved a lower accuracy than both GPT-3.5 Turbo and GPT-4 Turbo. Interestingly, Gemini tended to select the answer “D” more frequently, suggesting that it might not be well-tuned for multiple-choice questions.
The researchers also found that Gemini performed worse than GPT-3.5 Turbo in categories like human sexuality, formal logic, elementary math, and professional medicine. This was partly because Gemini often refused to answer due to safety and content restrictions.
However, Gemini Pro did perform better in two categories of multiple-choice questions: security and high school microeconomics, though the gains were marginal. In general-purpose reasoning tasks without multiple-choice options, Gemini Pro again lagged behind GPT-3.5 and GPT-4 Turbo.
Gemini did outperform the other models at word sorting and symbol manipulation, showing a relative strength at rearranging words and tracking the order of symbols. However, it did not excel at math, programming, or web-agent tasks.
Gemini’s standout performance came in translating content between languages, surpassing both GPT-3.5 Turbo and GPT-4 Turbo in eight out of 20 languages. Still, its content moderation system caused it to block responses in approximately 10 language pairs.
These results highlight Google’s challenges in matching OpenAI’s performance in the generative AI race. With Gemini Ultra not due until next year, Google may remain behind in AI performance for the time being.
Interestingly, Mistral’s Mixtral 8x7B, another competitor, also performed worse than OpenAI’s GPT-3.5 Turbo. Gemini Pro did outperform Mixtral on every task, however, a bright spot for Google against the leading open-source competitor.
Overall, the study reinforces that OpenAI remains the leader in generative AI for both consumers and enterprises. AI influencers, including Professor Ethan Mollick of the Wharton School, agree that GPT-4 is the best option for most individual use cases until Gemini Ultra arrives, and the paper solidifies the view that Google’s Gemini Pro is, at best, on par with OpenAI’s free GPT-3.5.