Galileo Hallucination Index Highlights GPT-4 as the Top LLM Across Various Applications

A new hallucination index, developed by the research arm of San Francisco-based Galileo, shows that OpenAI’s GPT-4 model is the best performer and has the least tendency to hallucinate across multiple tasks.

Published today, the index analyzed nearly a dozen open and closed-source large language models (LLMs), including Meta’s Llama series, to assess their performance across different tasks and identify which models are least likely to exhibit hallucinations.

The results revealed that while each LLM behaved differently depending on the task, OpenAI’s models consistently outperformed the others. This finding is significant for enterprises looking to overcome the challenge of hallucinations, which has hindered the deployment of LLMs in critical sectors like healthcare.

Tracking LLM hallucinations is challenging. Although there is significant interest from enterprises in using generative AI, they often encounter performance issues where LLM responses are not completely accurate. These inaccuracies arise because LLMs generate text by predicting statistically likely sequences of related terms and concepts, with no built-in check for factual correctness.

Atindriyo Sanyal, co-founder and CTO of Galileo, explained that the deployment of generative AI products involves many variables. For instance, a general-purpose tool that creates stories based on simple prompts differs greatly from an enterprise chatbot that helps customers with proprietary product information.

To tackle this, the team chose eleven popular LLMs of various sizes and tested their likelihood of hallucinating on three common tasks: question answering without retrieval-augmented generation (RAG), question answering with RAG, and long-form text generation. They used seven highly regarded datasets to rigorously test each LLM’s capabilities.
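To make those three task setups concrete, here is a minimal sketch of how such an evaluation harness might frame its prompts. Everything in it is illustrative rather than Galileo’s actual tooling: llm_generate is a hypothetical placeholder for whichever model API is under test, and the prompt templates are assumptions.

```python
# Illustrative sketch of the three task setups described above. Nothing here is
# Galileo's actual harness: llm_generate() is a hypothetical placeholder for the
# model API under test, and the prompt templates are assumptions for illustration.

def llm_generate(prompt: str) -> str:
    """Hypothetical model call; wire this up to the API of the LLM being evaluated."""
    raise NotImplementedError

def qa_without_rag(question: str) -> str:
    # Closed-book Q&A: the model must answer from its own parametric knowledge.
    return llm_generate(f"Answer the following question:\n{question}")

def qa_with_rag(question: str, retrieved_passages: list[str]) -> str:
    # Open-book Q&A: retrieved documents are placed in the prompt, and the model
    # is expected to ground its answer in them.
    context = "\n\n".join(retrieved_passages)
    return llm_generate(
        "Using only the context below, answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def long_form(topic: str) -> str:
    # Long-form text generation, e.g. a multi-paragraph report.
    return llm_generate(f"Write a detailed, factually accurate report about: {topic}")
```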

The Galileo team reduced the dataset sizes and annotated them to establish ground truth for accuracy. They then evaluated each model’s performance using Galileo’s proprietary Correctness and Context Adherence metrics. Correctness focused on logical and reasoning-based mistakes, while Context Adherence assessed how well an LLM reasoned within provided documents and contexts.
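Galileo’s Correctness and Context Adherence metrics are proprietary, so the sketch below substitutes a simple token-overlap F1 (a SQuAD-style stand-in) purely to illustrate the shape of scoring model outputs against annotated ground truth; the helper names and input format are hypothetical.

```python
# Rough stand-in only: Galileo's Correctness and Context Adherence metrics are
# proprietary, so this sketch scores answers with token-overlap F1 against the
# annotated ground truth, simply to show the shape of the evaluation loop.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model answer and the annotated reference."""
    pred = prediction.lower().split()
    gold = reference.lower().split()
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def average_score(model_outputs: dict[str, str], annotations: dict[str, str]) -> float:
    """Mean score over an annotated dataset, keyed by example id (hypothetical format)."""
    scores = [token_f1(model_outputs[eid], gold) for eid, gold in annotations.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

A production evaluation would swap token_f1 for a model-based judge or the vendor’s own metrics; the point is only that each model’s answers are compared to human-annotated references and averaged per task.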

When it comes to question answering without retrieval, OpenAI’s GPT models led the pack. The GPT-4-0613 model earned a correctness score of 0.77, followed by GPT-3.5-Turbo-1106, -Instruct, and -0613 with scores of 0.74, 0.70, and 0.70, respectively. Meta’s Llama-2-70b was the closest competitor with a score of 0.65. Other models, like Llama-2-7b-chat and MosaicML’s MPT-7b-instruct, fell behind.

For retrieval tasks, where the model draws on information supplied from a dataset, GPT-4-0613 again topped the chart with a context adherence score of 0.76. The GPT-3.5-Turbo variants also performed well, with scores close to GPT-4’s. Hugging Face’s Zephyr-7b scored 0.71, outperforming Meta’s Llama-2-70b, which scored 0.68. The lowest scorers in this category were UAE-based TII’s Falcon-40b and MosaicML’s MPT-7b.

In long-form text generation, GPT-4-0613 and Llama-2-70b scored 0.83 and 0.82 for correctness, showing the least hallucination. GPT-3.5-Turbo-1106 matched Llama-2-70b’s performance, while the 0613 variant scored 0.81. The MPT-7b model lagged with a score of 0.53.

While OpenAI’s GPT-4 excels across tasks, its API-based pricing can be costly. Galileo suggests using GPT-3.5-Turbo models for cost-effective performance, and in text generation scenarios, open-source models like Llama-2-70b can be a good alternative.

The index will continue to evolve as new models emerge and existing ones improve. Galileo plans to update it quarterly to provide accurate rankings of the least to most hallucination-prone models for various tasks.

Sanyal stated that the index aims to help teams address hallucinations and kick-start their generative AI efforts, offering metrics and evaluation methods that enable quicker and more effective assessment of LLMs.