LiveBench: An Open Benchmark for Large Language Models Utilizing Pristine Test Data and Impartial Evaluation

A group of experts from Abacus.AI, New York University, Nvidia, the University of Maryland, and the University of Southern California has created a new benchmark to tackle the “serious limitations” found in industry standards. Named LiveBench, this versatile LLM benchmark provides uncontaminated test data, addressing a common problem: when benchmark datasets circulate online, they end up in the training data of the very models they are meant to evaluate.

So, what’s a benchmark? It’s a standardized test for assessing AI model performance: a set of tasks and metrics against which LLMs are measured, giving researchers and developers a common yardstick for tracking progress in AI research.

LiveBench includes frequently updated questions from recent sources, automatically scoring answers based on objective truths. It covers a range of challenging tasks including math, coding, reasoning, language, instruction following, and data analysis.

The release of LiveBench is noteworthy due to contributions from Yann LeCun, a prominent figure in AI and Meta’s chief AI scientist, alongside others from Abacus.AI, Nvidia, and multiple universities.

Existing LLM benchmarks have their flaws, and according to Micah Goldblum from the team, “we needed better LLM benchmarks as the existing ones don’t align with our experiences.” With initial support from Abacus.AI, the project expanded into a larger collaboration with experts from NYU, Nvidia, USC, and the University of Maryland.

LiveBench was created in response to the increasing prominence of LLMs and the shortcomings of traditional benchmarks. Questions from typical benchmarks are scattered across the vast internet data used for training, so many LLMs have already seen them before evaluation. Consequently, these benchmarks fail to measure an LLM’s true abilities, rewarding memorization instead.

To address this, LiveBench introduces new questions monthly to minimize test data contamination. These questions come from up-to-date datasets, math competitions, academic papers, news articles, and IMDb movie synopses. Each question has a verifiable answer, ensuring automatic and accurate scoring without an LLM judge. Currently, LiveBench offers 960 questions, with more difficult ones added every month.
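To make that scoring approach concrete, here is a minimal sketch of grading against verifiable ground truth. The question schema, the `score_answer` helper, and the normalization rule are illustrative assumptions, not LiveBench’s actual code.

```python
# Illustrative ground-truth scoring (hypothetical format, not LiveBench's code):
# every question ships with a verifiable answer, so grading is a deterministic
# comparison rather than a judgment call by another LLM.

def score_answer(model_output: str, ground_truth: str) -> int:
    """Return 1 if the model's final answer matches the ground truth, else 0."""
    # Naive normalization for the sketch; a real harness would parse out the
    # final answer and handle equivalent formats.
    return int(model_output.strip().lower() == ground_truth.strip().lower())

# Made-up question in an assumed schema.
question = {"prompt": "What is 17 * 24?", "ground_truth": "408"}
model_output = "408"  # stand-in for a real model's response
print(score_answer(model_output, question["ground_truth"]))  # -> 1
```

Because the comparison is deterministic, the same answer always gets the same score, which is what removes the need for an LLM judge.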

Initially, LiveBench offers 18 tasks across six categories:
– Math: Questions from recent high school math competitions and tougher versions of AMPS questions.
– Coding: Code generation and a new code completion task.
– Reasoning: Advanced versions of Big-Bench Hard’s Web of Lies and positional reasoning puzzles.
– Language Comprehension: Tasks including word puzzles, typo corrections, and movie synopsis unscrambling.
– Instruction Following: Tasks for paraphrasing, simplifying, summarizing, or generating stories based on recent articles.
– Data Analysis: Tasks involving recent datasets for table reformatting, column predictions, and type annotations.

These tasks range in difficulty, aiming for top models to hit a success rate between 30 percent and 70 percent.

As of June 12, 2024, many prominent models have been evaluated with LiveBench, and the results demonstrate how demanding it is. OpenAI’s GPT-4o leads the leaderboard with a score of 53.79, followed closely by GPT-4 Turbo at 53.34 and Anthropic’s Claude 3 Opus at 51.92.

For business leaders, understanding and choosing the right AI model can be daunting. Benchmarks like LiveBench can simplify this by providing reliable performance evaluations. Goldblum adds that comparing models with LiveBench is straightforward since it eliminates issues like test data contamination and biased evaluations.

Comparing LiveBench to established benchmarks reveals that, while trends are similar, individual model scores differ due to factors like biases in LLM judgments. For example, OpenAI’s GPT-4 models perform better on benchmarks where GPT-4 itself is the evaluator, highlighting inherent biases.

LiveBench isn’t a startup but an open-source benchmark that anyone can use and contribute to. More questions and tasks will be added monthly to adapt to the evolving capabilities of LLMs. As stated by Colin White, “good benchmarks are crucial for designing effective models, and LiveBench represents a significant step forward.”

Developers can access LiveBench’s code on GitHub and its datasets on Hugging Face.
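For anyone who wants to inspect the questions directly, a sketch like the following pulls a LiveBench category from Hugging Face using the `datasets` library. The repository id `livebench/math` and the split layout are assumptions based on the project’s Hugging Face organization; check the hub pages for the exact names.

```python
# Sketch: load LiveBench questions from Hugging Face. The dataset id
# "livebench/math" is an assumption; verify the exact repository names
# and splits on the Hugging Face hub before relying on them.
from datasets import load_dataset

ds = load_dataset("livebench/math")

# Inspect the available splits and fields before building on them.
print(ds)
for split_name, split in ds.items():
    print(split_name, split.column_names)
```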
