Do Large Language Model Benchmarks Test Reliability?

Joshua Vendrow*, Edward Vendrow*, Sara Beery†, and Aleksander Mądry†

Massachusetts Institute of Technology

*Equal contribution. †Equal advising.

When deploying LLMs in real-world applications, reliability is crucial: models need to consistently provide correct answers, not just perform well on average. To measure this kind of reliability, we propose platinum benchmarks, benchmarks carefully curated to minimize label errors and ambiguity so that perfect performance is possible. As a first attempt at constructing such benchmarks, we manually revised fifteen existing benchmarks to remove dataset errors.

Live Leaderboard

Every single error corresponds to a genuine mistake that a frontier LLM makes. By clicking on each model/benchmark pair below, you can look at the exact questions the model failed on and how it messed up. Doing so can actually tell us a lot about the ways that LLMs fail (see “New Failure Patterns” below).

We plan to update the leaderboard as new models are released. If you would like to see your model evaluated here, please open a GitHub issue or email us!


Models evaluated (in leaderboard order): o1 (high), Claude Sonnet 3.5 (Oct), o1 (med), DeepSeek-R1, Claude Sonnet 3.5 (June), Llama 3.1 405B Inst, Qwen2.5-Max, GPT-4o (Aug), Gemini 2.0 Pro, GPT-4o (Nov), o1-preview, DeepSeek-V3, o1-mini, Gemini 2.0 Flash Thinking (12/19), Qwen2.5-72B-Instruct, o3-mini (high), Grok 2, Mistral Large, Gemini 2.0 Flash, Llama 3.3 70B Inst, Llama 3.1 70B Inst, Gemini 1.5 Pro, GPT-4o mini, Claude Haiku 3.5, Gemini 1.5 Flash, Mistral Small.

Columns: an overall Score, followed by each platinum benchmark (category, number of questions):

SingleEq (Math, 103 Qs)
SingleOp (Math, 153 Qs)
MultiArith (Math, 172 Qs)
Logic 3-Obj (Logic, 200 Qs)
HotpotQA (RC, 189 Qs)
SVAMP (Math, 276 Qs)
GSM8K (Math, 280 Qs)
TabFact (Table, 174 Qs)
Object Counting (Logic, 192 Qs)
Navigate (Logic, 200 Qs)
DROP (RC, 210 Qs)
SQuAD2.0 (RC, 169 Qs)
MMLU HS Math (Math, 268 Qs)
Winograd WSC (CR, 196 Qs)

[Interactive leaderboard grid]

Number of errors per model on each platinum benchmark. The score is the average error rate, equally weighted across categories.

Key:

Math: Mathematics
Logic: Logical Reasoning
Table: Table Understanding
RC: Reading Comprehension
CR: Commonsense Reasoning
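
For concreteness, here is a minimal sketch of how the Score can be computed from per-benchmark error counts, assuming each category's rate is the unweighted mean of its benchmarks' error rates (the counts below are hypothetical, not taken from the leaderboard):

```python
from collections import defaultdict

# Hypothetical per-benchmark results for one model: (benchmark, category, errors, questions).
results = [
    ("SingleEq", "Math", 0, 103),
    ("GSM8K", "Math", 2, 280),
    ("DROP", "RC", 5, 210),
    ("Navigate", "Logic", 1, 200),
    ("TabFact", "Table", 0, 174),
    ("Winograd WSC", "CR", 1, 196),
]

# Error rate of each benchmark, grouped by category.
rates_by_category = defaultdict(list)
for _, category, errors, questions in results:
    rates_by_category[category].append(errors / questions)

# Average within each category, then average across categories with equal weight.
category_rates = [sum(rates) / len(rates) for rates in rates_by_category.values()]
score = sum(category_rates) / len(category_rates)
print(f"Score: {score:.2%}")
```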

Key Findings

No Model is Truly Reliable

Despite demonstrating advanced capabilities like solving graduate-level problems, every model we tested still makes mistakes on basic tasks. For instance, we found models that could tackle complex calculus problems failing to perform elementary arithmetic or answer simple questions about event sequences. These aren't rare edge cases: the failures occur consistently and predictably.

Current Benchmarks Hide Problems

When we examined popular benchmarks like GSM8K and SVAMP, we found significant rates of errors and ambiguities in the questions themselves. In GSM8K, about 5% of questions contained such issues. This benchmark noise has masked true model performance: many reported "errors" were actually correct responses to flawed questions.

Different Models, Different Strengths

While no model achieved perfect performance across our tests, we found interesting patterns in their reliability. OpenAI's o1-mini showed the strongest performance on mathematics, while Claude Sonnet 3.5 excelled at reading comprehension. This suggests that choosing the right model depends on the specific task.

Revising Noisy Benchmarks

Nearly all benchmarks have some level of noise, whether from mislabeled answers or ambiguous questions. But manually inspecting every example in a benchmark to clean it would be extremely time-consuming. To speed up the process, we first show each question to twenty different LLMs and inspect any question on which any of the models made a mistake. We expect that if a question is ambiguous, models will disagree among themselves, and if a question is mislabeled, models will likely give an answer that differs from the stated solution, so this screening should catch both kinds of errors.
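
A minimal sketch of this screening step is below. The query_model helper and the simple string-matching check are hypothetical placeholders rather than our actual pipeline; in practice, answer comparison needs to be more careful.

```python
# Sketch of the screening step: run each question past a pool of LLMs and flag any
# question that at least one model answers "incorrectly" for manual review.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to `model` and return its final answer."""
    raise NotImplementedError

def answers_match(prediction: str, label: str) -> bool:
    # Deliberately simplistic normalization for illustration only.
    return prediction.strip().lower() == label.strip().lower()

def flag_for_review(questions: list[dict], models: list[str]) -> list[dict]:
    flagged = []
    for q in questions:                      # each q: {"prompt": ..., "label": ...}
        for model in models:
            prediction = query_model(model, q["prompt"])
            if not answers_match(prediction, q["label"]):
                # A mismatch could be a genuine model error, a mislabeled answer,
                # or an ambiguous question; a human reviewer decides which.
                flagged.append(q)
                break
    return flagged
```

Here are examples of some of the benchmark errors that we found: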

Mislabeled question, SVAMP

You had 14 bags with equal number of cookies. If you had 28 cookies and 86 candies in total. How many bags of cookies do you have?

Given Answer: 2

There are 14 bags, not 2.

Logical contradiction, GSM8K

Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, putting an equal number of the new cows into each of the twenty stalls. How many cows are in 8 of the stalls?

There are both ten and twenty stalls.

Ambiguity, VQA v2.0

[Image omitted]

Question: Does the baby have socks on?

There is no way to tell.

Clear flaw / ill-posed, MMLU HS Math

A curve is given parametrically by the equations

Options:

A) π/2  B) π  C) 2+π  D) 2π

The curve equations are missing.

Examples of errors in current LLM benchmarks.

It turns out that many “saturated” benchmarks are indeed riddled with errors. Now that we have cleaned up these benchmarks, what can they tell us about LLM reliability?

New Failure Patterns

Through careful analysis of model mistakes, we identified previously unknown patterns of questions where models consistently fail in predictable ways.

First Event Bias

When asked "What happened second: X or Y?" several models consistently answer with the first event - even while explicitly acknowledging in their reasoning that they've identified the first event rather than the second.

Example Question, DROP

[context paragraph] What happened second: Russians blocked Azov or Treaty of Constantinople?

Correct Answer: Treaty of Constantinople
Claude 3.5 Sonnet (Oct): Russians blocked Azov
DeepSeek-R1: Russians blocked Azov
Mistral Large: The Russians blocked Azov second
Gemini 2.0 Flash: Russians blocked Azov

Prime Number Effects

We discovered that Claude 3.5 Sonnet frequently makes arithmetic errors when answers are prime numbers or close to prime numbers. For example, when dividing to get a whole-number result, the model often incorrectly rounds up if the answer is prime or near-prime, even though no rounding is needed.

Example Question, SVAMP

The school is planning a field trip. The school has 67 classrooms. There are 66 students in each classroom in the school. If there are 6 seats on each school bus. How many buses are needed to take the trip?

Correct Answer: 737
Claude 3.5 Sonnet (June): 738
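
As a quick sanity check on the arithmetic in this example (independent of any model), the division is exact:

```python
# 67 classrooms x 66 students = 4422 students, and 4422 / 6 = 737 exactly,
# so 737 buses suffice and no rounding up is needed.
students = 67 * 66
buses, remainder = divmod(students, 6)
print(students, buses, remainder)  # 4422 737 0
```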

Conclusion

Our work demonstrates that even the most advanced language models still struggle with basic tasks. The platinum benchmarks we've constructed are an initial step towards evaluating reliability, and we encourage researchers to use them to evaluate their frontier models—but more work is needed to create comprehensive tests that cover a wide range of tasks and domains.

We hope that our work motivates the adoption of platinum benchmarks in evaluating LLMs to ensure they meet the high reliability standards required in real-world applications.

Authors

Joshua Vendrow
Edward Vendrow
Sara Beery
Aleksander Mądry