Do Large Language Model Benchmarks Test Reliability?
Joshua Vendrow*, Edward Vendrow*, Sara Beery†, and Aleksander Mądry†
Massachusetts Institute of Technology
*Equal contribution. †Equal advising.
When deploying LLMs in real-world applications, reliability is crucial - models need to consistently provide correct answers, not just perform well on average. To measure this kind of reliability, we propose platinum benchmarks: benchmarks carefully curated to minimize label errors and ambiguity, on which perfect performance is possible. As a first attempt at constructing such benchmarks, we manually revised fifteen existing benchmarks to remove dataset errors.
Live Leaderboard
Every single error corresponds to a genuine mistake that a frontier LLM makes. By clicking on each model/benchmark pair below, you can look at the exact questions the model failed on and how it messed up. Doing so can actually tell us a lot about the ways that LLMs fail (see “New Failure Patterns” below).
We plan to update the leaderboard as new models are released. If you would like to see your model evaluated here, please open a GitHub issue or email us!
[Interactive leaderboard: each cell shows a model's error count (or 100% accuracy) on one platinum benchmark, along with each model's average error rate; click a cell to view the corresponding errors.]
Key Findings
No Model is Truly Reliable
Despite demonstrating advanced capabilities like solving graduate-level problems, every model we tested still makes mistakes on basic tasks. For instance, we found models that could tackle complex calculus problems failing to perform elementary arithmetic or answer simple questions about event sequences. These aren't rare edge cases - the failures occur consistently and predictably.
Current Benchmarks Hide Problems
When we examined popular benchmarks like GSM8K and SVAMP, we found significant rates of errors and ambiguities in the questions themselves. In GSM8K, about 5% of questions contained errors or ambiguities. This benchmark noise has masked true model performance - many reported "errors" were actually correct responses to flawed questions.
Different Models, Different Strengths
While no model achieved perfect performance across our tests, we found interesting patterns in their reliability. OpenAI's o1-mini showed the strongest performance on mathematics, while Claude 3.5 Sonnet excelled at reading comprehension. This suggests that choosing the right model depends on the specific task.
Revising Noisy Benchmarks
Nearly all benchmarks have some level of noise, whether from mislabeled answers or ambiguous questions. But cleaning them by manually inspecting every example would be extremely time-consuming. To speed up the process, we first show each question to twenty different LLMs and inspect any question on which at least one model made a mistake. If a question is ambiguous, we expect the models to disagree among themselves; if a question is mislabeled, the models are likely to give an answer that differs from the stated solution. Either way, this approach should catch these kinds of errors. Here are examples of some of the benchmark errors that we found (a rough sketch of the flagging step follows the examples):
"You had 14 bags with equal number of cookies. If you had 28 cookies and 86 candies in total. How many bags of cookies do you have?" Given answer: 2

"Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, putting an equal number of the new cows into each of the twenty stalls. How many cows are in 8 of the stalls?"

"Question: Does the baby have socks on?"

"A curve is given parametrically by the equations [...] Options: A) π/2  B) π  C) 2+π  D) 2π"

Examples of errors in current LLM benchmarks.
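As a rough illustration of the flagging step described above, here is a minimal Python sketch: it collects every question for which at least one model's answer disagrees with the benchmark's stated label, and those flagged questions are then inspected by hand. The `query_model` and `answers_match` helpers (and the list of models) are hypothetical placeholders, not our actual evaluation pipeline.

```python
def flag_questions_for_review(questions, models, query_model, answers_match):
    """Return benchmark questions that at least one model answers "incorrectly".

    A flagged question may reflect a genuine model error, a mislabeled answer,
    or an ambiguous prompt - the distinction is made during manual review.
    """
    flagged = []
    for q in questions:
        answers = {m: query_model(m, q["prompt"]) for m in models}
        if any(not answers_match(a, q["label"]) for a in answers.values()):
            flagged.append({"question": q, "model_answers": answers})
    return flagged
```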
It turns out that many “saturated” benchmarks are indeed riddled with errors. Now that we have cleaned up these benchmarks, what can they tell us about LLM reliability?
New Failure Patterns
Through careful analysis of model mistakes, we identified previously unknown patterns of questions where models consistently fail in predictable ways.
First Event Bias
When asked "What happened second: X or Y?" several models consistently answer with the first event - even while explicitly acknowledging in their reasoning that they've identified the first event rather than the second.
[context paragraph] What happened second: Russians blocked Azov or Treaty of Constantinople?
Prime Number Effects
We discovered that Claude 3.5 Sonnet frequently makes arithmetic errors when answers are prime numbers or close to prime numbers. For example, when dividing to get a whole number result, the model often incorrectly rounds up if the answer is prime, even though no rounding is needed.
The school is planning a field trip. The school has 67 classrooms. There are 66 students in each classroom in the school. If there are 6 seats on each school bus. How many buses are needed to take the trip?
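For reference, the arithmetic here works out exactly: 67 classrooms × 66 students is 4,422 students, and 4,422 is divisible by 6, so precisely 737 buses are needed and no rounding up is required. A quick check in plain Python, just verifying the numbers in the example above:

```python
import math

students = 67 * 66                      # 4,422 students in total
buses_needed = math.ceil(students / 6)

assert students % 6 == 0                # the division is exact...
print(buses_needed)                     # ...so the answer is exactly 737
```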
Conclusion
Our work demonstrates that even the most advanced language models still struggle with basic tasks. The platinum benchmarks we've constructed are an initial step towards evaluating reliability, and we encourage researchers to use them to evaluate their frontier models—but more work is needed to create comprehensive tests that cover a wide range of tasks and domains.
We hope that our work motivates the adoption of platinum benchmarks in evaluating LLMs to ensure they meet the high reliability standards required in real-world applications.