Do Large Language Model Benchmarks Test Reliability?

Joshua Vendrow*, Edward Vendrow*, Sara Beery†, and Aleksander Mądry†

Massachusetts Institute of Technology

*Equal contribution. †Equal advising.

When deploying LLMs in real-world applications, reliability is crucial: models need to consistently provide correct answers, not just perform well on average. To measure this kind of reliability, we propose platinum benchmarks, benchmarks carefully curated to minimize label errors and ambiguity so that perfect performance is possible. As a first attempt at constructing such benchmarks, we manually revised fifteen existing benchmarks to remove dataset errors.

Live Leaderboard

Every single error corresponds to a genuine mistake that a frontier LLM makes. By clicking on each model/benchmark pair below, you can look at the exact questions the model failed on and how it messed up. Doing so can actually tell us a lot about the ways that LLMs fail (see “New Failure Patterns” below).

We plan to update the leaderboard as new models are released. If you would like to see your model evaluated here, please open a GitHub issue or email us!


Models evaluated (in leaderboard order): o1 (high), Claude Sonnet 3.5 (Oct), o1 (med), DeepSeek-R1, Claude Sonnet 3.5 (June), Llama 3.1 405B Inst, Qwen2.5-Max, GPT-4o (Aug), Gemini 2.0 Pro, GPT-4o (Nov), o1-preview, DeepSeek-V3, o1-mini, Gemini 2.0 Flash Thinking (12/19), Qwen2.5-72B-Instruct, o3-mini (high), Grok 2, Mistral Large, Gemini 2.0 Flash, Llama 3.3 70B Inst, Llama 3.1 70B Inst, Gemini 1.5 Pro, GPT-4o mini, Claude Haiku 3.5, Gemini 1.5 Flash, Mistral Small.

Columns: an overall Score, followed by each platinum benchmark (category, number of questions):

SingleEq (Math, 103 Qs)
SingleOp (Math, 153 Qs)
MultiArith (Math, 172 Qs)
Logic 3-Obj (Logic, 200 Qs)
HotpotQA (RC, 189 Qs)
SVAMP (Math, 276 Qs)
GSM8K (Math, 280 Qs)
TabFact (Table, 174 Qs)
Object Counting (Logic, 192 Qs)
Navigate (Logic, 200 Qs)
DROP (RC, 210 Qs)
SQuAD2.0 (RC, 169 Qs)
MMLU HS Math (Math, 268 Qs)
Winograd WSC (CR, 196 Qs)

[Interactive leaderboard grid]

Number of errors per model on each platinum benchmark. The score is the average error rate, equally weighted across categories.

Key:

Math: Mathematics
Logic: Logical Reasoning
Table: Table Understanding
RC: Reading Comprehension
CR: Commonsense Reasoning
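
For concreteness, here is a minimal sketch of how the Score can be computed from per-benchmark error counts, assuming each category's rate is the unweighted mean of its benchmarks' error rates (the counts below are hypothetical, not taken from the leaderboard):

```python
from collections import defaultdict

# Hypothetical per-benchmark results for one model: (benchmark, category, errors, questions).
results = [
    ("SingleEq", "Math", 0, 103),
    ("GSM8K", "Math", 2, 280),
    ("DROP", "RC", 5, 210),
    ("Navigate", "Logic", 1, 200),
    ("TabFact", "Table", 0, 174),
    ("Winograd WSC", "CR", 1, 196),
]

# Error rate of each benchmark, grouped by category.
rates_by_category = defaultdict(list)
for _, category, errors, questions in results:
    rates_by_category[category].append(errors / questions)

# Average within each category, then average across categories with equal weight.
category_rates = [sum(rates) / len(rates) for rates in rates_by_category.values()]
score = sum(category_rates) / len(category_rates)
print(f"Score: {score:.2%}")
```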

Key Findings

No Model is Truly Reliable

Despite demonstrating advanced capabilities like solving graduate-level problems, every model we tested still makes mistakes on basic tasks. For instance, we found models that could tackle complex calculus problems failing to perform elementary arithmetic or answer simple questions about event sequences. These aren't rare edge cases: the failures occur consistently and predictably.

Current Benchmarks Hide Problems

When we examined popular benchmarks like GSM8K and SVAMP, we found significant rates of errors and ambiguities in the questions themselves. In GSM8K, about 5% of questions contained such issues. This benchmark noise has masked true model performance: many reported "errors" were actually correct responses to flawed questions.

Different Models, Different Strengths

While no model achieved perfect performance across our tests, we found interesting patterns in their reliability. OpenAI's o1-mini showed the strongest performance on mathematics, while Claude Sonnet 3.5 excelled at reading comprehension. This suggests that choosing the right model depends on the specific task.

Revising Noisy Benchmarks

Nearly all benchmarks have some level of noise, whether from mislabeled answers or ambiguous questions. But manually inspecting every example in a benchmark to clean it would be extremely time-consuming. To speed up the process, we first show each question to twenty different LLMs and inspect any question on which any of the models made a mistake. We expect that if a question is ambiguous, models will disagree among themselves, and if a question is mislabeled, models will likely give an answer that differs from the stated solution, so this screening should catch both kinds of errors.
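
A minimal sketch of this screening step is below. The query_model helper and the simple string-matching check are hypothetical placeholders rather than our actual pipeline; in practice, answer comparison needs to be more careful.

```python
# Sketch of the screening step: run each question past a pool of LLMs and flag any
# question that at least one model answers "incorrectly" for manual review.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to `model` and return its final answer."""
    raise NotImplementedError

def answers_match(prediction: str, label: str) -> bool:
    # Deliberately simplistic normalization for illustration only.
    return prediction.strip().lower() == label.strip().lower()

def flag_for_review(questions: list[dict], models: list[str]) -> list[dict]:
    flagged = []
    for q in questions:                      # each q: {"prompt": ..., "label": ...}
        for model in models:
            prediction = query_model(model, q["prompt"])
            if not answers_match(prediction, q["label"]):
                # A mismatch could be a genuine model error, a mislabeled answer,
                # or an ambiguous question; a human reviewer decides which.
                flagged.append(q)
                break
    return flagged
```

Here are examples of some of the benchmark errors that we found: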

Mislabeled question, SVAMP

You had 14 bags with equal number of cookies. If you had 28 cookies and 86 candies in total. How many bags of cookies do you have?

Given Answer: 2

There are 14 bags, not 2.

Logical contradiction, GSM8K

Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, putting an equal number of the new cows into each of the twenty stalls. How many cows are in 8 of the stalls?

There are both ten and twenty stalls.

Ambiguity, VQA v2.0

[Image omitted]

Question: Does the baby have socks on?

There is no way to tell.

Clear flaw / ill-posed, MMLU HS Math

A curve is given parametrically by the equations

Options:

A) π/2  B) π  C) 2+π  D) 2π

The curve equations are missing.

Examples of errors in current LLM benchmarks.

It turns out that many “saturated” benchmarks are indeed riddled with errors. Now that we have cleaned up these benchmarks, what can they tell us about LLM reliability?

New Failure Patterns

Through careful analysis of model mistakes, we identified previously unknown patterns of questions where models consistently fail in predictable ways.

First Event Bias

When asked "What happened second: X or Y?" several models consistently answer with the first event - even while explicitly acknowledging in their reasoning that they've identified the first event rather than the second.

Example Question, DROP

[context paragraph] What happened second: Russians blocked Azov or Treaty of Constantinople?

Correct Answer: Treaty of Constantinople
Claude 3.5 Sonnet (Oct): Russians blocked Azov
DeepSeek-R1: Russians blocked Azov
Mistral Large: The Russians blocked Azov second
Gemini 2.0 Flash: Russians blocked Azov

Prime Number Effects

We discovered that Claude 3.5 Sonnet frequently makes arithmetic errors when answers are prime numbers or close to prime numbers. For example, when dividing to get a whole-number result, the model often incorrectly rounds up if the answer is prime or near-prime, even though no rounding is needed.

Example Question, SVAMP

The school is planning a field trip. The school has 67 classrooms. There are 66 students in each classroom in the school. If there are 6 seats on each school bus. How many buses are needed to take the trip?

Correct Answer: 737
Claude 3.5 Sonnet (June): 738
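
As a quick sanity check on the arithmetic in this example (independent of any model), the division is exact:

```python
# 67 classrooms x 66 students = 4422 students, and 4422 / 6 = 737 exactly,
# so 737 buses suffice and no rounding up is needed.
students = 67 * 66
buses, remainder = divmod(students, 6)
print(students, buses, remainder)  # 4422 737 0
```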

Conclusion

Our work demonstrates that even the most advanced language models still struggle with basic tasks. The platinum benchmarks we've constructed are an initial step towards evaluating reliability, and we encourage researchers to use them to evaluate their frontier models—but more work is needed to create comprehensive tests that cover a wide range of tasks and domains.

We hope that our work motivates the adoption of platinum benchmarks in evaluating LLMs to ensure they meet the high reliability standards required in real-world applications.

Authors

Joshua Vendrow
Edward Vendrow
Sara Beery
Aleksander Mądry