PlatinumBench
Overview
Error Viewer
Paper
Datasets
(NEW) GSM8K Full Test Set (Math): Full GSM8K-Platinum test set
SingleOp (Math): Single-operation arithmetic word problems
SingleEq (Math): Single-equation word problems
MultiArith (Math): Simple multi-step arithmetic word problems
SVAMP (Math): Elementary-level math word problems
GSM8K (Math): Grade-school math word problems
MMLU HS Math (Math): High school math
Logic 3-Obj (Logic): Three-object logical deduction
Object Counting (Logic): Count quantities of objects from a list
Navigate (Logic): Determine the position of an object after a series of navigation steps
TabFact (Table): Fact verification from Wikipedia tables
Hotpot QA (Reading Comp): Natural, multi-hop reading comprehension questions
SQuAD2.0 (Reading Comp): Answerable and unanswerable reading comprehension questions
DROP (Reading Comp): English reading comprehension over paragraphs
Winograd WSC (Commonsense): Common-sense reasoning for word disambiguation
Showing errors on bbh_navigate from claude-3-5-haiku
Pick a model to view errors for:
Mistral Small
Mistral Large
Gemini 1.5 Flash
Gemini 1.5 Pro
Llama 3.1 70B Inst
Llama 3.1 405B Inst
GPT 4o mini
GPT-4o (Nov)
GPT-4o (Aug)
Claude Sonnet 3.5 (June)
Llama 3.3 70B Inst
Grok 2
Claude Sonnet 3.5 (Oct)
Claude Haiku 3.5
Gemini 2.0 Flash
Gemini 2.0 Flash Thinking (12/19)
Deepseek-V3
Qwen2.5-72B-Instruct
Deepseek-R1
o1-mini
o1-preview
o1 (med)
o1 (high)
o3-mini (high)
Gemini 2.0 Pro
Qwen2.5-Max
Claude 3.7 Sonnet
Claude 3.7 Sonnet Thinking 16k
GPT-4.5 (Preview)
Gemini 2.5 Pro (Exp)
Llama 4 Maverick
Grok 3 Beta
Grok 3 Mini Beta (high)
GPT 4.1 (2025-04-14)