1/11
@EpochAIResearch
Is AI already superhuman at FrontierMath?
To answer this question, we ran a competition at MIT, pitting eight teams of mathematicians against o4-mini-medium.
Result: o4-mini beat all but two teams. And while AIs aren't yet clearly superhuman, they probably will be soon.
2/11
@EpochAIResearch
Our competition included around 40 mathematicians, split into teams of four or five, with a roughly even mix of subject-matter experts and exceptional undergrads on each team. We then gave them 4.5 hours and internet access to answer 23 challenging FrontierMath questions.
3/11
@EpochAIResearch
By design, FrontierMath draws on a huge range of fields. To obtain a meaningful human baseline that tests reasoning ability rather than breadth of knowledge, we chose problems that required less background knowledge or were tailored to the participants' expertise.
4/11
@EpochAIResearch
The human teams solved 19% of the problems on average, while o4-mini-medium solved ~22%. But every problem that o4-mini could complete was also solved by at least one human team, and the human teams collectively solved around 35%.
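To make the two human numbers concrete: 19% is the mean solve rate across teams, while the ~35% collective figure counts a problem as solved if any team got it. A minimal sketch of that distinction, using placeholder data rather than the actual competition results:

```python
# Illustrative sketch only: placeholder data, not the real competition results.
# solved[i][j] is True if team i solved problem j.
solved = [
    [True,  False, False, True,  False],
    [True,  True,  False, False, False],
    [False, False, True,  False, False],
]
n_teams, n_problems = len(solved), len(solved[0])

# Per-team solve rates, then the "average team" rate reported in the thread.
per_team = [sum(row) / n_problems for row in solved]
average_rate = sum(per_team) / n_teams

# Collective rate: a problem counts if at least one team solved it.
collective_rate = sum(
    any(solved[i][j] for i in range(n_teams)) for j in range(n_problems)
) / n_problems

print(f"average team: {average_rate:.0%}, union of teams: {collective_rate:.0%}")
```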
5/11
@EpochAIResearch
But what does this mean for the human baseline on FrontierMath? Since the competition problems weren’t representative of the complete FrontierMath benchmark, we need to adjust these numbers to reflect the full benchmark’s difficulty distribution.
6/11
@EpochAIResearch
Adjusting our competition results for difficulty suggests that the human baseline is 30-50%. But this result seems highly suspect: making the same adjustment for o4-mini predicts a score of 37% on the full benchmark, compared to the 19% it gets in our actual evaluations.
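The thread doesn't spell out the adjustment, but a natural reading is to reweight per-difficulty-tier solve rates from the competition by the full benchmark's tier mix. A rough sketch under that assumption, with made-up tier counts and weights rather than FrontierMath's actual distribution:

```python
# Hedged sketch of a difficulty adjustment. All numbers are placeholders.

# Competition results per difficulty tier: (problems attempted, problems solved)
competition = {"tier1": (10, 5), "tier2": (9, 2), "tier3": (4, 0)}

# Assumed share of each tier in the full FrontierMath benchmark
benchmark_mix = {"tier1": 0.25, "tier2": 0.50, "tier3": 0.25}

# Solve rate observed at the competition within each tier
per_tier_rate = {t: s / a for t, (a, s) in competition.items()}

# Projected full-benchmark score = sum over tiers of (tier share * tier solve rate)
projected = sum(benchmark_mix[t] * per_tier_rate[t] for t in benchmark_mix)
print(f"projected full-benchmark solve rate: {projected:.0%}")
```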
7/11
@EpochAIResearch
Unfortunately, it thus seems hard to get a clear “human baseline” on FrontierMath. But if 30-50% is indeed the relevant human baseline, it seems quite likely that AIs will be superhuman by the end of the year.
8/11
@EpochAIResearch
Read the full analysis here:
Is AI already superhuman on FrontierMath?
9/11
@Alice_comfy
Very interesting. I imagine Gemini 2.5 Pro Deepthink is probably the turning point (at least on these kinds of contests).
10/11
@NeelNanda5
Qs:
* Why o4-mini-medium, rather than high or o3?
* What happens if you give the LLM pass@8? Automatically checking correctness is easy for maths, I imagine, so this is just de facto more inference-time compute (comparing a five-person team to one LLM is already a bit unfair anyway).
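For reference, the pass@8 setup suggested in the question amounts to something like the sketch below; query_model is a hypothetical stand-in rather than any real API, and the grading step assumes FrontierMath-style machine-checkable final answers:

```python
import random

def query_model(problem: str, seed: int) -> str:
    """Hypothetical stand-in for a model call returning a final answer string.
    (Placeholder only: this just simulates noisy answers.)"""
    random.seed(seed)
    return str(random.choice([41, 42, 43]))

def pass_at_k(problem: str, reference_answer: str, k: int = 8) -> bool:
    # Sample k independent attempts; since the benchmark's answers are
    # machine-checkable, grading reduces to a simple string comparison.
    attempts = [query_model(problem, seed=i) for i in range(k)]
    return any(a.strip() == reference_answer.strip() for a in attempts)

print(pass_at_k("toy problem", "42", k=8))
```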
11/11
@sughanthans1
Why not o3?