bnew

Veteran
Joined
Nov 1, 2015
Messages
64,223
Reputation
9,810
Daps
174,668
Is AI already superhuman at FrontierMath? o4-mini defeats most *teams* of mathematicians in a competition



Posted on Mon May 26 18:21:16 2025 UTC




Full report: Is AI already superhuman on FrontierMath?



1/11
@EpochAIResearch
Is AI already superhuman at FrontierMath?

To answer this question, we ran a competition at MIT, pitting eight teams of mathematicians against o4-mini-medium.

Result: o4-mini beat all but two teams. And while AIs aren't yet clearly superhuman, they probably will be soon.



2/11
@EpochAIResearch
Our competition included around 40 mathematicians, split into teams of four or five, and with a roughly even mix of subject matter experts and exceptional undergrads on each team. We then gave them 4.5h and internet access to answer 23 challenging FrontierMath questions.

3/11
@EpochAIResearch
By design, FrontierMath draws on a huge range of fields. To obtain a meaningful human baseline that tests reasoning abilities rather than breadth of knowledge, we chose problems that need less background knowledge, or were tailored to the background expertise of participants.



4/11
@EpochAIResearch
The human teams solved 19% of the problems on average, while o4-mini-medium solved ~22%. But every problem that o4-mini could complete was also solved by at least one human team, and the human teams collectively solved around 35%.

5/11
@EpochAIResearch
But what does this mean for the human baseline on FrontierMath? Since the competition problems weren’t representative of the complete FrontierMath benchmark, we need to adjust these numbers to reflect the full benchmark’s difficulty distribution.

6/11
@EpochAIResearch
Adjusting our competition results for difficulty suggests that the human baseline is 30-50%, but this result seems highly suspect – making the same adjustment to o4-mini predicts that it would get 37% on the full benchmark, compared to 19% from our actual evaluations.

7/11
@EpochAIResearch
Unfortunately, it thus seems hard to get a clear “human baseline” on FrontierMath. But if 30-50% is indeed the relevant human baseline, it seems quite likely that AIs will be superhuman by the end of the year.

8/11
@EpochAIResearch
Read the full analysis here: Is AI already superhuman on FrontierMath?

9/11
@Alice_comfy
Very interesting. Imagine Gemini 2.5 Pro Deepthink is probably the turning point (at least on these kinds of contests).

10/11
@NeelNanda5
Qs:
* Why o4-mini-medium, rather than high or o3?
* What happens if you give the LLM pass@8? Automatically checking correctness is easy for maths, I imagine, so this is just de facto more inference time compute (comparing a 5 person team to one LLM is already a bit unfair anyway)

11/11
@sughanthans1
Why not o3
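
For anyone wondering what the pass@8 question in tweet 10/11 would actually measure, here is a minimal sketch. It uses the standard unbiased pass@k estimator from code-generation evaluations, which is an assumption on my part: the thread does not say how Epoch would score repeated attempts. The function names and the (n, c) sample counts are hypothetical, not data from the competition.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts is correct,
    given that c out of n sampled attempts on a problem were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong attempts exist, so any k attempts include a correct one.
        return 1.0
    # 1 minus the probability that all k drawn attempts are wrong.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over hypothetical (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 this reduces to plain accuracy (c / n), so pass@8 can only match or exceed the single-attempt score. That is why the tweet describes it as "de facto more inference time compute".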


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Eric Schmidt predicts that within a year or two, we will have a breakthrough of "super-programmers" and "AI mathematicians"


Posted on Mon May 26 09:33:37 2025 UTC


Video from Haider. on 𝕏:






1/11
@slow_developer
Eric Schmidt predicts that within a year or two, we will have a breakthrough of "super-programmers" and "AI mathematicians"

software is "scale-free" — it doesn’t need real-world input, just code and feedback. try, test, repeat.

AI can run this loop millions of times in minutes

https://video.twimg.com/amplify_video/1926668617321512960/vid/avc1/1080x1080/lw1aTURGOk_psvKi.mp4

2/11
@techikansh
Haider, how do u put captions(subtitles) in ur video??

3/11
@slow_developer
i use @OpusClip

4/11
@petepetrash
It's funny to hear someone confidently claim "super programmers" are a year away after trying to update a small NextJS 14 project to v15 using state of the art models (o3 / Opus 4) and watching them hit a wall almost immediately.

5/11
@ewgenijwolkow
thats not the definition of scale free

6/11
@MrChrisEllis
Doesn’t need real world input? You mean apart from the electricity, user generated content, cheap labour to make the chips and computers and the rare earth minerals? Maybe @sama could mine them himself in the DRC paid in WorldCoin

Gr0ef6EXIAAtwlu.jpg


7/11
@ezcrypt
Source?

8/11
@TonyIsHere4You
That's true of the logical structure of code, but the point of code in the real world has been to instruct hardware to do something, not engage in rote self-interaction.

9/11
@diligentium
Eric Schmidt looks great!

10/11
@hzdydx9
This changes the game

11/11
@M_Zot_ike
This is why :

Grz0xqRW0AARkD1.jpg

Grz0yOBXwAAWcr4.jpg

Grz0zJwXQAEseHm.jpg


