AI that’s smarter than humans? Americans say a firm “no thank you.”

Black White Sox Hat

 

bnew

OpenAI's Mark Chen: "I still remember the meeting where they showed my [CodeForces] score and said, 'hey, the model is better than you!' I put decades of my life into this... I'm at the top of my field, and it's already better than me... It's sobering."



Posted on Sat Jun 7 17:14:16 2025 UTC

 

bnew

[Research] AI System Completes 12 Work-Years of Medical Research in 2 Days, Outperforms Human Reviewers



Posted on Thu Jun 19 13:28:36 2025 UTC

/r/OpenAI/comments/1lfau5l/ai_system_completes_12_workyears_of_medical/

Harvard and MIT researchers have developed "otto-SR," an AI system that automates systematic reviews - the gold standard for medical evidence synthesis that typically takes over a year to complete.

Key Findings:

Speed: Reproduced an entire issue of Cochrane Reviews (12 reviews) in 2 days, representing ~12 work-years of traditional research
Accuracy: 93.1% data extraction accuracy vs 79.7% for human reviewers
Screening Performance: 96.7% sensitivity vs 81.7% for human dual-reviewer workflows
Discovery: Found studies that original human reviewers missed (median of 2 additional eligible studies per review)
Impact: Generated newly statistically significant conclusions in 2 reviews, negated significance in 1 review

Why This Matters:

Systematic reviews are critical for evidence-based medicine but are incredibly time-consuming and resource-intensive. This research demonstrates that LLMs can not only match but exceed human performance in this domain.

The implications are significant - instead of waiting years for comprehensive medical evidence synthesis, we could have real-time, continuously updated reviews that inform clinical decision-making much faster.

The system incorrectly excluded a median of 0 studies across all Cochrane reviews tested, suggesting it's both more accurate and more comprehensive than traditional human workflows.

This could fundamentally change how medical research is synthesized and how quickly new evidence reaches clinical practice.

https://www.medrxiv.org/content/10.1101/2025.06.13.25329541v1.full.pdf
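
For readers wondering what the sensitivity and accuracy figures above actually measure, here is a rough sketch (my own illustration with made-up counts, not the otto-SR code from the paper):

# Illustrative only: hypothetical counts, not data from the paper.

def sensitivity(true_positives: int, false_negatives: int) -> float:
    # Share of truly eligible studies that the screening step kept.
    return true_positives / (true_positives + false_negatives)

def extraction_accuracy(correct_fields: int, total_fields: int) -> float:
    # Share of extracted data fields that match the reference values.
    return correct_fields / total_fields

# Hypothetical example: 29 of 30 eligible studies kept, 931 of 1,000 fields correct.
print(f"screening sensitivity: {sensitivity(29, 1):.1%}")              # 96.7%
print(f"extraction accuracy:   {extraction_accuracy(931, 1000):.1%}")  # 93.1%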
 

bnew

Exhausted man defeats AI model in world coding championship: "Humanity has prevailed (for now!)," writes winner after 10-hour coding marathon against OpenAI.



Posted on Fri Jul 18 20:45:59 2025 UTC

 

bnew












1/37
@alexwei_
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).





2/37
@alexwei_
2/N We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.





3/37
@alexwei_
3/N Why is this a big deal? First, IMO problems demand a new level of sustained creative thinking compared to past benchmarks. In reasoning time horizon, we’ve now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins).



4/37
@alexwei_
4/N Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.
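
For intuition on what "beyond the RL paradigm of clear-cut, verifiable rewards" means, here is a minimal sketch (my own illustration, not OpenAI's actual method): an AIME-style answer can be rewarded by exact match, while an IMO-style proof has to be scored by some grader, human or model.

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    # AIME-style: the answer is a single integer, so the reward is exact match.
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def proof_reward(proof_text: str, grader) -> float:
    # IMO-style: a multi-page proof has no short reference answer, so the reward
    # must come from a grader (hypothetical callable returning a 0-7 mark).
    return grader(proof_text) / 7.0  # normalize the 0-7 mark to [0, 1]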





5/37
@alexwei_
5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.



6/37
@alexwei_
6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold! 🥇
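
A quick tally of that scoring, assuming full 7-point marks on each solved problem (a plausible reading of "5 of 6 solved" plus 35/42; the thread doesn't state the per-problem breakdown outright):

def finalize_score(grades: list) -> int:
    # Per the protocol above, a mark is finalized only once all three graders agree.
    assert len(set(grades)) == 1, "graders have not reached unanimous consensus yet"
    return grades[0]

# Hypothetical per-problem grades for P1-P6: full marks on P1-P5, zero on P6.
problem_grades = [[7, 7, 7]] * 5 + [[0, 0, 0]]
total = sum(finalize_score(g) for g in problem_grades)
print(f"{total}/42")  # 35/42 -- enough for gold, per the thread above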



7/37
@alexwei_
7/N HUGE congratulations to the team—@SherylHsu02, @polynoamial, and the many giants whose shoulders we stood on—for turning this crazy dream into reality! I am lucky I get to spend late nights and early mornings working alongside the very best.



8/37
@alexwei_
8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.



9/37
@alexwei_
9/N Still—this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor @JacobSteinhardt had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.





10/37
@alexwei_
10/N If you want to take a look, here are the model’s solutions to the 2025 IMO problems! The model solved P1 through P5; it did not produce a solution for P6. (Apologies in advance for its … distinct style—it is very much an experimental model 😅)

GitHub - aw31/openai-imo-2025-proofs



11/37
@alexwei_
11/N Lastly, we'd like to congratulate all the participants of the 2025 IMO on their achievement! We are proud to have many past IMO participants at @OpenAI and recognize that these are some of the brightest young minds of the future.



12/37
@burny_tech
Soooo what is the breakthrough?
>"Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians."
>"We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling."





13/37
@burny_tech
so let me get this straight

their model basically competed live on the IMO, so all the mathematical tasks should be novel enough

all previous years' IMO tasks in benchmarks are fully saturated, in large part because of data contamination, since that performance doesn't generalize to these new ones

so... this new model seems to... generalize well to novel enough mathematical tasks??? I don't know what to think



14/37
@AlbertQJiang
Congratulations!



15/37
@geo58928
Amazing



16/37
@burny_tech
So public AI models are bad at IMO, while internal models are getting gold medals? Fascinating





17/37
@mhdfaran
@grok who was on second and third



18/37
@QuanquanGu
Congrats, these are incredible results!
Quick question: did it use Lean, or just LLM?
If it’s just LLM… that’s insane.



19/37
@AISafetyMemes
So what's the next goalpost?

What's the next thing LLMs will never be able to do?



20/37
@kimmonismus
Absolutely fantastic



21/37
@CtrlAltDwayne
pretty impressive. is this the anonymous chatbot we're seeing on webdev arena by chance?



22/37
@burny_tech
lmao





23/37
@jack_w_rae
Congratulations! That's an incredible result, and a great moment for AI progress. You guys should release the model



24/37
@Kyrannio
Incredible work.



25/37
@burny_tech
Sweet Bitter lesson





26/37
@burny_tech
"We developed new techniques that make LLMs a lot better at hard-to-verify tasks."
A general method? Or just for mathematical proofs? Is Lean somehow used, maybe just in training?



27/37
@elder_plinius




28/37
@skominers
🙌🙌🙌🙌🙌🙌



29/37
@javilopen
Hey @GaryMarcus, what are your thoughts about this?



30/37
@pr0me
crazy feat, congrats!
nice that you have published the data on this



31/37
@danielhanchen
Impressive!



32/37
@IamEmily2050
Congratulations 🙏



33/37
@burny_tech
Step towards mathematical superintelligence



34/37
@reach_vb
Massive feat! I love how concise and to-the-point the generations are, unlike the majority of LLMs, open/closed alike 😁



35/37
@DCbuild3r
Congratulations!



36/37
@DoctorYev
I just woke up and this post has 1M views after a few hours.

AI does not sleep.



37/37
@AndiBunari1
@grok summarize this and simple to understand














1/10
@polynoamial
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵

[Quoted tweet]
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).




2/10
@polynoamial
Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.



3/10
@polynoamial
So what’s different? We developed new techniques that make LLMs a lot better at hard-to-verify tasks. IMO problems were the perfect challenge for this: proofs are pages long and take experts hours to grade. Compare that to AIME, where answers are simply an integer from 0 to 999.



4/10
@polynoamial
Also this model thinks for a *long* time. o1 thought for seconds. Deep Research for minutes. This one thinks for hours. Importantly, it’s also more efficient with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

[Quoted tweet]
@OpenAI's o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks. Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots




5/10
@polynoamial
It’s worth reflecting on just how fast AI progress has been, especially in math. In 2024, AI labs were using grade school math (GSM8K) as an eval in their model releases. Since then, we’ve saturated the (high school) MATH benchmark, then AIME, and now are at IMO gold.



6/10
@polynoamial
Where does this go? As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery. There’s a big difference between AI slightly below top human performance vs slightly above.



7/10
@polynoamial
This was a small team effort led by @alexwei_. He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community.



8/10
@polynoamial
When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is.



9/10
@posedscaredcity
But yann lec00n says accuracy scales inversely with output length, and I'm sure industry expert gary marcus would agree



10/10
@mrlnonai
will API cost be astronomical for this?




 

bnew

"What if AI gets so smart that the President of the United States cannot do better than following ChatGPT-7's recommendation, but can't really understand it either? What if I can't make a better decision about how to run OpenAI and just say, 'You know what, ChatGPT-7, you're in charge. Good luck."



Posted on Thu Jul 24 01:54:41 2025 UTC



Commented on Thu Jul 24 02:05:50 2025 UTC

Link to the full video please


│ Commented on Thu Jul 24 02:08:43 2025 UTC

LIVE: OpenAI CEO Sam Altman speaks with Fed’s Michelle Bowman on bank capital rules
 