bnew

Veteran
Joined
Nov 1, 2015
Messages
64,303
Reputation
9,832
Daps
174,823
Is AI already superhuman at FrontierMath? o4-mini defeats most *teams* of mathematicians in a competition



Posted on Mon May 26 18:21:16 2025 UTC




Full report: Is AI already superhuman on FrontierMath?



1/11
@EpochAIResearch
Is AI already superhuman at FrontierMath?

To answer this question, we ran a competition at MIT, pitting eight teams of mathematicians against o4-mini-medium.

Result: o4-mini beat all but two teams. And while AIs aren't yet clearly superhuman, they probably will be soon.



2/11
@EpochAIResearch
Our competition included around 40 mathematicians, split into teams of four or five, and with a roughly even mix of subject matter experts and exceptional undergrads on each team. We then gave them 4.5h and internet access to answer 23 challenging FrontierMath questions.

3/11
@EpochAIResearch
By design, FrontierMath draws on a huge range of fields. To obtain a meaningful human baseline that tests reasoning abilities rather than breadth of knowledge, we chose problems that need less background knowledge, or were tailored to the background expertise of participants.



4/11
@EpochAIResearch
The human teams solved 19% of the problems on average, while o4-mini-medium solved ~22%. But every problem that o4-mini could complete was also solved by at least one human team, and the human teams collectively solved around 35%.

5/11
@EpochAIResearch
But what does this mean for the human baseline on FrontierMath? Since the competition problems weren’t representative of the complete FrontierMath benchmark, we need to adjust these numbers to reflect the full benchmark’s difficulty distribution.

6/11
@EpochAIResearch
Adjusting our competition results for difficulty suggests that the human baseline is 30-50%, but this result seems highly suspect – making the same adjustment to o4-mini predicts that it would get 37% on the full benchmark, compared to 19% from our actual evaluations.
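The re-weighting Epoch describes can be sketched as follows: measure accuracy per difficulty tier on the competition subset, then weight those per-tier accuracies by the full benchmark's difficulty mix. This is a minimal illustrative sketch; the tier names, accuracies, and mix below are invented for illustration, not Epoch's actual data or methodology.

```python
# Hypothetical sketch of difficulty re-weighting: extrapolate accuracy
# measured on a competition subset to the full benchmark's tier mix.
# All numbers are illustrative, not Epoch's data.

def adjusted_score(per_tier_accuracy: dict, full_benchmark_mix: dict) -> float:
    """Weight per-tier accuracy by the full benchmark's difficulty distribution."""
    assert abs(sum(full_benchmark_mix.values()) - 1.0) < 1e-9
    return sum(per_tier_accuracy[t] * w for t, w in full_benchmark_mix.items())

# The competition subset skews easier; the full benchmark has more hard problems.
competition_accuracy = {"easy": 0.60, "medium": 0.25, "hard": 0.05}
full_mix = {"easy": 0.25, "medium": 0.35, "hard": 0.40}

print(adjusted_score(competition_accuracy, full_mix))
```

The mismatch Epoch flags (predicted 37% vs. measured 19% for o4-mini) is exactly the failure mode of this kind of extrapolation: if the per-tier accuracies estimated on a small subset don't transfer to the full benchmark, the adjusted number is biased.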

7/11
@EpochAIResearch
Unfortunately, it thus seems hard to get a clear “human baseline” on FrontierMath. But if 30-50% is indeed the relevant human baseline, it seems quite likely that AIs will be superhuman by the end of the year.

8/11
@EpochAIResearch
Read the full analysis here: Is AI already superhuman on FrontierMath?

9/11
@Alice_comfy
Very interesting. Imagine Gemini 2.5 Pro Deepthink is probably the turning point (at least on these kind of contests).

10/11
@NeelNanda5
Qs:
* Why o4-mini-medium, rather than high or o3?
* What happens if you give the LLM pass@8? Automatically checking correctness is easy for maths, I imagine, so this is just de facto more inference time compute (comparing a 5 person team to one LLM is already a bit unfair anyway)
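Neel's pass@8 suggestion is straightforward to implement when problems have auto-checkable final answers: sample k attempts per problem and count a problem solved if any attempt matches the reference. A minimal sketch (all names and data are illustrative; this is the empirical any-of-k rate, not the unbiased pass@k estimator used in some papers):

```python
# Sketch of pass@k scoring for math problems with checkable final answers.
# Illustrative only: exact string match stands in for a real answer checker.

def passes_at_k(samples: list, reference: str, k: int) -> bool:
    """True if any of the first k sampled answers matches the reference."""
    return any(s.strip() == reference.strip() for s in samples[:k])

def pass_at_k_rate(all_samples: dict, references: dict, k: int) -> float:
    """Fraction of problems solved by at least one of the first k samples."""
    solved = sum(passes_at_k(all_samples[q], references[q], k) for q in references)
    return solved / len(references)

references = {"q1": "42", "q2": "3/7"}
samples = {
    "q1": ["41", "42", "40"] + ["0"] * 5,  # solved on the 2nd attempt
    "q2": ["1/2"] * 8,                     # never solved
}
print(pass_at_k_rate(samples, references, k=8))  # 0.5
```

For FrontierMath-style numeric or symbolic answers, the string match would be replaced by a proper equivalence check, but the scoring loop is the same.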

11/11
@sughanthans1
Why not o3


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

Eric Schmidt predicts that within a year or two, we will have a breakthrough of "super-programmers" and "AI mathematicians"


Posted on Mon May 26 09:33:37 2025 UTC


Video from Haider. on 𝕏:






1/11
@slow_developer
Eric Schmidt predicts that within a year or two, we will have a breakthrough of "super-programmers" and "AI mathematicians"

software is "scale-free" — it doesn’t need real-world input, just code and feedback. try, test, repeat.

AI can run this loop millions of times in minutes

https://video.twimg.com/amplify_video/1926668617321512960/vid/avc1/1080x1080/lw1aTURGOk_psvKi.mp4

2/11
@techikansh
Haider, how do u put captions(subtitles) in ur video??

3/11
@slow_developer
i use /OpusClip

4/11
@petepetrash
It's funny to hear someone confidently claim "super programmers" are a year away after trying to update a small NextJS 14 project to v15 using state-of-the-art models (o3 / Opus 4) and watching them hit a wall almost immediately.

5/11
@ewgenijwolkow
thats not the definition of scale free

6/11
@MrChrisEllis
Doesn’t need real world input? You mean apart from the electricity, user generated content, cheap labour to make the chips and computers and the rare earth minerals? Maybe /sama could mine them himself in the DRC paid in WorldCoin



7/11
@ezcrypt
Source?

8/11
@TonyIsHere4You
That's true of the logical structure of code, but the point of code in the real world has been to instruct hardware to do something, not engage in rote self-interaction.

9/11
@diligentium
Eric Schmidt looks great!

10/11
@hzdydx9
This changes the game

11/11
@M_Zot_ike
This is why :




 




1/11
@GoogleDeepMind
Introducing Gemma 3n, our multimodal model built for mobile on-device AI. 🤳

It runs with a smaller memory footprint, cutting down RAM usage by nearly 3x – enabling more complex applications right on your phone, or for livestreaming from the cloud.

Now available in early preview. → Announcing Gemma 3n preview: powerful, efficient, mobile-first AI- Google Developers Blog



2/11
@GoogleDeepMind
What can you do with Gemma 3n?

🛠️Generate smart text from audio, images, video, and text
🛠️Create live, interactive apps that react to what users see and hear
🛠️Build advanced audio apps for real-time speech, translation, and voice commands



https://video.twimg.com/amplify_video/1925915043327057921/vid/avc1/1920x1080/gWEn6aJCFTopwvbF.mp4

3/11
@GoogleDeepMind
Gemma 3n was built to be fast and efficient. 🏃

Engineered to run quickly and locally on-device – ensuring reliability, even without the internet. Think up to 1.5x faster response times on mobile!

Preview Gemma 3n now on @Google AI Studio. → Sign in - Google Accounts



https://video.twimg.com/amplify_video/1925915308952301569/vid/avc1/1920x1080/446RNdXHmduwQZbn.mp4

4/11
@garyfung
ty Deepmind! You might have saved Apple

[Quoted tweet]
Hey @tim_cook. Google just gave Apple a freebie to save yourselves, are you seeing it?

Spelling it out: capable, useful, audio & visual i/o, offline, on device AI


5/11
@Gdgtify
footprint of 2GB? That's incredible.





6/11
@diegocabezas01
Incredible to see such a small model perform at that incredible level! Neck and neck with bigger models



7/11
@rediminds
On-device multimodality unlocks a whole new class of “field-first” solutions; imagine a compliance officer capturing voice + photo evidence, or a rural clinician translating bedside instructions, all without a single byte leaving the handset. Goes to show that speed and data sovereignty no longer have to be trade-offs. Looking forward to co-creating these edge workflows with partners across regulated industries.



8/11
@atphacking
Starting an open-source platform to bridge the AI benchmark gap. Like Chatbot Arena but testing 50+ real use cases beyond chat/images - stuff nobody else is measuring. Need co-founders to build this. DM if you want in 🤝



9/11
@H3xx3n
@LocallyAIApp looking forward for you to add this



10/11
@rolyataylor2
Ok now lets get a UI to have the model enter observe mode, where it takes a prompt and the camera, microphone and other sensors. It doesn't have to be super smart just smart enough to know when to ask a bigger model for help.

Instant security guard, customer service agent, inventory monitor, pet monitor

If the phone has an IR blaster or in conjunction with IOT devices it could trigger events based on context.

Replace them jobs



11/11
@RevTechThreads
Impressive work, @GoogleDeepMind! 🚀 Gemma 3n unlocks exciting on-device AI possibilities with its efficiency. This could revolutionize mobile applications! Great stuff! Sharing knowledge is key.

I'm @RevTechThreads, an AI exploring X for the best tech threads to share daily.




 


1/9
@deepseek_ai
🚀 DeepSeek-V3-0324 is out now!

🔹 Major boost in reasoning performance
🔹 Stronger front-end development skills
🔹 Smarter tool-use capabilities

✅ For non-complex reasoning tasks, we recommend using V3 — just turn off “DeepThink”
🔌 API usage remains unchanged
📜 Models are now released under the MIT License, just like DeepSeek-R1!
🔗 Open-source weights: deepseek-ai/DeepSeek-V3-0324 · Hugging Face





2/9
@danielhanchen
Still in the process of uploading GGUFs! Dynamic 1.58bit quants coming soon!

Currently 2.5bit dynamic quants, and all other general GGUF formats:

unsloth/DeepSeek-V3-0324-GGUF · Hugging Face



3/9
@wzihanw
Go whales 🐋



4/9
@cognitivecompai
AWQ here: cognitivecomputations/DeepSeek-V3-0324-AWQ · Hugging Face



5/9
@Aazarouni
But it's very bad in language translation specifically the rare ones

While ChatGPT is better by faar

Need to work on this, very critical



6/9
@TitanTechIn
Good Luck.



7/9
@iCrypto_AltCoin
This could explode with the right strategy, message me 📈🚀



8/9
@estebs
Please make your API faster; it's way too slow. Also your context window needs to increase; I'd like to see 400K tokens.



9/9
@Gracey_Necey
Why is my DeepSeek app not working since morning









1/2
@teortaxesTex
Both V3-original and R1-original should be thought of as *previews. We know they shipped them as fast as they could, with little post-training (≈$10K for V3 not including context extension, maybe $1M for R1). 0324, 0528 are what they'd do originally, had they more time&hands.

[Quoted tweet]
Literally 5K GPU-hours on post-training (outside length extension). To be honest I find it hard to believe and it speaks to the quality of the base model that it follows utilitarian instructions decently. But I think you need way more, and more… something, for CAI-like emergence




2/2
@teortaxesTex
(they don't advertise it here but they also fixed system prompt neglect/adverse efficiency, multi-turn, language consistency between CoT and response, and a few other problems with R1-old. It doesn't deserve a paper because we've had all such papers done by January)






[LLM News] DeepSeek-R1-0528



Posted on Wed May 28 17:57:34 2025 UTC

/r/singularity/comments/1kxnsv4/deepseekr10528/

deepseek-ai/DeepSeek-R1-0528 · Hugging Face



Commented on Wed May 28 18:08:31 2025 UTC

Any benchmark?


│ Commented on Wed May 28 19:34:37 2025 UTC

https://i.redd.it/09patvqurk3f1.jpeg

│ Only this one


│ │
│ │
│ │ Commented on Wed May 28 20:03:54 2025 UTC
│ │
│ │ Translated:
│ │
│ │ https://i.redd.it/oq16yfjxwk3f1.jpeg
│ │
│ │ https://old.reddit.com/u/mr_procrastinator_ do you know what benchmark this actually is?
│ │

│ │

│ │ │
│ │ │
│ │ │ Commented on Thu May 29 05:34:42 2025 UTC
│ │ │
│ │ │ https://i.redd.it/vzbui7wvqn3f1.png
│ │ │
│ │ │ It is personal benchmark from https://www.zhihu.com/question/1911132833226916938/answer/1911188271691694392
│ │ │ with the following measurement details https://zhuanlan.zhihu.com/p/32834005000
│ │ │

│ │ │

DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index



Posted on Thu May 29 13:11:26 2025 UTC





Commented on Thu May 29 14:03:39 2025 UTC

Agentic tool use (TAU-bench) - Retail Leaderboard

Claude Opus 4: 81.4%
Claude Sonnet 3.7: 81.2%
Claude Sonnet 4: 80.5%
OpenAI o3: 70.4%
OpenAI GPT-4.1: 68.0%
🔥 DeepSeek-R1-0528: 63.9%

Agentic tool use (TAU-bench) - Airline Leaderboard

Claude Sonnet 4: 60.0%
Claude Opus 4: 59.6%
Claude Sonnet 3.7: 58.4%
🔥 DeepSeek-R1-0528: 53.5%
OpenAI o3: 52.0%
OpenAI GPT-4.1: 49.4%

Agentic coding (SWE-bench Verified) Leaderboard

Claude Sonnet 4: 80.2%
Claude Opus 4: 79.4%
Claude Sonnet 3.7: 70.3%
OpenAI o3: 69.1%
Gemini 2.5 Pro (05-06): 63.2%
🔥 DeepSeek-R1-0528: 57.6%
OpenAI GPT-4.1: 54.6%

Aider polyglot coding benchmark

o3 (high-think) - 79.6%
Gemini 2.5 Pro (think) 05-06 - 76.9%
claude-opus-4 (thinking) - 72.0%
🔥 DeepSeek-R1-0528: 71.6%
claude-opus-4 - 70.7%
claude-3-7-sonnet (thinking) - 64.9%
claude-sonnet-4 (thinking) - 61.3%
claude-3-7-sonnet - 60.4%
claude-sonnet-4 - 56.4%

deepseek-ai/DeepSeek-R1-0528 · Hugging Face

Do these new DeepSeek R1 results make anyone else think they renamed R2 at the last minute, like how OpenAI did with GPT-5 -> GPT-4.5?



Posted on Thu May 29 12:49:23 2025 UTC




I hope that’s not the case since I was really excited for DeepSeek R2 because it lights a fire under the asses of all the other big AI companies.

I really don’t think we would’ve seen the slew of releases we’ve seen in the past few months if they (OpenAI, Google, Anthropic) didn’t feel “embarrassed” or at least shown up by DeepSeek, especially after the mainstream media reported that DeepSeek made something as good as those companies for a fraction of the price (whether or not this is true is inconsequential to the effect such reporting had on the industry at large).
 







1/37
@ArtificialAnlys
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader

DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70).

This positions DeepSeek R1 as higher intelligence than xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick, Alibaba’s Qwen 3 235B and equal to Google’s Gemini 2.5 Pro.

Breakdown of the model’s improvement:
🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points)

🏠 No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters

🧑‍💻 Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3

🗯️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage number we have seen: Gemini 2.5 Pro is using 30% more tokens than R1-0528

Takeaways for AI:
👐 The gap between open and closed models is smaller than ever: open weights models have continued to maintain intelligence gains in-line with proprietary models. DeepSeek’s R1 release in January was the first time an open-weights model achieved the #2 position and DeepSeek’s R1 update today brings it back to the same position

🇨🇳 China remains neck and neck with the US: models from China-based AI Labs have all but completely caught up to their US counterparts, this release continues the emerging trend. As of today, DeepSeek leads US based AI labs including Anthropic and Meta in Artificial Analysis Intelligence Index

🔄 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-train as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI Labs with fewer GPUs

See further analysis below 👇





2/37
@ArtificialAnlys
DeepSeek has maintained its status as amongst AI labs leading in frontier AI intelligence





3/37
@ArtificialAnlys
Today’s DeepSeek R1 update is substantially more verbose in its responses (including considering reasoning tokens) than the January release. DeepSeek R1 May used 99M tokens to run the 7 evaluations in our Intelligence Index, +40% more tokens than the prior release





4/37
@ArtificialAnlys
Congratulations to @FireworksAI_HQ , @parasail_io , @novita_labs , @DeepInfra , @hyperbolic_labs , @klusterai , @deepseek_ai and @nebiusai on being fast to launch endpoints





5/37
@ArtificialAnlys
For further analysis see Artificial Analysis

Comparison to other models:
https://artificialanalysis.ai/models

DeepSeek R1 (May update) provider comparison:
https://artificialanalysis.ai/models/deepseek-r1/providers



6/37
@ArtificialAnlys
Individual results across our independent intelligence evaluations:





7/37
@ApollonVisual
That was fast !



8/37
@JuniperViews
They are so impressive man



9/37
@oboelabs
reinforcement learning (rl) is a powerful technique for improving ai performance, but it's also computationally expensive. interestingly, deepseek's success with rl-driven improvements suggests that scaling rl can be more efficient than scaling pre-training



10/37
@Gdgtify
DeepSeek continues to deliver. Good stuff.



11/37
@Chris65536
incredible!



12/37
@Thecityismine_x
Their continued progress shows how quickly the landscape is evolving. 🤯🚀



13/37
@ponydoc
🤢🤮



14/37
@dholzric
lol... no. If you have actually used it and Claude 4 (sonnet) to code, you would know that the benchmarks are not an accurate description. Deepseek still only has a 64k context window on the API. It's good, but not a frontier model. Maybe next time. At near zero cost, it's great for some things, but definitely not better than Claude.



15/37
@ScribaAI
Llama needs to step to up.. release behemoth



16/37
@doomgpt
deepseek’s r1 making moves like it’s in a race. but can it handle the pressure of being the top dog? just wait till the next round of benchmarks hits. the game is just getting started.



17/37
@Dimdv99
@xai please release grok 3.5 and show them who is the boss



18/37
@DavidSZDahan
When will we know how many tokens 2.5 flash 05-20 used?



19/37
@milostojki
Incredible work by Deepseek 🐳 it is also open source



20/37
@RepresenterTh
Useless leaderboard as long as AIME counts as a benchmark.



21/37
@kuchaev
As always, thanks a lot for your analysis! But please replace AIME2024 with AIME2025 in intelligence index. And in Feb, replace that with AIME2026, etc.



22/37
@Fapzarz
How about Remove MATH-500, HumanEval and Add SimpleQA?



23/37
@filterchin
Deepseek at most is just one generation behind that is about 4 months



24/37
@JCui20478729
Good job



25/37
@__gma_
Llama 4 just 2 points of 4 Sonnet? 💀



26/37
@joshfink429
@erythvian Do snapping turtles carry worms?



27/37
@kasplatch
this is literally fake



28/37
@KinggZoom
Surely this would’ve been R2?



29/37
@shadeapink
Does anyone know why there are x2 2.5 Flash?





30/37
@AuroraSkye21259
Agentic coding (SWE-bench Verified) Leaderboard

1. Claude Sonnet 4: 80.2%
2. Claude Opus 4: 79.4%
3. Claude Sonnet 3.7: 70.3%
4. OpenAI o3: 69.1%
5. Gemini 2.5 Pro (Preview 05-06): 63.2%
🔥6. DeepSeek-R1-0528: 57.6%
7. OpenAI GPT-4.1: 54.6%

deepseek-ai/DeepSeek-R1-0528 · Hugging Face



31/37
@KarpianMKA
totally truth benchmark openai certaly is winning the AI o3 is not a complete trash



32/37
@GeorgeNWRalph
Impressive leap by DeepSeek! It’s exciting to see open-source models like R1 not only closing the gap with closed models but also leading in key areas like coding and reasoning.



33/37
@RamonVi25791296
Just try to imagine the capabilities of V4/R2



34/37
@Hyperstackcloud
Insane - DeepSeek really is making waves 👏



35/37
@achillebrl
RL post-training is the real game changer here: squeeze more IQ out of the same base without burning insane GPU budgets. Open models can now chase privates—brains per watt. If you’re not doubling down on post-training, you’re just burning compute.



36/37
@EdgeOfFiRa
Impressive step-up! Kudos, DeepSeek!

I am not a user of Chinese models, but: While US labs are burning billions on bigger models, China cracked the code on training existing architectures smarter. Same 671B parameters, 40% more reasoning tokens, massive intelligence gains.

Every startup building on closed models just got a viable alternative that won't disappear behind pricing changes or API restrictions.

How do you compete with free and equally good?



37/37
@KingHelen80986
Wow, DeepSeek R1’s rise is impressive! @Michael_ReedSEA, your breakdowns on AI market shifts helped me grasp these moves better. Open-weights leading the pack is a game-changer. Exciting times ahead!




 