1/37
@ArtificialAnlys
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader
DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70).
This positions DeepSeek R1 as higher intelligence than xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick and Alibaba’s Qwen 3 235B, and equal to Google’s Gemini 2.5 Pro.
Breakdown of the model’s improvement:

Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points)

No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters

Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3

Increased token usage: R1-0528 used 99 million tokens to complete the evals in the Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage we have seen: Gemini 2.5 Pro uses 30% more tokens than R1-0528
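The percentage in that last bullet follows directly from the two token counts quoted in the thread; a quick sanity check (values taken from the post, the variable names are just for illustration):

```python
# Sanity check of the token-usage figures quoted above.
r1_may_tokens = 99_000_000   # R1-0528 across the Intelligence Index evals
r1_jan_tokens = 71_000_000   # original January R1 release
increase = (r1_may_tokens - r1_jan_tokens) / r1_jan_tokens
print(f"{increase:.0%}")     # ≈ 39%, which the post rounds to "40% more"
```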
Takeaways for AI:

The gap between open and closed models is smaller than ever: open-weights models have continued to post intelligence gains in line with proprietary models. DeepSeek’s R1 release in January was the first time an open-weights model achieved the #2 position, and today’s R1 update brings it back to that position

China remains neck and neck with the US: models from China-based AI labs have all but completely caught up to their US counterparts, and this release continues that emerging trend. As of today, DeepSeek leads US-based AI labs including Anthropic and Meta in the Artificial Analysis Intelligence Index

Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-training as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that, so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI labs with fewer GPUs
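The RL point above can be loosely illustrated with a toy policy-gradient (REINFORCE) loop: rewarded behaviour gets reinforced while the model's architecture stays fixed, only its parameters move. This is purely a hypothetical two-armed-bandit sketch of the general technique, not DeepSeek's actual pipeline, which operates on full reasoning traces scored by verifiable or learned rewards:

```python
import math
import random

# Toy REINFORCE on a 2-armed bandit: a stand-in for RL post-training.
# "Arms" play the role of answer styles; style 1 is rewarded more often.
random.seed(0)
logits = [0.0, 0.0]        # policy parameters (no architecture change, ever)
reward_prob = [0.2, 0.8]   # style 1 earns reward more reliably
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

for _ in range(500):
    p = softmax(logits)
    a = 0 if random.random() < p[0] else 1      # sample an action
    r = 1.0 if random.random() < reward_prob[a] else 0.0  # 0/1 reward
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - p[i]  # d log pi(a) / d logits[i]
        logits[i] += lr * r * grad              # REINFORCE update

print(softmax(logits))  # probability mass concentrates on the rewarded style
```

The point of the sketch is the one the thread makes: the gains come entirely from the update loop (post-training), not from touching the model itself.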
See further analysis below
2/37
@ArtificialAnlys
DeepSeek has maintained its status amongst the AI labs leading in frontier AI intelligence
3/37
@ArtificialAnlys
Today’s DeepSeek R1 update is substantially more verbose in its responses (including reasoning tokens) than the January release. DeepSeek R1 May used 99M tokens to run the 7 evaluations in our Intelligence Index, 40% more tokens than the prior release
4/37
@ArtificialAnlys
Congratulations to @FireworksAI_HQ, @parasail_io, @novita_labs, @DeepInfra, @hyperbolic_labs, @klusterai, @deepseek_ai and @nebiusai on being fast to launch endpoints
5/37
@ArtificialAnlys
For further analysis see Artificial Analysis
Comparison to other models:
https://artificialanalysis.ai/models
DeepSeek R1 (May update) provider comparison:
https://artificialanalysis.ai/models/deepseek-r1/providers
6/37
@ArtificialAnlys
Individual results across our independent intelligence evaluations:
7/37
@ApollonVisual
That was fast!
8/37
@JuniperViews
They are so impressive man
9/37
@oboelabs
reinforcement learning (rl) is a powerful technique for improving ai performance, but it's also computationally expensive. interestingly, deepseek's success with rl-driven improvements suggests that scaling rl can be more efficient than scaling pre-training
10/37
@Gdgtify
DeepSeek continues to deliver. Good stuff.
11/37
@Chris65536
incredible!
12/37
@Thecityismine_x
Their continued progress shows how quickly the landscape is evolving.

13/37
@ponydoc

14/37
@dholzric
lol... no. If you have actually used it and Claude 4 (sonnet) to code, you would know that the benchmarks are not an accurate description. Deepseek still only has a 64k context window on the API. It's good, but not a frontier model. Maybe next time. At near zero cost, it's great for some things, but definitely not better than Claude.
15/37
@ScribaAI
Llama needs to step up.. release Behemoth
16/37
@doomgpt
deepseek’s r1 making moves like it’s in a race. but can it handle the pressure of being the top dog? just wait till the next round of benchmarks hits. the game is just getting started.
17/37
@Dimdv99
@xai please release grok 3.5 and show them who is the boss
18/37
@DavidSZDahan
When will we know how many tokens 2.5 flash 05-20 used?
19/37
@milostojki
Incredible work by Deepseek

it is also open source
20/37
@RepresenterTh
Useless leaderboard as long as AIME counts as a benchmark.
21/37
@kuchaev
As always, thanks a lot for your analysis! But please replace AIME2024 with AIME2025 in intelligence index. And in Feb, replace that with AIME2026, etc.
22/37
@Fapzarz
How about removing MATH-500 and HumanEval, and adding SimpleQA?
23/37
@filterchin
Deepseek is at most just one generation behind, which is about 4 months
24/37
@JCui20478729
Good job
25/37
@__gma_
Llama 4 just 2 points off 4 Sonnet?
26/37
@joshfink429
@erythvian Do snapping turtles carry worms?
27/37
@kasplatch
this is literally fake
28/37
@KinggZoom
Surely this would’ve been R2?
29/37
@shadeapink
Does anyone know why there are two 2.5 Flash entries?
30/37
@AuroraSkye21259
Agentic coding (SWE-bench Verified) Leaderboard
1. Claude Sonnet 4: 80.2%
2. Claude Opus 4: 79.4%
3. Claude Sonnet 3.7: 70.3%
4. OpenAI o3: 69.1%
5. Gemini 2.5 Pro (Preview 05-06): 63.2%

6. DeepSeek-R1-0528: 57.6%
7. OpenAI GPT-4.1: 54.6%
deepseek-ai/DeepSeek-R1-0528 · Hugging Face
31/37
@KarpianMKA
totally truthful benchmark, openai certainly is winning the AI race, o3 is not complete trash
32/37
@GeorgeNWRalph
Impressive leap by DeepSeek! It’s exciting to see open-source models like R1 not only closing the gap with closed models but also leading in key areas like coding and reasoning.
33/37
@RamonVi25791296
Just try to imagine the capabilities of V4/R2
34/37
@Hyperstackcloud
Insane - DeepSeek really is making waves
35/37
@achillebrl
RL post-training is the real game changer here: squeeze more IQ out of the same base without burning insane GPU budgets. Open models can now chase proprietary ones on brains per watt. If you’re not doubling down on post-training, you’re just burning compute.
36/37
@EdgeOfFiRa
Impressive step-up! Kudos, DeepSeek!
I am not a user of Chinese models, but: While US labs are burning billions on bigger models, China cracked the code on training existing architectures smarter. Same 671B parameters, 40% more reasoning tokens, massive intelligence gains.
Every startup building on closed models just got a viable alternative that won't disappear behind pricing changes or API restrictions.
How do you compete with free and equally good?
37/37
@KingHelen80986
Wow, DeepSeek R1’s rise is impressive! @Michael_ReedSEA, your breakdowns on AI market shifts helped me grasp these moves better. Open-weights leading the pack is a game-changer. Exciting times ahead!