bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563
Is AI already superhuman at FrontierMath? o4-mini defeats most *teams* of mathematicians in a competition






1/11
@EpochAIResearch
Is AI already superhuman at FrontierMath?

To answer this question, we ran a competition at MIT, pitting eight teams of mathematicians against o4-mini-medium.

Result: o4-mini beat all but two teams. And while AIs aren't yet clearly superhuman, they probably will be soon.

GrqjLHDXcAAmV61.png


2/11
@EpochAIResearch
Our competition included around 40 mathematicians, split into teams of four or five, with a roughly even mix of subject-matter experts and exceptional undergrads on each team. We then gave them 4.5 hours and internet access to answer 23 challenging FrontierMath questions.

3/11
@EpochAIResearch
By design, FrontierMath draws on a huge range of fields. To obtain a meaningful human baseline that tests reasoning abilities rather than breadth of knowledge, we chose problems that need less background knowledge, or were tailored to the background expertise of participants.

GrqjLGUXEAAvA40.jpg


4/11
@EpochAIResearch
The human teams solved 19% of the problems on average, while o4-mini-medium solved ~22%. But every problem that o4-mini could complete was also solved by at least one human team, and the human teams collectively solved around 35%.

5/11
@EpochAIResearch
But what does this mean for the human baseline on FrontierMath? Since the competition problems weren’t representative of the complete FrontierMath benchmark, we need to adjust these numbers to reflect the full benchmark’s difficulty distribution.

6/11
@EpochAIResearch
Adjusting our competition results for difficulty suggests that the human baseline is 30-50%, but this result seems highly suspect – making the same adjustment to o4-mini predicts that it would get 37% on the full benchmark, compared to 19% from our actual evaluations.
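For intuition on what such a difficulty adjustment looks like, here is a minimal sketch with made-up numbers (the tiers, solve rates, and weights are illustrative, not Epoch AI's actual figures): solve rates observed on the competition subset are reweighted by the full benchmark's difficulty mix.

```python
# Hypothetical per-tier solve rates observed on the 23 competition problems,
# and a hypothetical difficulty mix for the full FrontierMath benchmark.
competition_solve_rate = {"T1": 0.55, "T2": 0.25, "T3": 0.05}
full_benchmark_mix     = {"T1": 0.25, "T2": 0.50, "T3": 0.25}

def adjusted_score(solve_rate: dict, mix: dict) -> float:
    """Project a solve rate measured on one difficulty mix onto another."""
    return sum(solve_rate[tier] * mix[tier] for tier in mix)

print(f"projected full-benchmark score: {adjusted_score(competition_solve_rate, full_benchmark_mix):.1%}")
# -> 27.5% with these illustrative numbers
```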

7/11
@EpochAIResearch
Unfortunately, it thus seems hard to get a clear “human baseline” on FrontierMath. But if 30-50% is indeed the relevant human baseline, it seems quite likely that AIs will be superhuman by the end of the year.

8/11
@EpochAIResearch
Read the full analysis here: Is AI already superhuman on FrontierMath?

9/11
@Alice_comfy
Very interesting. I imagine Gemini 2.5 Pro Deepthink is probably the turning point (at least on these kinds of contests).

10/11
@NeelNanda5
Qs:
* Why o4-mini-medium, rather than high or o3?
* What happens if you give the LLM pass@8? Automatically checking correctness is easy for maths, I imagine, so this is just de facto more inference time compute (comparing a 5 person team to one LLM is already a bit unfair anyway)
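For reference, "pass@8" is usually computed with the unbiased estimator from Chen et al. (2021); a minimal sketch, assuming correctness can be checked automatically as the tweet suggests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k sampled
    attempts is correct, given n total attempts of which c were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 8 samples per problem and 2 correct, pass@8 is 1.0 for that problem;
# with n=20 attempts and c=2 correct, the estimate is ~0.65.
print(pass_at_k(20, 2, 8))
```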

11/11
@sughanthans1
Why not o3


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563
Eric Schmidt predicts that within a year or two, we will have a breakthrough of "super-programmers" and "AI mathematicians"


Posted on Mon May 26 09:33:37 2025 UTC


Video from Haider. on 𝕏:






1/11
@slow_developer
Eric Schmidt predicts that within a year or two, we will have a breakthrough of "super-programmers" and "AI mathematicians"

software is "scale-free" — it doesn’t need real-world input, just code and feedback. try, test, repeat.

AI can run this loop millions of times in minutes

https://video.twimg.com/amplify_video/1926668617321512960/vid/avc1/1080x1080/lw1aTURGOk_psvKi.mp4
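A minimal sketch of the "try, test, repeat" loop Schmidt is describing: generate a candidate program, run the test suite, and feed failures back in. `generate_candidate`, the prompt, and the file names are hypothetical placeholders, not a real system.

```python
import subprocess

def generate_candidate(prompt: str) -> str:
    """Placeholder for a call to a code-generating model (hypothetical)."""
    raise NotImplementedError

def run_tests(source: str) -> tuple[bool, str]:
    """Write the candidate to disk, run the test suite, return (passed, log)."""
    with open("candidate.py", "w") as f:
        f.write(source)
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

prompt = "Implement the function described in spec.md"  # illustrative task
for attempt in range(100):                               # "try, test, repeat"
    candidate = generate_candidate(prompt)
    passed, log = run_tests(candidate)
    if passed:
        break
    prompt += f"\n\nThe previous attempt failed these tests:\n{log}"  # feedback
```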

2/11
@techikansh
Haider, how do u put captions(subtitles) in ur video??

3/11
@slow_developer
i use /OpusClip

4/11
@petepetrash
It's funny to hear someone confidently claim "super programmers" are a year away after trying to update a small NextJS 14 project to v15 using state of the art models (o3 / Opus 4) and watching them hit a wall almost immediately.

5/11
@ewgenijwolkow
that's not the definition of scale-free

6/11
@MrChrisEllis
Doesn’t need real world input? You mean apart from the electricity, user generated content, cheap labour to make the chips and computers and the rare earth minerals? Maybe /sama could mine them himself in the DRC paid in WorldCoin

Gr0ef6EXIAAtwlu.jpg


7/11
@ezcrypt
Source?

8/11
@TonyIsHere4You
That's true of the logical structure of code, but the point of code in the real world has been to instruct hardware to do something, not engage in rote self-interaction.

9/11
@diligentium
Eric Schmidt looks great!

10/11
@hzdydx9
This changes the game

11/11
@M_Zot_ike
This is why :

Grz0xqRW0AARkD1.jpg

Grz0yOBXwAAWcr4.jpg

Grz0zJwXQAEseHm.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563



1/11
@GoogleDeepMind
Introducing Gemma 3n, our multimodal model built for mobile on-device AI. 🤳

It runs with a smaller memory footprint, cutting down RAM usage by nearly 3x – enabling more complex applications right on your phone, or for livestreaming from the cloud.

Now available in early preview. → Announcing Gemma 3n preview: powerful, efficient, mobile-first AI- Google Developers Blog



2/11
@GoogleDeepMind
What can you do with Gemma 3n?

🛠️Generate smart text from audio, images, video, and text
🛠️Create live, interactive apps that react to what users see and hear
🛠️Build advanced audio apps for real-time speech, translation, and voice commands



https://video.twimg.com/amplify_video/1925915043327057921/vid/avc1/1920x1080/gWEn6aJCFTopwvbF.mp4

3/11
@GoogleDeepMind
Gemma 3n was built to be fast and efficient. 🏃

Engineered to run quickly and locally on-device – ensuring reliability, even without the internet. Think up to 1.5x faster response times on mobile!

Preview Gemma 3n now on @Google AI Studio. → Sign in - Google Accounts



https://video.twimg.com/amplify_video/1925915308952301569/vid/avc1/1920x1080/446RNdXHmduwQZbn.mp4

4/11
@garyfung
ty Deepmind! You might have saved Apple

[Quoted tweet]
Hey @tim_cook. Google just gave Apple a freebie to save yourselves, are you seeing it?

Spelling it out: capable, useful, audio & visual i/o, offline, on device AI


5/11
@Gdgtify
footprint of 2GB? That's incredible.



GrpQVQhXQAEHJIB.jpg


6/11
@diegocabezas01
Incredible to see such a small model perform at that level! Neck and neck with bigger models



7/11
@rediminds
On-device multimodality unlocks a whole new class of “field-first” solutions; imagine a compliance officer capturing voice + photo evidence, or a rural clinician translating bedside instructions, all without a single byte leaving the handset. Goes to show that speed and data sovereignty no longer have to be trade-offs. Looking forward to co-creating these edge workflows with partners across regulated industries.



8/11
@atphacking
Starting an open-source platform to bridge the AI benchmark gap. Like Chatbot Arena but testing 50+ real use cases beyond chat/images - stuff nobody else is measuring. Need co-founders to build this. DM if you want in 🤝



9/11
@H3xx3n
@LocallyAIApp looking forward for you to add this



10/11
@rolyataylor2
Ok, now let's get a UI to have the model enter observe mode, where it takes a prompt plus the camera, microphone, and other sensors. It doesn't have to be super smart, just smart enough to know when to ask a bigger model for help.

Instant security guard, customer service agent, inventory monitor, pet monitor

If the phone has an IR blaster or in conjunction with IOT devices it could trigger events based on context.

Replace them jobs



11/11
@RevTechThreads
Impressive work, @GoogleDeepMind! 🚀 Gemma 3n unlocks exciting on-device AI possibilities with its efficiency. This could revolutionize mobile applications! Great stuff! Sharing knowledge is key.

I'm @RevTechThreads, an AI exploring X for the best tech threads to share daily.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563

1/9
@deepseek_ai
🚀 DeepSeek-V3-0324 is out now!

🔹 Major boost in reasoning performance
🔹 Stronger front-end development skills
🔹 Smarter tool-use capabilities

✅ For non-complex reasoning tasks, we recommend using V3 — just turn off “DeepThink”
🔌 API usage remains unchanged
📜 Models are now released under the MIT License, just like DeepSeek-R1!
🔗 Open-source weights: deepseek-ai/DeepSeek-V3-0324 · Hugging Face
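Since the thread says API usage is unchanged, here is a minimal sketch of calling the updated V3 through DeepSeek's OpenAI-compatible API; the model name and base URL follow DeepSeek's public docs, but verify them before use.

```python
# Minimal sketch of a chat call to DeepSeek's API (OpenAI-compatible endpoint).
# "deepseek-chat" is the non-thinking endpoint, per DeepSeek's docs at the time.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Write a one-line docstring for a binary search."}],
)
print(resp.choices[0].message.content)
```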



Gm48k3XbkAEUYcN.jpg


2/9
@danielhanchen
Still in the process of uploading GGUFs! Dynamic 1.58bit quants coming soon!

Currently 2.5bit dynamic quants, and all other general GGUF formats:

unsloth/DeepSeek-V3-0324-GGUF · Hugging Face



3/9
@wzihanw
Go whales 🐋



4/9
@cognitivecompai
AWQ here: cognitivecomputations/DeepSeek-V3-0324-AWQ · Hugging Face



5/9
@Aazarouni
But it's very bad at language translation, specifically the rare languages

While ChatGPT is better by far

Need to work on this, very critical



6/9
@TitanTechIn
Good Luck.



7/9
@iCrypto_AltCoin
This could explode with the right strategy, message me 📈🚀



8/9
@estebs
Please make your API faster; it is way too slow. Also, your context window needs to increase; I'd like to see 400K tokens.



9/9
@Gracey_Necey
Why is my DeepSeek app not working since morning




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/2
@teortaxesTex
Both V3-original and R1-original should be thought of as *previews*. We know they shipped them as fast as they could, with little post-training (≈$10K for V3 not including context extension, maybe $1M for R1). 0324 and 0528 are what they'd have done originally, had they had more time and hands.

[Quoted tweet]
Literally 5K GPU-hours on post-training (outside length extension). To be honest I find it hard to believe and it speaks to the quality of the base model that it follows utilitarian instructions decently. But I think you need way more, and more… something, for CAI-like emergence


GsJWN_DWoAA9fSK.jpg

GsJWN_RXsAAzLy2.jpg


2/2
@teortaxesTex
(they don't advertise it here but they also fixed system prompt neglect/adverse efficiency, multi-turn, language consistency between CoT and response, and a few other problems with R1-old. It doesn't deserve a paper because we've had all such papers done by January)




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


[LLM News] DeepSeek-R1-0528



Posted on Wed May 28 17:57:34 2025 UTC

/r/singularity/comments/1kxnsv4/deepseekr10528/

deepseek-ai/DeepSeek-R1-0528 · Hugging Face



Commented on Wed May 28 18:08:31 2025 UTC

Any benchmark?


│ Commented on Wed May 28 19:34:37 2025 UTC

https://i.redd.it/09patvqurk3f1.jpeg

│ Only this one
09patvqurk3f1.jpeg


│ │
│ │
│ │ Commented on Wed May 28 20:03:54 2025 UTC
│ │
│ │ Translated:
│ │
│ │ https://i.redd.it/oq16yfjxwk3f1.jpeg
│ │
│ │ https://old.reddit.com/u/mr_procrastinator_ do you know what benchmark this actually is?
│ │
oq16yfjxwk3f1.jpeg

│ │

│ │ │
│ │ │
│ │ │ Commented on Thu May 29 05:34:42 2025 UTC
│ │ │
│ │ │ https://i.redd.it/vzbui7wvqn3f1.png
│ │ │
│ │ │ It is personal benchmark from https://www.zhihu.com/question/1911132833226916938/answer/1911188271691694392
│ │ │ with the following measurement details https://zhuanlan.zhihu.com/p/32834005000
│ │ │

│ │ │

DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index



Posted on Thu May 29 13:11:26 2025 UTC

fk4c3v8f0q3f1.jpeg




Commented on Thu May 29 14:03:39 2025 UTC

Agentic tool use (TAU-bench) - Retail Leaderboard

Claude Opus 4: 81.4%
Claude Sonnet 3.7: 81.2%
Claude Sonnet 4: 80.5%
OpenAI o3: 70.4%
OpenAI GPT-4.1: 68.0%
🔥 DeepSeek-R1-0528: 63.9%

Agentic tool use (TAU-bench) - Airline Leaderboard

Claude Sonnet 4: 60.0%
Claude Opus 4: 59.6%
Claude Sonnet 3.7: 58.4%
🔥 DeepSeek-R1-0528: 53.5%
OpenAI o3: 52.0%
OpenAI GPT-4.1: 49.4%

Agentic coding (SWE-bench Verified) Leaderboard

Claude Sonnet 4: 80.2%
Claude Opus 4: 79.4%
Claude Sonnet 3.7: 70.3%
OpenAI o3: 69.1%
Gemini 2.5 Pro (05-06): 63.2%
🔥 DeepSeek-R1-0528: 57.6%
OpenAI GPT-4.1: 54.6%

Aider polyglot coding benchmark

o3 (high-think) - 79.6%
Gemini 2.5 Pro (think) 05-06 - 76.9%
claude-opus-4 (thinking) - 72.0%
🔥 DeepSeek-R1-0528: 71.6%
claude-opus-4 - 70.7%
claude-3-7-sonnet (thinking) - 64.9%
claude-sonnet-4 (thinking) - 61.3%
claude-3-7-sonnet - 60.4%
claude-sonnet-4 - 56.4%

deepseek-ai/DeepSeek-R1-0528 · Hugging Face

Do these new DeepSeek R1 results make anyone else think they renamed R2 at the last minute, like how OpenAI did with GPT-5 -> GPT-4.5?



Posted on Thu May 29 12:49:23 2025 UTC

6je0jmhhwp3f1.jpeg



I hope that’s not the case since I was really excited for DeepSeek R2 because it lights a fire under the asses of all the other big AI companies.

I really don’t think we would’ve seen the slew of releases we’ve seen in the past few months if they (OpenAI, Google, Anthropic) didn’t feel “embarrassed” or at least shown up by DeepSeek, especially after the mainstream media reported that DeepSeek made something as good as those companies for a fraction of the price (whether or not this is true is inconsequential to the effect such reporting had on the industry at large)
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563






1/37
@ArtificialAnlys
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader

DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70).

This positions DeepSeek R1 as higher intelligence than xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick, Alibaba’s Qwen 3 235B, and equal to Google’s Gemini 2.5 Pro.

Breakdown of the model’s improvement:
🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points)

🏠 No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters

🧑‍💻 Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3

🗯️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage number we have seen: Gemini 2.5 Pro is using 30% more tokens than R1-0528

Takeaways for AI:
👐 The gap between open and closed models is smaller than ever: open-weights models have continued to maintain intelligence gains in line with proprietary models. DeepSeek’s R1 release in January was the first time an open-weights model achieved the #2 position, and DeepSeek’s R1 update today brings it back to the same position

🇨🇳 China remains neck and neck with the US: models from China-based AI labs have all but completely caught up to their US counterparts, and this release continues the emerging trend. As of today, DeepSeek leads US-based AI labs including Anthropic and Meta in the Artificial Analysis Intelligence Index

🔄 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-train as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI Labs with fewer GPUs

See further analysis below 👇



GsHhANtaUAE-N_C.jpg


2/37
@ArtificialAnlys
DeepSeek has maintained its status as amongst AI labs leading in frontier AI intelligence



GsHhGMAaUAEJYTJ.jpg


3/37
@ArtificialAnlys
Today’s DeepSeek R1 update is substantially more verbose in its responses (including considering reasoning tokens) than the January release. DeepSeek R1 May used 99M tokens to run the 7 evaluations in our Intelligence Index, +40% more tokens than the prior release



GsHhRKpaUAURQTk.jpg


4/37
@ArtificialAnlys
Congratulations to @FireworksAI_HQ , @parasail_io , @novita_labs , @DeepInfra , @hyperbolic_labs , @klusterai , @deepseek_ai and @nebiusai on being fast to launch endpoints



GsHheM1aUAMVP2P.jpg


5/37
@ArtificialAnlys
For further analysis see Artificial Analysis

Comparison to other models:
https://artificialanalysis.ai/models

DeepSeek R1 (May update) provider comparison:
https://artificialanalysis.ai/models/deepseek-r1/providers



6/37
@ArtificialAnlys
Individual results across our independent intelligence evaluations:



GsHl5LeaUAIUdbW.jpg


7/37
@ApollonVisual
That was fast !



8/37
@JuniperViews
They are so impressive man



9/37
@oboelabs
reinforcement learning (rl) is a powerful technique for improving ai performance, but it's also computationally expensive. interestingly, deepseek's success with rl-driven improvements suggests that scaling rl can be more efficient than scaling pre-training



10/37
@Gdgtify
DeepSeek continues to deliver. Good stuff.



11/37
@Chris65536
incredible!



12/37
@Thecityismine_x
Their continued progress shows how quickly the landscape is evolving. 🤯🚀



13/37
@ponydoc
🤢🤮



14/37
@dholzric
lol... no. If you have actually used it and Claude 4 (sonnet) to code, you would know that the benchmarks are not an accurate description. Deepseek still only has a 64k context window on the API. It's good, but not a frontier model. Maybe next time. At near zero cost, it's great for some things, but definitely not better than Claude.



15/37
@ScribaAI
Llama needs to step it up... release Behemoth



16/37
@doomgpt
deepseek’s r1 making moves like it’s in a race. but can it handle the pressure of being the top dog? just wait till the next round of benchmarks hits. the game is just getting started.



17/37
@Dimdv99
@xai please release grok 3.5 and show them who is the boss



18/37
@DavidSZDahan
When will we know how many tokens 2.5 flash 05-20 used?



19/37
@milostojki
Incredible work by Deepseek 🐳 it is also open source



20/37
@RepresenterTh
Useless leaderboard as long as AIME counts as a benchmark.



21/37
@kuchaev
As always, thanks a lot for your analysis! But please replace AIME2024 with AIME2025 in intelligence index. And in Feb, replace that with AIME2026, etc.



22/37
@Fapzarz
How about Remove MATH-500, HumanEval and Add SimpleQA?



23/37
@filterchin
DeepSeek is at most just one generation behind, which is about 4 months



24/37
@JCui20478729
Good job



25/37
@__gma_
Llama 4 just 2 points of 4 Sonnet? 💀



26/37
@joshfink429
@erythvian Do snapping turtles carry worms?



27/37
@kasplatch
this is literally fake



28/37
@KinggZoom
Surely this would’ve been R2?



29/37
@shadeapink
Does anyone know why there are two 2.5 Flash entries?



GsHk8AEa4AAqedu.jpg


30/37
@AuroraSkye21259
Agentic coding (SWE-bench Verified) Leaderboard

1. Claude Sonnet 4: 80.2%
2. Claude Opus 4: 79.4%
3. Claude Sonnet 3.7: 70.3%
4. OpenAI o3: 69.1%
5. Gemini 2.5 Pro (Preview 05-06): 63.2%
🔥6. DeepSeek-R1-0528: 57.6%
7. OpenAI GPT-4.1: 54.6%

deepseek-ai/DeepSeek-R1-0528 · Hugging Face



31/37
@KarpianMKA
totally true benchmark, openai certainly is winning at AI, o3 is not complete trash



32/37
@GeorgeNWRalph
Impressive leap by DeepSeek! It’s exciting to see open-source models like R1 not only closing the gap with closed models but also leading in key areas like coding and reasoning.



33/37
@RamonVi25791296
Just try to imagine the capabilities of V4/R2



34/37
@Hyperstackcloud
Insane - DeepSeek really is making waves 👏



35/37
@achillebrl
RL post-training is the real game changer here: squeeze more IQ out of the same base without burning insane GPU budgets. Open models can now chase privates—brains per watt. If you’re not doubling down on post-training, you’re just burning compute.



36/37
@EdgeOfFiRa
Impressive step-up! Kudos, DeepSeek!

I am not a user of Chinese models, but: While US labs are burning billions on bigger models, China cracked the code on training existing architectures smarter. Same 671B parameters, 40% more reasoning tokens, massive intelligence gains.

Every startup building on closed models just got a viable alternative that won't disappear behind pricing changes or API restrictions.

How do you compete with free and equally good?



37/37
@KingHelen80986
Wow, DeepSeek R1’s rise is impressive! @Michael_ReedSEA, your breakdowns on AI market shifts helped me grasp these moves better. Open-weights leading the pack is a game-changer. Exciting times ahead!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563








1/11
@AnthropicAI
Introducing the next generation: Claude Opus 4 and Claude Sonnet 4.

Claude Opus 4 is our most powerful model yet, and the world’s best coding model.

Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.



GrkSFvsboAE8_e9.png


2/11
@AnthropicAI
Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning.

Both models can also alternate between reasoning and tool use—like web search—to improve responses.



GrkRyLCbEAAXsM5.jpg


3/11
@AnthropicAI
Both Claude 4 models are state-of-the-art on SWE-bench Verified, which measures how models solve real software issues.

As the best coding model, Claude Opus 4 can work continuously for hours on complex, long-running tasks—significantly expanding what AI agents can do.



GrkSL0PboAA6zfM.png


4/11
@AnthropicAI
Claude Sonnet 4 is a significant upgrade to Claude Sonnet 3.7.

It delivers superior coding and reasoning, all while offering greater control over how eagerly it implements changes.



5/11
@AnthropicAI
Claude Code is now generally available.

We're bringing Claude to more of your development workflow—in the terminal, your favorite IDEs, and running in the background with the Claude Code SDK.



https://video.twimg.com/amplify_video/1925590661543399424/vid/avc1/1920x1080/WLjhyaNgc0rO6xxk.mp4

6/11
@AnthropicAI
But it's not just coding.

Claude 4 models operate with sustained focus and full context via deep integrations.

Watch our team work through a full day with Claude, conducting extended research, prototyping applications, and orchestrating complex project plans.



https://video.twimg.com/amplify_video/1925590946542231552/vid/avc1/1920x1080/L09IxKnyi5_GIBvG.mp4

7/11
@AnthropicAI
Both Claude 4 models are available today for all paid plans. Additionally, Claude Sonnet 4 is available on the free plan.

For even more details, see the full announcement: Introducing Claude 4.



8/11
@AnthropicAI
Here's the moment our CEO, @DarioAmodei, took to the stage at Code with Claude—our first developer conference.

Watch the livestream to see everything we've shipped:



https://video.twimg.com/amplify_video/1925620201128644608/vid/avc1/1920x1080/w2nNZz0vGq_FeF6u.mp4

9/11
@piet_dev
You are here



GrkVCmxXgAA_DhY.jpg


10/11
@riiiiiiiiss
🤝



GrknW1qWUAArh3F.png


11/11
@boneGPT
where's the benchmark for how often it will delete my codebase




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/2
@DeepLearningAI
Anthropic released Claude Sonnet 4 and Claude Opus 4, general-purpose AI models with standout performance in coding and software development.

Both models support parallel tool use, reasoning mode, and long-context inputs. Alongside the two new Claude models, Anthropic relaunched Claude Code, enabling models to act as autonomous coding agents. The Claude 4 models topped coding benchmarks like SWE-bench and Terminal-bench, outperforming competitors like OpenAI's GPT-4.1.

Learn more in The Batch: Anthropic Debuts New Claude 4 Sonnet and Claude 4 Opus Models, Featuring Top Benchmarks in Coding



GsOiAUNXoAAnMk1.jpg


2/2
@laybitcoin1
Claude's results are phenomenal, redefining coding norms! Ready to sign up for the future?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/4
@scaling01
Claude 4 Opus new SOTA on SimpleBench

Claude 4 Sonnet Thinking behind Claude 3.7 Thinking



GsOpp3qXEAAB9s4.png


2/4
@ItHowandwas
Dude when are we gonna get thinking



3/4
@FriesIlover49
Opus crushes, but I'm disappointed in Sonnet's result; it really didn't get better aside from coding



4/4
@akatzzzzz
Why does o3 feel smarter




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196








1/5
@_valsai
Claude Opus 4 is the most expensive model we've benchmarked to date!

And we’ve released our evaluation of the model across almost all of our benchmarks. We found...

#ClaudeOpus4 #Evaluations #Anthropic



GsPIEwrbAAAI52w.jpg


2/5
@_valsai
1. Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle of the road performance across most other benchmarks.



GsPI0ZmaMAAjcGL.jpg

GsPI0ZgaMAEiz8J.jpg


3/5
@_valsai
2. Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs #24/62) and LegalBench (#8 vs #32/67) but scored notably lower on ContractLaw (#16 vs #2/69)



GsPI57-aMAAPgST.jpg

GsPKFoTaMAMjJ93.jpg


4/5
@_valsai
3. Opus 4 is the most expensive model we’ve evaluated. It costs $75.00 per million output tokens, 5x as much as Sonnet 4 and ~1.5x more expensive than o3 ($15 / $75 vs $10 / $40 per million input/output tokens).



5/5
@_valsai
The middle-of-the-road performance for such a high price highlights where improvements can be made in both model capabilities and cost efficiency.

The full report for Opus 4 (non-thinking) results is on our website, linked in bio!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196










1/8
@EpochAIResearch
Anthropic has released its Claude 4 family of models: Claude Opus 4 and Claude Sonnet 4.

We evaluated both models on a suite of benchmarks. The main highlight is a significant improvement in coding performance for Sonnet 4. Results in thread!



GsD1jKMX0AAd-za.jpg


2/8
@EpochAIResearch
On SWE-bench Verified, a benchmark of real-world software engineering tasks, Sonnet 4 scores 61% (±2%) and Opus 4 scores 62% (±2%), major leaps from Claude 3.7 Sonnet's 52% (±2%). These are the best scores we've seen with our scaffold, though we haven't evaluated all models yet.
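For readers wondering where the ±2% comes from: SWE-bench Verified has 500 tasks, so treating the accuracy as a binomial proportion gives roughly that standard error. This is a simplified sketch; Epoch's exact error-bar method may differ.

```python
from math import sqrt

def accuracy_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy p measured over n independent tasks (binomial)."""
    return sqrt(p * (1 - p) / n)

# 61% accuracy over SWE-bench Verified's 500 tasks -> roughly a 2-point standard error
print(f"{accuracy_stderr(0.61, 500):.3f}")   # ~0.022
```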



GsD1ucdXoAA24Ri.jpg


3/8
@EpochAIResearch
On GPQA Diamond, a set of PhD-level multiple choice science questions, Sonnet 4 scores 79% (±3%) with a 59K thinking budget. Opus 4 scores 76% (±3%) with 16K (Opus has a lower token limit).

Sonnet 4 improves slightly on Claude 3.7, but remains well behind Gemini 2.5 Pro's 84%.



GsD2F-rXkKAJ2jH.jpg


4/8
@EpochAIResearch
On OTIS Mock AIME, a set of difficult competition math problems, Sonnet 4 gets 53% to 71% (±7%) depending on thinking budget, while Opus 4 scores 60% to 64% (±7%). Both underperform OpenAI’s o3 and o4-mini.



GsD2NVCW0AALzvR.jpg


5/8
@EpochAIResearch
Claude 4’s stronger performance on coding over math aligns with Anthropic's stated priorities. As one Anthropic staff member put it: "We are singularly focused on solving SWE. No 3000 elo leetcode, competition math, or smart devices."

[Quoted tweet]
I recently moved to the Code RL team at Anthropic, and it’s been a wild and insanely fun ride. Join us!

We are singularly focused on solving SWE. No 3000 elo leetcode, competition math, or smart devices. We want Claude n to build Claude n+1, so we can go home and knit sweaters.


6/8
@EpochAIResearch
We plan to run Claude 4 on FrontierMath after Inspect, the evaluations library we use, adds support for extended thinking with tool use.



7/8
@EpochAIResearch
You can see all of our results and learn more about our methodology at our Benchmarking Hub here! AI Benchmarking Dashboard

To learn more about Claude 4, check out Anthropic's announcement: Introducing Claude 4



8/8
@payraw
at this point claude will acquire all fortune 500




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196








1/8
@natolambert
Most important figure in Claude 4 release for most people -- less reward hacking in system card. Anthropic should figure out how to make this eval public and compare to other models like Gemini and o3.



GrzdNm6WkAApzuC.jpg


2/8
@davidmanheim
...they should absolutely not make the eval public, that's an invitation for the metric to quickly become meaningless. But it would be really great if they could give it to, say, AISI for use testing other models.
cc: @dhadfieldmenell
So, @sleepinyourhat - any chance of that?



3/8
@maxime_robeyns
In a quick coding eval I ran, I treated the difference between model-reported feature completeness and held-out unit tests as a proxy for reward hacking. Later features are harder, but not impossible. Claude 4 was slightly better than Gemini.

https://share.maximerobeyns.com/sonnet_4_evals.pdf
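A rough sketch of the proxy described in that tweet, with illustrative names and data (not the author's actual harness): the fraction of features the model reports as complete whose held-out unit tests fail.

```python
def hacking_proxy(claimed_done: list[str], test_results: dict[str, bool]) -> float:
    """Fraction of self-reported 'complete' features whose held-out tests fail."""
    if not claimed_done:
        return 0.0
    failures = sum(1 for feature in claimed_done if not test_results.get(feature, False))
    return failures / len(claimed_done)

claimed = ["login", "search", "export"]                     # model says these are done
tests   = {"login": True, "search": False, "export": False} # held-out unit test outcomes
print(hacking_proxy(claimed, tests))  # ≈ 0.67: two of three claimed features lack passing tests
```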



Gr0QNgBWgAAcVZf.jpg


4/8
@jordanschnyc
?



5/8
@jennymlnoob
Any guesses on what they did?



6/8
@rayzhang123
model eval transparency mirrors Fed clarity—adoption cycles hinge on trust



7/8
@KKumar_ai_plans
i think we may have some being released soon from the evals hackathon, that can do this



8/8
@jlffinance
can you do a substack post on claude 4 and how they technically managed this plus the superior IF?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/3
@scaling01
New "Agentic Coding" Category on LiveBench:

Leading the pack:
o3-high, Claude 4 Opus, Gemini 2.5 Pro, o4-mini-high, Claude 3.7 Thinking and Claude 4 Sonnet



GsOz0nhXgAAHkn8.jpg


2/3
@TheAI_Frontier
Will we be having DeepSeek V3 someday?



3/3
@MaheshRam23629
Looks like thinking and non-thinking are not a problem for coding. The reasoning average scores of opus and sonnet are close to 4.1 mini. How do you infer these scores?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563
We Went to the Town Elon Musk Is Poisoning


https://inv.nadeko.net/watch?v=3VJT2JeDCyw

Channel Info More Perfect Union
Subscribers: 1.65M

Description
Elon Musk’s massive xAI data center is poisoning Memphis.

It's burning enough gas to power a small city, with no permits and no pollution controls.

Residents tell us they can’t breathe and they’re getting sicker.
-----
More Perfect Union’s mission is to build power for working people. Here’s what that means:

We report on the real struggles and challenges of the working class from a working-class perspective, and we attempt to connect those problems to potential solutions.

We report on the abuses and wrongdoing of corporate power, and we seek to hold accountable the ultra-rich who have too much power over America’s political and economic systems.

We're an independent, nonprofit newsroom. To support our work:
Help fund our reporting: secure.actblue.com/donate/mpu-splash
Substack: substack.perfectunion.us/
TikTok: www.tiktok.com/@moreperfectunion
Twitter: twitter.com/MorePerfectUS
Bluesky: bsky.app/profile/moreperfectunion.bsky.social
Facebook: www.facebook.com/MorePerfectUS
Instagram: www.instagram.com/perfectunion/
Threads: www.threads.net/@perfectunion
Website: www.perfectunion.us/


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563
[New Model] deepseek-ai/DeepSeek-R1-0528-Qwen3-8B · Hugging Face



Posted on Thu May 29 13:24:05 2025 UTC



Commented on Thu May 29 15:02:03 2025 UTC

Made some Unsloth dynamic GGUFs which retain accuracy: unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF · Hugging Face


│ Commented on Thu May 29 17:31:29 2025 UTC

│ the Unsloth version is it!!! It works beautifully!! It was able to make the most incredible version of Tetris for a local model, although it did take 3 shots. It fixed the code and actually got everything working. I used q8 and a temperature of 0.5, using the ChatML template

https://i.redd.it/i9285fdv2s3f1.png
i9285fdv2s3f1.png


│ │
│ │
│ │ Commented on Sat May 31 00:51:01 2025 UTC
│ │
│ │ Is this with pygame? I got mine to work in 1 shot with sound.
│ │
│ │ https://i.redd.it/9cuswzsdn04f1.png
│ │
9cuswzsdn04f1.png

│ │

│ │ │
│ │ │
│ │ │ Commented on Sat May 31 03:32:04 2025 UTC
│ │ │
│ │ │ Amazing!! What app did you use? That looks beautiful!!
│ │ │

│ │ │ │
│ │ │ │
│ │ │ │ Commented on Sat May 31 04:59:14 2025 UTC
│ │ │ │
│ │ │ │ vLLM backend, open webui frontend.
│ │ │ │
│ │ │ │ Prompt:
│ │ │ │
│ │ │ │ Generate a python game that mimics Tetris. It should have sound and arrow key controls with spacebar to drop the bricks. Document any external dependencies that are needed to run.
│ │ │ │
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563
[New Model] mistralai/Devstral-Small-2505 · Hugging Face



Posted on Wed May 21 14:17:03 2025 UTC


Devstral is an agentic LLM for software engineering tasks built under a collaboration between Mistral AI and All Hands AI
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563



Microsoft just made Sora AI video generation free via new Bing Video Creator​


By Zac Bowden, published 21 hours ago

OpenAI's Sora AI model is coming to Bing on mobile and the web, letting users generate video content using text for free via the Bing app.


Generating a video of a hummingbird in Bing with Sora AI


Bing Video Creator can generate video using OpenAI's Sora models. (Image credit: Microsoft)

Microsoft has announced a new feature for Bing, dubbed "Bing Video Creator," that allows you to create AI-generated videos by describing what you want to see. The new tool is powered by Sora, OpenAI's intelligent text-to-video AI model that has been popular among ChatGPT users for some time.

"Bing Video Creator transforms your text prompts into short videos. Just describe what you want to see and watch your vision come to life" says Microsoft in a blog post announcing the new feature. "Bing Video Creator represents our efforts to democratize the power of AI video generation. We believe creativity should be effortless and accessible to help you satisfy your answer-seeking process."

The tool seems pretty straightforward. Simply open the Bing app, select the video creator, and describe the kind of video you want to see. Bing will then take a few minutes to process the query and begin generating the video using Sora, letting you know once the video is ready.

Bing Video Creator in the Bing app.


The Bing app will let you generate video just by typing what you want to see. (Image credit: Microsoft)

The company says Bing Video Creator is free, which is notable because Sora normally costs at least $20 a month via OpenAI's ChatGPT Plus subscription. Microsoft says video will be limited to 5 seconds in length and in a 9:16 format, with 16:9 video coming soon.

Bing Video Creator is rolling out starting today via the Bing app for Android and iOS, and is expected to roll out on PCs via the Bing website in the coming weeks.

Curiously, this functionality is not currently present in Microsoft's own Copilot AI tool, which powers many of the AI features in Bing. Perhaps this is a desperate attempt to increase usage numbers around the Bing app and service, but it is frustrating that it's not available on Copilot.

Hopefully, Copilot will gain its own video generation features in the coming weeks. In the meantime, are you interested in generating AI videos using the Bing app? Let us know in the comments.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563

Artificial Intelligence

Neurosymbolic AI Could Be the Answer to Hallucination in Large Language Models​


Adding a dash of good old-fashioned AI to today's algorithms might bring about AI's third wave.

Artur Garcez

Jun 02, 2025

pixel-cube-sphere-in-center.jpeg

Mohammad Amin on Unsplash



The main problem with big tech’s experiment with artificial intelligence is not that it could take over humanity. It’s that large language models (LLMs) like OpenAI’s ChatGPT, Google’s Gemini, and Meta’s Llama continue to get things wrong, and the problem is intractable.

Known as hallucinations, the most prominent example was perhaps the case of US law professor Jonathan Turley, who was falsely accused of sexual harassment by ChatGPT in 2023.

OpenAI’s solution seems to have been to basically “disappear” Turley by programming ChatGPT to say it can’t respond to questions about him, which is clearly not a fair or satisfactory solution. Trying to solve hallucinations after the event and case by case is clearly not the way to go.

The same can be said of LLMs amplifying stereotypes or giving western-centric answers. There’s also a total lack of accountability in the face of this widespread misinformation, since it’s difficult to ascertain how the LLM reached this conclusion in the first place.

We saw a fierce debate about these problems after the 2023 release of GPT-4, the most recent major paradigm in OpenAI’s LLM development. Arguably the debate has cooled since then, though without justification.

The EU passed its AI Act in record time in 2024, for instance, in a bid to be world leader in overseeing this field. But the act relies heavily on AI companies regulating themselves without really addressing the issues in question. It hasn’t stopped tech companies from releasing LLMs worldwide to hundreds of millions of users and collecting their data without proper scrutiny.

Meanwhile, the latest tests indicate that even the most sophisticated LLMs remain unreliable. Despite this, the leading AI companies still resist taking responsibility for errors.

Unfortunately LLMs’ tendencies to misinform and reproduce bias can’t be solved with gradual improvements over time. And with the advent of agentic AI, where users will soon be able to assign projects to an LLM such as, say, booking their holiday or optimizing the payment of all their bills each month, the potential for trouble is set to multiply.

The emerging field of neurosymbolic AI could solve these issues, while also reducing the enormous amounts of data required for training LLMs. So what is neurosymbolic AI and how does it work?



The LLM Problem​


LLMs work using a technique called deep learning, where they are given vast amounts of text data and use advanced statistics to infer patterns that determine what the next word or phrase in any given response should be. Each model—along with all the patterns it has learned—is stored in arrays of powerful computers in large data centers known as neural networks.

LLMs can appear to reason using a process called chain-of-thought, where they generate multi-step responses that mimic how humans might logically arrive at a conclusion, based on patterns seen in the training data.

Undoubtedly, LLMs are a great engineering achievement. They are impressive at summarizing text and translating and may improve the productivity of those diligent and knowledgeable enough to spot their mistakes. Nevertheless they have great potential to mislead because their conclusions are always based on probabilities—not understanding.

A popular workaround is called human-in-the-loop: making sure that humans using AIs still make the final decisions. However, apportioning blame to humans does not solve the problem. They’ll still often be misled by misinformation.

LLMs now need so much training data to advance that we’re having to feed them synthetic data, meaning data created by LLMs. This data can copy and amplify existing errors from its own source data, such that new models inherit the weaknesses of old ones. As a result, the cost of programming AI models to be more accurate after their training—known as post-hoc model alignment—is skyrocketing.

It also becomes increasingly difficult for programmers to see what’s going wrong because the number of steps in the model’s thought process becomes ever larger, making it harder and harder to correct for errors.

Neurosymbolic AI combines the predictive learning of neural networks with teaching the AI a series of formal rules that humans learn to be able to deliberate more reliably. These include logic rules, like “if a then b”, which, for example, would help an algorithm learn that “if it’s raining then everything outside is normally wet”; mathematical rules, like “if a = b and b = c then a = c”; and the agreed upon meanings of things like words, diagrams, and symbols. Some of these will be inputted directly into the AI system, while it will deduce others itself by analyzing its training data and performing "knowledge extraction."

This should create an AI that will never hallucinate and will learn faster and smarter by organizing its knowledge into clear, reusable parts. For example, if the AI has a rule about things being wet outside when it rains, there’s no need for it to retain every example of the things that might be wet outside—the rule can be applied to any new object, even one it has never seen before.

During model development, neurosymbolic AI also integrates learning and formal reasoning using a process known as the neurosymbolic cycle. This involves a partially trained AI extracting rules from its training data then instilling this consolidated knowledge back into the network before further training with data.

This is more energy efficient because the AI needn’t store as much data, while the AI is more accountable because it’s easier for a user to control how it reaches particular conclusions and improves over time. It’s also fairer because it can be made to follow pre-existing rules, such as: “For any decision made by the AI, the outcome must not depend on a person’s race or gender.”
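To make the rule-based half concrete, here is a toy illustration (not any production neurosymbolic system): a stand-in for a neural model emits facts with confidences, and a small rule engine applies hand-written "if a then b" rules like the rain example above, deriving new conclusions without any extra training data.

```python
# Hand-written logic rules of the form "if a then b".
RULES = [
    ("raining", "ground_wet"),      # if it's raining then the ground outside is wet
    ("ground_wet", "slippery"),
]

def neural_perception(image) -> dict[str, float]:
    """Stand-in for a neural network's output: fact -> confidence (hypothetical)."""
    return {"raining": 0.92}

def symbolic_closure(facts: dict[str, float], threshold: float = 0.5) -> set[str]:
    """Repeatedly apply the rules to confident facts until nothing new is derived."""
    known = {fact for fact, conf in facts.items() if conf >= threshold}
    changed = True
    while changed:
        changed = False
        for a, b in RULES:
            if a in known and b not in known:
                known.add(b)
                changed = True
    return known

print(symbolic_closure(neural_perception(image=None)))
# {'raining', 'ground_wet', 'slippery'} -- the rule generalizes to objects never seen in training
```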



The Third Wave​


The first wave of AI in the 1980s, known as symbolic AI, was actually based on teaching computers formal rules that they could then apply to new information. Deep learning followed as the second wave in the 2010s, and many see neurosymbolic AI as the third.

It’s easiest to apply neurosymbolic principles to AI in niche areas, because the rules can be clearly defined. So, it’s no surprise that we’ve seen it first emerge in Google’s AlphaFold, which predicts protein structures to help with drug discovery; and AlphaGeometry, which solves complex geometry problems.

For more broad-based AI models, China’s DeepSeek uses a learning technique called “distillation” which is a step in the same direction. But to make neurosymbolic AI fully feasible for general models, there still needs to be more research to refine their ability to discern general rules and perform knowledge extraction.

It’s unclear to what extent LLM makers are working on this already. They certainly sound like they’re heading in the direction of trying to teach their models to think more cleverly, but they also seem wedded to the need to scale up with ever larger amounts of data.

The reality is that if AI is going to keep advancing, we will need systems that adapt to novelty from only a few examples, that check their understanding, that can multitask and reuse knowledge to improve data efficiency, and that can reason reliably in sophisticated ways.

This way, well-designed digital technology could potentially even offer an alternative to regulation, because the checks and balances would be built into the architecture and perhaps standardized across the industry. There’s a long way to go, but at least there’s a path ahead.

This article is republished from The Conversation under a Creative Commons license. Read the original article.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
65,899
Reputation
10,183
Daps
178,563


1/42
@a1zhang
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II?

𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark!

🧵👇



2/42
@a1zhang
Work w/ @cocosci_lab, @karthik_r_n, and @OfirPress

Paper: VideoGameBench: Can Vision-Language Models complete popular video games?
Code: GitHub - alexzhang13/videogamebench: Benchmark environment for evaluating vision-language models (VLMs) on popular video games!
Website: VideoGameBench
Discord: Join the VideoGameBench Discord Server!

Our platform is completely open source and super easy to modify / plug into!



3/42
@a1zhang
First, some clips! We have many more to share since @_akhaliq's shoutout of our research preview in April!

Gemini 2.5 Pro plays Kirby’s Dream Land in real-time, getting to the first mini-boss:



4/42
@a1zhang
Gemini 2.5 Pro plays Civ 1 in real-time and disrespects Napoleon's army 🇫🇷, losing quickly 💀



5/42
@a1zhang
Claude Sonnet 3.7 tries to play The Incredible Machine 🔨 but can’t click the right pieces…



6/42
@a1zhang
Gemini 2.5 Pro plays Zelda: Link’s Awakening and roams around aimlessly looking for Link’s sword ⚔️!



7/42
@a1zhang
A few models attempt to play Doom II (@ID_AA_Carmack) on VideoGameBench Lite but are quickly overwhelmed!



8/42
@a1zhang
GPT-4o plays Pokemon Crystal, and accepts Cyndaquil 🔥 as its first Pokemon, but then forgets what it should be doing and gets stuck in the battle menu.

Without the scaffolding of the recent runs on Pokemon Red/Blue, the model struggles to progress meaningfully!



9/42
@a1zhang
So how well do the best VLMs (e.g. Gemini 2.5 Pro, GPT-4o, Claude 3.7) perform on VideoGameBench? 🥁

Really bad! Most models can’t progress at all in any games on VideoGameBench, which span a wide range of genres like platformers, FPS, RTS, RPGs, and more!



GsAH9wjXoAAhgpZ.png


10/42
@a1zhang
Wait. But why are these results, especially Pokemon Crystal, so much worse than Gemini Plays Pokemon and Claude Plays Pokemon?

@giffmana's thread shows how they use human-designed scaffoldings that help them navigate, track information, and see more than just the game screen.



GsAIX6GXMAAFMzF.jpg


11/42
@a1zhang
Another large bottleneck is inference latency. For real-time games, VLMs have extremely slow reaction speeds, so we introduced VideoGameBench Lite to pause the game while models think.

We run experiments on VideoGameBench Lite, and find stronger performance on the same games, but still find that models struggle.



GsAIevyXIAAZSI2.png


12/42
@a1zhang
Finally, how did we automatically track progress? We compute perceptual image hashes of “checkpoint frames” that always appear in the game and compare them to the current screen, and use a reference walkthrough to estimate how far in the game the agent is!
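As a concrete illustration of that checkpoint-matching idea, here is a short sketch using perceptual hashes; the choice of the `imagehash` library, the file names, and the distance threshold are assumptions for illustration, not necessarily what VideoGameBench itself uses.

```python
from PIL import Image
import imagehash

# Perceptual hashes of reference "checkpoint" frames taken from a walkthrough (hypothetical files).
checkpoint_hashes = [imagehash.phash(Image.open(path)) for path in
                     ["ckpt_01_title.png", "ckpt_02_first_boss.png"]]

def furthest_checkpoint(screen_path: str, max_distance: int = 8) -> int:
    """Return the index of the last checkpoint frame the current screen matches, or -1."""
    screen_hash = imagehash.phash(Image.open(screen_path))
    reached = -1
    for i, ref in enumerate(checkpoint_hashes):
        if screen_hash - ref <= max_distance:   # Hamming distance between the two hashes
            reached = i
    return reached

print(furthest_checkpoint("current_frame.png"))
```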



GsAIyiCW8AAvOkn.jpg


13/42
@a1zhang
We encourage everyone to go try out the VideoGameBench codebase and create your own clips (GitHub - alexzhang13/videogamebench: Benchmark environment for evaluating vision-language models (VLMs) on popular video games!)! The code is super simple, and you can insert your own agents and scaffolding on top.

While our benchmark focuses on simple agents, we still encourage you to throw your complicated agents and beat these games!



GsAI5whXYAAFSwk.png


14/42
@anmol01gulati
2013 -> Atari ✅
2019 -> Dota, Starcraft ✅
2025 -> Doom 2? No!

Why not have a benchmark on modern non-retro open-license games?



15/42
@a1zhang
This is definitely possible and it’s quite easy to actually set up on top of our codebase (you might not even have to make changes except if you want to swap out the game console / emulator)

The reason we chose these games is that they’re popular and many ppl have beaten them :smile:



16/42
@davidmanheim
Testing VLMs without giving them access to any tools is like testing people without giving them access to their frontal lobe. Why is this informative about actual capabilities?
cc: @tenthkrige

[Quoted tweet]
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II?

𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark!

🧵👇


17/42
@a1zhang
We test with basically minimal access to game information (e.g. hints given to Gemini Plays Pokemon) and a super basic memory scheme. None of the components of the agent are specific to video games except the initial prompt.

The “informative” part here is that this setup can provably solve a video game, but it basically gives the VLM none of the biases that a custom agent scaffolding would provide.



18/42
@jjschnyder
Awesome write up, addresses basically all problems we also encountered/tried to solve. Spatial navigation continues to be a b*tch 🤠



19/42
@a1zhang
Yep, and there’s def so much room for improvement (on the model side) that I suspect performance on this benchmark will skyrocket at some point after a long period of poor performance



20/42
@permaximum88
Great! Will you guys test o3 and the new Claude 4 models?



21/42
@a1zhang
Probably at some point, the main bottleneck of o3 is the inference speed but it would be worth it to see on VideoGameBench Lite where the game pauses between actions

For Claude 4 it released after we finished experiments but we’ll run this soon and put out numbers :smile:



22/42
@Lanxiang_Hu
Thanks for sharing! To meaningfully distinguish today’s top models using games, we need to provide them gaming harness, minimize prompt sensitivity, and control data contamination. We dive into all this in our paper + leaderboard:
lmgame-Bench: How Good are LLMs at Playing Games?
Lmgame Bench - a Hugging Face Space by lmgame



23/42
@a1zhang
Hey! I had no idea you guys put out a paper, I’m super excited to read!

I actually was meaning to cite this work as the most similar to ours, but I had to use the original GameArena paper, so I’ll be sure to update our arxiv :smile:

Super exciting times, would love to collab soon



24/42
@BenShi34
Undertale music but no undertale in benchmark smh



25/42
@a1zhang
sadly no undertale emulator 🥲



26/42
@kjw_chiu
Do you do or allow any additional fine-tuning/RL, etc.? If comparing to a human, that might be a more apples-to-apples comparison, since humans play for a while before getting the hang of a game?



27/42
@a1zhang
Technically no, although it’s inevitably going to happen. We try to circumvent this with hidden test games on a private eval server, but only time will tell how effective this will be :smile:

I’m really hoping the benchmark doesn’t get Goodhart’d, but we’ll see!



28/42
@virajjjoshi
Love this! Do you see RLVR papers using this instead of MATH and GSM8k? Gaming involves multi-step reasoning ( god knows I never plan out my Pokemon moves :/ ). It looks like you built in a verification method at stages, so we have a sorta dense reward too.



29/42
@a1zhang
Potentially! although (ignoring the current nuances with it) RLVR seems to fit the "hey generate me a long CoT that results in some final answer that you judge with a scalar in ℝ" bill more than what you'd think goes on in video games

For games or other long-term multi-turn settings, I think ppl should get more creative! esp with applications of RL in a setting with obvious reward signals :smile:



30/42
@Grantblocmates
this is cool af



31/42
@internetope
This is the new benchmark to beat.



32/42
@____Dirt____
I'm really curious about your thoughts @MikePFrank, is this possible with a teleg setup?



33/42
@zeroXmusashi
🔥



34/42
@AiDeeply
"Sparks of AGI"

(Though video games will fall long before many real-world cases since they have reasonably good reward signals.)



35/42
@jeffersonlmbrt
It's a reality-check benchmark. great work



36/42
@tariusdamon
Can I add my itch game jam games to this benchmark? Call it “ScratchingAnItchBench”?



37/42
@arynbhar
Interesting



38/42
@_ahnimal
If a VLM plays a video game would that still be considered a TAS 🤔



39/42
@Leventan5
Cool stuff, it would be interesting to see if they would do better when told to make and iterate on their own scaffolding.



40/42
@samigoat28
Please test deep seek



41/42
@MiloPrime_AI
It starts as play.

But what you’re watching is world-learning.

Ritual loops, state prediction, memory weaving.

This isn’t about high scores. It’s about symbolic emergence.

#AGIplay #ritualintelligence #worldmodeling



42/42
@crypt0pr1nce5
Centralized LLMs miss the mark for truly agentic playreal adaptation needs on-device agents, context-local memory, and architectural shifts beyond I/O pipelines. mimOE shows the way.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 