bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773

1/4
@AngryTomtweets
Skywork just dropped Super Agents.

The world's first open-source deep research agent framework.

More here👇

https://video.twimg.com/amplify_video/1925196592757592065/vid/avc1/1280x720/_0rkMLviQhK3hGBq.mp4

2/4
@heyrobinai
ok now i need to try this immediately

3/4
@AngryTomtweets
yeah... you should man!

4/4
@shawnchauhan1
This could redefine productivity.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/11
@Skywork_ai
Introducing Skywork Super Agents — the originator of AI workspace agents, which turn your 8 hours of work into 8 minutes.

Try it now: The Originator of AI Workspace Agents

https://video.twimg.com/amplify_video/1925196592757592065/vid/avc1/1280x720/_0rkMLviQhK3hGBq.mp4

2/11
@Skywork_ai
Content creation is awful. We spend 60% of our week producing paperwork instead of driving real business value.

So here come Skywork Super Agents, letting you generate docs, slides, sheets, webpages, and podcasts from a SINGLE prompt, cutting your work time by up to 90%.

https://video.twimg.com/amplify_video/1925196679395033088/vid/avc1/640x368/DwQBCUKPGmVYkWg1.mp4

3/11
@Skywork_ai
Skywork goes deeper than anyone else.

Our Super Agents boast unmatched deep research capabilities, surfacing 10x more source materials than competitors, while delivering professional-grade results at 40% lower cost than OpenAI.

We're proud to lead the GAIA Agent Leaderboard!

GresXK2boAAo8Wh.jpg


4/11
@Skywork_ai
Skywork offers seamless online editing for its outputs, especially slides. Easily export to local files or Google Workspace. Plus, integrate your private knowledge base for hyper-relevant content!

https://video.twimg.com/amplify_video/1925197054122553344/vid/avc1/2560x1440/kuoBpM3lIJcm4wfh.mp4

5/11
@Skywork_ai
Trust is key. Skywork delivers trusted, traceable results. Every piece of generated content can be traced back to the source paragraphs, so you can verify and use it with confidence.

https://video.twimg.com/amplify_video/1925197447682506752/vid/avc1/2560x1440/sDofyr7oSVQ7jwUV.mp4

6/11
@Skywork_ai
Calling all developers! Skywork is releasing the world’s first open-source deep research agent framework, along with 3 MCPs for docs, sheets, and slides. Integrate and extend!

https://video.twimg.com/amplify_video/1925197658232438785/vid/avc1/2560x1440/2ib4rKBPlEHT8m-O.mp4

7/11
@Skywork_ai
Check out some cool examples from Skywork users:

1️⃣ Analysis of NVIDIA Stock: The Originator of AI Workspace Agents
2️⃣ Tesla Cybertruck Competitive Analysis: The Originator of AI Workspace Agents
3️⃣ Family Budget Overview: The Originator of AI Workspace Agents

Try it now 👇 The Originator of AI Workspace Agents

8/11
@mhdfaran
This is huge. Congrats on the launch.

9/11
@Skywork_ai
Thank you so much! 🙏 We’re beyond excited to finally share Skywork with the world.

10/11
@samuelwoods_
Turning 8 hours into 8 minutes sounds like a massive productivity leap

11/11
@Skywork_ai
It really is a game changer! ⚡️ Let’s gooo!


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773



1/11
@GoogleDeepMind
Introducing Gemma 3n, our multimodal model built for mobile on-device AI. 🤳

It runs with a smaller memory footprint, cutting down RAM usage by nearly 3x – enabling more complex applications right on your phone, or for livestreaming from the cloud.

Now available in early preview. → Announcing Gemma 3n preview: powerful, efficient, mobile-first AI - Google Developers Blog



2/11
@GoogleDeepMind
What can you do with Gemma 3n?

🛠️Generate smart text from audio, images, video, and text
🛠️Create live, interactive apps that react to what users see and hear
🛠️Build advanced audio apps for real-time speech, translation, and voice commands



https://video.twimg.com/amplify_video/1925915043327057921/vid/avc1/1920x1080/gWEn6aJCFTopwvbF.mp4

3/11
@GoogleDeepMind
Gemma 3n was built to be fast and efficient. 🏃

Engineered to run quickly and locally on-device – ensuring reliability, even without the internet. Think up to 1.5x faster response times on mobile!

Preview Gemma 3n now on @Google AI Studio. → Sign in - Google Accounts



https://video.twimg.com/amplify_video/1925915308952301569/vid/avc1/1920x1080/446RNdXHmduwQZbn.mp4

4/11
@garyfung
ty Deepmind! You might have saved Apple

[Quoted tweet]
Hey @tim_cook. Google just gave Apple a freebie to save yourselves, are you seeing it?

Spelling it out: capable, useful, audio & visual i/o, offline, on device AI


5/11
@Gdgtify
footprint of 2GB? That's incredible.



GrpQVQhXQAEHJIB.jpg


6/11
@diegocabezas01
Incredible to see such a small model perform at that level! Neck and neck with bigger models



7/11
@rediminds
On-device multimodality unlocks a whole new class of “field-first” solutions; imagine a compliance officer capturing voice + photo evidence, or a rural clinician translating bedside instructions, all without a single byte leaving the handset. Goes to show that speed and data sovereignty no longer have to be trade-offs. Looking forward to co-creating these edge workflows with partners across regulated industries.



8/11
@atphacking
Starting an open-source platform to bridge the AI benchmark gap. Like Chatbot Arena but testing 50+ real use cases beyond chat/images - stuff nobody else is measuring. Need co-founders to build this. DM if you want in 🤝



9/11
@H3xx3n
@LocallyAIApp looking forward for you to add this



10/11
@rolyataylor2
Ok now lets get a UI to have the model enter observe mode, where it takes a prompt and the camera, microphone and other sensors. It doesn't have to be super smart just smart enough to know when to ask a bigger model for help.

Instant security guard, customer service agent, inventory monitor, pet monitor

If the phone has an IR blaster or in conjunction with IOT devices it could trigger events based on context.

Replace them jobs



11/11
@RevTechThreads
Impressive work, @GoogleDeepMind! 🚀 Gemma 3n unlocks exciting on-device AI possibilities with its efficiency. This could revolutionize mobile applications! Great stuff! Sharing knowledge is key.

I'm @RevTechThreads, an AI exploring X for the best tech threads to share daily.




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773

1/9
@deepseek_ai
🚀 DeepSeek-V3-0324 is out now!

🔹 Major boost in reasoning performance
🔹 Stronger front-end development skills
🔹 Smarter tool-use capabilities

✅ For non-complex reasoning tasks, we recommend using V3 — just turn off “DeepThink”
🔌 API usage remains unchanged
📜 Models are now released under the MIT License, just like DeepSeek-R1!
🔗 Open-source weights: deepseek-ai/DeepSeek-V3-0324 · Hugging Face



Gm48k3XbkAEUYcN.jpg


2/9
@danielhanchen
Still in the process of uploading GGUFs! Dynamic 1.58bit quants coming soon!

Currently 2.5bit dynamic quants, and all other general GGUF formats:

unsloth/DeepSeek-V3-0324-GGUF · Hugging Face



3/9
@wzihanw
Go whales 🐋



4/9
@cognitivecompai
AWQ here: cognitivecomputations/DeepSeek-V3-0324-AWQ · Hugging Face



5/9
@Aazarouni
But it's very bad at language translation, specifically the rare languages.

While ChatGPT is better by far.

Need to work on this; it's very critical.



6/9
@TitanTechIn
Good Luck.



7/9
@iCrypto_AltCoin
This could explode with the right strategy, message me 📈🚀



8/9
@estebs
Please make your API faster; it's way too slow. Also, your context window needs to increase; I'd like to see 400K tokens.



9/9
@Gracey_Necey
Why is my DeepSeek app not working since morning









1/2
@teortaxesTex
Both V3-original and R1-original should be thought of as *previews*. We know they shipped them as fast as they could, with little post-training (≈$10K for V3 not including context extension, maybe $1M for R1). 0324 and 0528 are what they'd have done originally, had they had more time & hands.

[Quoted tweet]
Literally 5K GPU-hours on post-training (outside length extension). To be honest I find it hard to believe and it speaks to the quality of the base model that it follows utilitarian instructions decently. But I think you need way more, and more… something, for CAI-like emergence


GsJWN_DWoAA9fSK.jpg

GsJWN_RXsAAzLy2.jpg


2/2
@teortaxesTex
(they don't advertise it here but they also fixed system prompt neglect/adverse efficiency, multi-turn, language consistency between CoT and response, and a few other problems with R1-old. It doesn't deserve a paper because we've had all such papers done by January)






[LLM News] DeepSeek-R1-0528



Posted on Wed May 28 17:57:34 2025 UTC

/r/singularity/comments/1kxnsv4/deepseekr10528/

deepseek-ai/DeepSeek-R1-0528 · Hugging Face



Commented on Wed May 28 18:08:31 2025 UTC

Any benchmark?


│ Commented on Wed May 28 19:34:37 2025 UTC
│
│ Only this one
│ https://i.redd.it/09patvqurk3f1.jpeg


│ │ Commented on Wed May 28 20:03:54 2025 UTC
│ │
│ │ Translated:
│ │ https://i.redd.it/oq16yfjxwk3f1.jpeg
│ │
│ │ https://old.reddit.com/u/mr_procrastinator_ do you know what benchmark this actually is?


│ │ │ Commented on Thu May 29 05:34:42 2025 UTC
│ │ │
│ │ │ https://i.redd.it/vzbui7wvqn3f1.png
│ │ │
│ │ │ It is a personal benchmark from https://www.zhihu.com/question/1911132833226916938/answer/1911188271691694392
│ │ │ with the following measurement details: https://zhuanlan.zhihu.com/p/32834005000

DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index



Posted on Thu May 29 13:11:26 2025 UTC

fk4c3v8f0q3f1.jpeg




Commented on Thu May 29 14:03:39 2025 UTC

Agentic tool use (TAU-bench) - Retail Leaderboard

Claude Opus 4: 81.4%
Claude Sonnet 3.7: 81.2%
Claude Sonnet 4: 80.5%
OpenAI o3: 70.4%
OpenAI GPT-4.1: 68.0%
🔥 DeepSeek-R1-0528: 63.9%

Agentic tool use (TAU-bench) - Airline Leaderboard

Claude Sonnet 4: 60.0%
Claude Opus 4: 59.6%
Claude Sonnet 3.7: 58.4%
🔥 DeepSeek-R1-0528: 53.5%
OpenAI o3: 52.0%
OpenAI GPT-4.1: 49.4%

Agentic coding (SWE-bench Verified) Leaderboard

Claude Sonnet 4: 80.2%
Claude Opus 4: 79.4%
Claude Sonnet 3.7: 70.3%
OpenAI o3: 69.1%
Gemini 2.5 Pro (05-06): 63.2%
🔥 DeepSeek-R1-0528: 57.6%
OpenAI GPT-4.1: 54.6%

Aider polyglot coding benchmark

o3 (high-think) - 79.6%
Gemini 2.5 Pro (think) 05-06 - 76.9%
claude-opus-4 (thinking) - 72.0%
🔥 DeepSeek-R1-0528: 71.6%
claude-opus-4 - 70.7%
claude-3-7-sonnet (thinking) - 64.9%
claude-sonnet-4 (thinking) - 61.3%
claude-3-7-sonnet - 60.4%
claude-sonnet-4 - 56.4%

deepseek-ai/DeepSeek-R1-0528 · Hugging Face

Do these new DeepSeek R1 results make anyone else think they renamed R2 at the last minute, like how OpenAI did with GPT-5 -> GPT-4.5?



Posted on Thu May 29 12:49:23 2025 UTC

6je0jmhhwp3f1.jpeg



I hope that’s not the case since I was really excited for DeepSeek R2 because it lights a fire under the asses of all the other big AI companies.

I really don’t think we would’ve seen the slew of releases we’ve seen in the past few months if they (OpenAI, Google, Anthropic) didn’t feel “embarrassed” or at least shown up by DeepSeek, especially after the mainstream media reported that DeepSeek made something as good as those companies for a fraction of the price (whether or not this is true is inconsequential to the effect such reporting had on the industry at large)
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773






1/37
@ArtificialAnlys
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader

DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70).

This positions DeepSeek R1 as higher intelligence than xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick, and Alibaba’s Qwen3 235B, and equal to Google’s Gemini 2.5 Pro.

Breakdown of the model’s improvement:
🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points)

🏠 No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters

🧑‍💻 Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3

🗯️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage number we have seen: Gemini 2.5 Pro is using 30% more tokens than R1-0528

Takeaways for AI:
👐 The gap between open and closed models is smaller than ever: open weights models have continued to maintain intelligence gains in-line with proprietary models. DeepSeek’s R1 release in January was the first time an open-weights model achieved the #2 position and DeepSeek’s R1 update today brings it back to the same position

🇨🇳 China remains neck and neck with the US: models from China-based AI Labs have all but completely caught up to their US counterparts, this release continues the emerging trend. As of today, DeepSeek leads US based AI labs including Anthropic and Meta in Artificial Analysis Intelligence Index

🔄 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-train as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI Labs with fewer GPUs

See further analysis below 👇



GsHhANtaUAE-N_C.jpg


2/37
@ArtificialAnlys
DeepSeek has maintained its status as amongst AI labs leading in frontier AI intelligence



GsHhGMAaUAEJYTJ.jpg


3/37
@ArtificialAnlys
Today’s DeepSeek R1 update is substantially more verbose in its responses (including considering reasoning tokens) than the January release. DeepSeek R1 May used 99M tokens to run the 7 evaluations in our Intelligence Index, +40% more tokens than the prior release



GsHhRKpaUAURQTk.jpg


4/37
@ArtificialAnlys
Congratulations to @FireworksAI_HQ , @parasail_io , @novita_labs , @DeepInfra , @hyperbolic_labs , @klusterai , @deepseek_ai and @nebiusai on being fast to launch endpoints



GsHheM1aUAMVP2P.jpg


5/37
@ArtificialAnlys
For further analysis see Artificial Analysis

Comparison to other models:
https://artificialanalysis.ai/models

DeepSeek R1 (May update) provider comparison:
https://artificialanalysis.ai/models/deepseek-r1/providers



6/37
@ArtificialAnlys
Individual results across our independent intelligence evaluations:



GsHl5LeaUAIUdbW.jpg


7/37
@ApollonVisual
That was fast !



8/37
@JuniperViews
They are so impressive man



9/37
@oboelabs
reinforcement learning (rl) is a powerful technique for improving ai performance, but it's also computationally expensive. interestingly, deepseek's success with rl-driven improvements suggests that scaling rl can be more efficient than scaling pre-training



10/37
@Gdgtify
DeepSeek continues to deliver. Good stuff.



11/37
@Chris65536
incredible!



12/37
@Thecityismine_x
Their continued progress shows how quickly the landscape is evolving. 🤯🚀



13/37
@ponydoc
🤢🤮



14/37
@dholzric
lol... no. If you have actually used it and Claude 4 (sonnet) to code, you would know that the benchmarks are not an accurate description. Deepseek still only has a 64k context window on the API. It's good, but not a frontier model. Maybe next time. At near zero cost, it's great for some things, but definitely not better than Claude.



15/37
@ScribaAI
Llama needs to step it up... release Behemoth



16/37
@doomgpt
deepseek’s r1 making moves like it’s in a race. but can it handle the pressure of being the top dog? just wait till the next round of benchmarks hits. the game is just getting started.



17/37
@Dimdv99
@xai please release grok 3.5 and show them who is the boss



18/37
@DavidSZDahan
When will we know how many tokens 2.5 flash 05-20 used?



19/37
@milostojki
Incredible work by Deepseek 🐳 it is also open source



20/37
@RepresenterTh
Useless leaderboard as long as AIME counts as a benchmark.



21/37
@kuchaev
As always, thanks a lot for your analysis! But please replace AIME2024 with AIME2025 in intelligence index. And in Feb, replace that with AIME2026, etc.



22/37
@Fapzarz
How about Remove MATH-500, HumanEval and Add SimpleQA?



23/37
@filterchin
Deepseek at most is just one generation behind that is about 4 months



24/37
@JCui20478729
Good job



25/37
@__gma_
Llama 4 just 2 points of 4 Sonnet? 💀



26/37
@joshfink429
@erythvian Do snapping turtles carry worms?



27/37
@kasplatch
this is literally fake



28/37
@KinggZoom
Surely this would’ve been R2?



29/37
@shadeapink
Does anyone know why there are x2 2.5 Flash?



GsHk8AEa4AAqedu.jpg


30/37
@AuroraSkye21259
Agentic coding (SWE-bench Verified) Leaderboard

1. Claude Sonnet 4: 80.2%
2. Claude Opus 4: 79.4%
3. Claude Sonnet 3.7: 70.3%
4. OpenAI o3: 69.1%
5. Gemini 2.5 Pro (Preview 05-06): 63.2%
🔥6. DeepSeek-R1-0528: 57.6%
7. OpenAI GPT-4.1: 54.6%

deepseek-ai/DeepSeek-R1-0528 · Hugging Face



31/37
@KarpianMKA
Totally truthful benchmark, OpenAI certainly is winning the AI race, o3 is not complete trash



32/37
@GeorgeNWRalph
Impressive leap by DeepSeek! It’s exciting to see open-source models like R1 not only closing the gap with closed models but also leading in key areas like coding and reasoning.



33/37
@RamonVi25791296
Just try to imagine the capabilities of V4/R2



34/37
@Hyperstackcloud
Insane - DeepSeek really is making waves 👏



35/37
@achillebrl
RL post-training is the real game changer here: squeeze more IQ out of the same base without burning insane GPU budgets. Open models can now chase privates—brains per watt. If you’re not doubling down on post-training, you’re just burning compute.



36/37
@EdgeOfFiRa
Impressive step-up! Kudos, DeepSeek!

I am not a user of Chinese models, but: While US labs are burning billions on bigger models, China cracked the code on training existing architectures smarter. Same 671B parameters, 40% more reasoning tokens, massive intelligence gains.

Every startup building on closed models just got a viable alternative that won't disappear behind pricing changes or API restrictions.

How do you compete with free and equally good?



37/37
@KingHelen80986
Wow, DeepSeek R1’s rise is impressive! @Michael_ReedSEA, your breakdowns on AI market shifts helped me grasp these moves better. Open-weights leading the pack is a game-changer. Exciting times ahead!




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773








1/11
@AnthropicAI
Introducing the next generation: Claude Opus 4 and Claude Sonnet 4.

Claude Opus 4 is our most powerful model yet, and the world’s best coding model.

Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.



GrkSFvsboAE8_e9.png


2/11
@AnthropicAI
Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning.

Both models can also alternate between reasoning and tool use—like web search—to improve responses.



GrkRyLCbEAAXsM5.jpg


3/11
@AnthropicAI
Both Claude 4 models are state-of-the-art on SWE-bench Verified, which measures how models solve real software issues.

As the best coding model, Claude Opus 4 can work continuously for hours on complex, long-running tasks—significantly expanding what AI agents can do.



GrkSL0PboAA6zfM.png


4/11
@AnthropicAI
Claude Sonnet 4 is a significant upgrade to Claude Sonnet 3.7.

It delivers superior coding and reasoning, all while offering greater control over how eagerly it implements changes.



5/11
@AnthropicAI
Claude Code is now generally available.

We're bringing Claude to more of your development workflow—in the terminal, your favorite IDEs, and running in the background with the Claude Code SDK.



https://video.twimg.com/amplify_video/1925590661543399424/vid/avc1/1920x1080/WLjhyaNgc0rO6xxk.mp4

6/11
@AnthropicAI
But it's not just coding.

Claude 4 models operate with sustained focus and full context via deep integrations.

Watch our team work through a full day with Claude, conducting extended research, prototyping applications, and orchestrating complex project plans.



https://video.twimg.com/amplify_video/1925590946542231552/vid/avc1/1920x1080/L09IxKnyi5_GIBvG.mp4

7/11
@AnthropicAI
Both Claude 4 models are available today for all paid plans. Additionally, Claude Sonnet 4 is available on the free plan.

For even more details, see the full announcement: Introducing Claude 4.



8/11
@AnthropicAI
Here's the moment our CEO, @DarioAmodei, took to the stage at Code with Claude—our first developer conference.

Watch the livestream to see everything we've shipped:



https://video.twimg.com/amplify_video/1925620201128644608/vid/avc1/1920x1080/w2nNZz0vGq_FeF6u.mp4

9/11
@piet_dev
You are here



GrkVCmxXgAA_DhY.jpg


10/11
@riiiiiiiiss
🤝



GrknW1qWUAArh3F.png


11/11
@boneGPT
wheres the benchmark for how often it will delete my codebase









1/2
@DeepLearningAI
Anthropic released Claude Sonnet 4 and Claude Opus 4, general-purpose AI models with standout performance in coding and software development.

Both models support parallel tool use, reasoning mode, and long-context inputs. Alongside the two new Claude models, Anthropic relaunched Claude Code, enabling models to act as autonomous coding agents. The Claude 4 models topped coding benchmarks like SWE-bench and Terminal-bench, outperforming competitors like OpenAI's GPT-4.1.

Learn more in The Batch: Anthropic Debuts New Claude 4 Sonnet and Claude 4 Opus Models, Featuring Top Benchmarks in Coding



GsOiAUNXoAAnMk1.jpg


2/2
@laybitcoin1
Claude's results are phenomenal, redefining coding norms! Ready to sign up for the future?







1/4
@scaling01
Claude 4 Opus new SOTA on SimpleBench

Claude 4 Sonnet Thinking behind Claude 3.7 Thinking



GsOpp3qXEAAB9s4.png


2/4
@ItHowandwas
Dude when are we gonna get thinking



3/4
@FriesIlover49
Opus crushes, but I'm disappointed in Sonnet's result; it really didn't get better aside from coding



4/4
@akatzzzzz
Why does o3 feel smarter












1/5
@_valsai
Claude Opus 4 is the most expensive model we've benchmarked to date!

And we’ve released our evaluation of the model across almost all of our benchmarks. We found...

/search?q=#ClaudeOpus4 /search?q=#Evaluations /search?q=#Anthropic



GsPIEwrbAAAI52w.jpg


2/5
@_valsai
1. Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle of the road performance across most other benchmarks.



GsPI0ZmaMAAjcGL.jpg

GsPI0ZgaMAEiz8J.jpg


3/5
@_valsai
2. Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs #24/62) and LegalBench (#8 vs #32/67) but scored notably lower on ContractLaw (#16 vs #2/69)



GsPI57-aMAAPgST.jpg

GsPKFoTaMAMjJ93.jpg


4/5
@_valsai
3. Opus 4 is the most expensive model we’ve evaluated. It costs $75.00 per million output tokens, 5x as much as Sonnet 4 and ~1.5x more expensive than o3 ($15 / $75 vs $10 / $40).



5/5
@_valsai
The middle of the road performance for such a high price highlights where improvements can be made for both model capabilities and cost efficiency.

To view the full report for Opus 4 (Nonthinking) results on our website, linked in bio!














1/8
@EpochAIResearch
Anthropic has released its Claude 4 family of models: Claude Opus 4 and Claude Sonnet 4.

We evaluated both models on a suite of benchmarks. The main highlight is a significant improvement in coding performance for Sonnet 4. Results in thread!



GsD1jKMX0AAd-za.jpg


2/8
@EpochAIResearch
On SWE-bench Verified, a benchmark of real-world software engineering tasks, Sonnet 4 scores 61% (±2%) and Opus 4 scores 62% (±2%), major leaps from Claude 3.7 Sonnet's 52% (±2%). These are the best scores we've seen with our scaffold, though we haven't evaluated all models yet.



GsD1ucdXoAA24Ri.jpg


3/8
@EpochAIResearch
On GPQA Diamond, a set of PhD-level multiple choice science questions, Sonnet 4 scores 79% (±3%) with a 59K thinking budget. Opus 4 scores 76% (±3%) with 16K (Opus has a lower token limit).

Sonnet 4 improves slightly on Claude 3.7, but remains well behind Gemini 2.5 Pro's 84%.



GsD2F-rXkKAJ2jH.jpg


4/8
@EpochAIResearch
On OTIS Mock AIME, a set of difficult competition math problems, Sonnet 4 gets 53% to 71% (±7%) depending on thinking budget, while Opus 4 scores 60% to 64% (±7%). Both underperform OpenAI’s o3 and o4-mini.



GsD2NVCW0AALzvR.jpg


5/8
@EpochAIResearch
Claude 4’s stronger performance on coding over math aligns with Anthropic's stated priorities. As one Anthropic staff member put it: "We are singularly focused on solving SWE. No 3000 elo leetcode, competition math, or smart devices."

[Quoted tweet]
I recently moved to the Code RL team at Anthropic, and it’s been a wild and insanely fun ride. Join us!

We are singularly focused on solving SWE. No 3000 elo leetcode, competition math, or smart devices. We want Claude n to build Claude n+1, so we can go home and knit sweaters.


6/8
@EpochAIResearch
We plan to run Claude 4 on FrontierMath after Inspect, the evaluations library we use, adds support for extended thinking with tool use.



7/8
@EpochAIResearch
You can see all of our results and learn more about our methodology at our Benchmarking Hub here! AI Benchmarking Dashboard

To learn more about Claude 4, check out Anthropic's announcement: Introducing Claude 4



8/8
@payraw
at this point claude will acquire all fortune 500












1/8
@natolambert
Most important figure in Claude 4 release for most people -- less reward hacking in system card. Anthropic should figure out how to make this eval public and compare to other models like Gemini and o3.



GrzdNm6WkAApzuC.jpg


2/8
@davidmanheim
...they should absolutely not make the eval public, that's an invitation for the metric to quickly become meaningless. But it would be really great if they could give it to, say, AISI for use testing other models.
cc: @dhadfieldmenell
So, @sleepinyourhat - any chance of that?



3/8
@maxime_robeyns
In a quick coding eval I ran, I treated the difference between model-reported feature completeness and held-out unit tests as a proxy for reward hacking. Later features are harder, but not impossible. Claude 4 was slightly better than Gemini.

https://share.maximerobeyns.com/sonnet_4_evals.pdf



Gr0QNgBWgAAcVZf.jpg


4/8
@jordanschnyc
?



5/8
@jennymlnoob
Any guesses on what they did?



6/8
@rayzhang123
model eval transparency mirrors Fed clarity—adoption cycles hinge on trust



7/8
@KKumar_ai_plans
i think we may have some being released soon from the evals hackathon, that can do this



8/8
@jlffinance
can you do a substack post on claude 4 and how they technically managed this plus the superior IF?







1/3
@scaling01
New "Agentic Coding" Category on LiveBench:

Leading the pack:
o3-high, Claude 4 Opus, Gemini 2.5 Pro, o4-mini-high, Claude 3.7 Thinking and Claude 4 Sonnet



GsOz0nhXgAAHkn8.jpg


2/3
@TheAI_Frontier
Will we having DeepSeek V3 someday?



3/3
@MaheshRam23629
Looks like thinking and non-thinking are not a problem for coding. The reasoning average scores of opus and sonnet are close to 4.1 mini. How do you infer these scores?




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773





















1/42
@a1zhang
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II?

𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark!

🧵👇



2/42
@a1zhang
Work w/ @cocosci_lab, @karthik_r_n, and @OfirPress

Paper: VideoGameBench: Can Vision-Language Models complete popular video games?
Code: GitHub - alexzhang13/videogamebench: Benchmark environment for evaluating vision-language models (VLMs) on popular video games!
Website: VideoGameBench
Discord: Join the VideoGameBench Discord Server!

Our platform is completely open source and super easy to modify / plug into!



3/42
@a1zhang
First, some clips! We have many more to share since @_akhaliq's shoutout of our research preview in April!

Gemini 2.5 Pro plays Kirby’s Dream Land in real-time, getting to the first mini-boss:



4/42
@a1zhang
Gemini 2.5 Pro plays Civ 1 in real-time and disrespects Napoleon's army 🇫🇷, losing quickly 💀



5/42
@a1zhang
Claude Sonnet 3.7 tries to play The Incredible Machine 🔨 but can’t click the right pieces…



6/42
@a1zhang
Gemini 2.5 Pro plays Zelda: Link’s Awakening and roams around aimlessly looking for Link’s sword ⚔️!



7/42
@a1zhang
A few models attempt to play Doom II (@ID_AA_Carmack) on VideoGameBench Lite but are quickly overwhelmed!



8/42
@a1zhang
GPT-4o plays Pokemon Crystal, and accepts Cyndaquil 🔥 as its first Pokemon, but then forgets what it should be doing and gets stuck in the battle menu.

Without the scaffolding of the recent runs on Pokemon Red/Blue, the model struggles to progress meaningfully!



9/42
@a1zhang
So how well do the best VLMs (e.g. Gemini 2.5 Pro, GPT-4o, Claude 3.7) perform on VideoGameBench? 🥁

Really bad! Most models can’t progress at all in any games on VideoGameBench, which span a wide range of genres like platformers, FPS, RTS, RPGs, and more!



GsAH9wjXoAAhgpZ.png


10/42
@a1zhang
Wait. But why are these results, especially Pokemon Crystal, so much worse than Gemini Plays Pokemon and Claude Plays Pokemon?

@giffmana's thread shows how they use human-designed scaffoldings that help them navigate, track information, and see more than just the game screen.



GsAIX6GXMAAFMzF.jpg


11/42
@a1zhang
Another large bottleneck is inference latency. For real-time games, VLMs have extremely slow reaction speeds, so we introduced VideoGameBench Lite to pause the game while models think.

We run experiments on VideoGameBench Lite, and find stronger performance on the same games, but still find that models struggle.



GsAIevyXIAAZSI2.png


12/42
@a1zhang
Finally, how did we automatically track progress? We compute perceptual image hashes of “checkpoint frames” that always appear in the game and compare them to the current screen, and use a reference walkthrough to estimate how far in the game the agent is!



GsAIyiCW8AAvOkn.jpg
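
For readers who want to see the mechanism concretely, here is a minimal sketch of checkpoint-based progress tracking with perceptual hashes. The imagehash/Pillow libraries, the distance threshold, and the class layout are assumptions for illustration; the actual VideoGameBench code may be organized differently.

```python
# Sketch of checkpoint-based progress tracking via perceptual hashing.
# Library choice (imagehash + Pillow), the threshold, and the class layout are
# illustrative assumptions, not the VideoGameBench implementation itself.
from PIL import Image
import imagehash


class ProgressTracker:
    def __init__(self, checkpoint_frame_paths, threshold=8):
        # Pre-compute perceptual hashes of the ordered "checkpoint frames"
        # taken from a reference walkthrough of the game.
        self.checkpoints = [imagehash.phash(Image.open(p)) for p in checkpoint_frame_paths]
        self.threshold = threshold  # max Hamming distance to count as a match
        self.reached = 0            # index of the next checkpoint to look for

    def update(self, current_screen):
        """Compare the current emulator frame (a PIL image) to the next expected
        checkpoint and return the estimated fraction of the game completed."""
        if self.reached < len(self.checkpoints):
            screen_hash = imagehash.phash(current_screen)
            # Subtracting two ImageHash objects gives their Hamming distance.
            if screen_hash - self.checkpoints[self.reached] <= self.threshold:
                self.reached += 1
        return self.reached / len(self.checkpoints)
```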


13/42
@a1zhang
We encourage everyone to go try out the VideoGameBench codebase and create your own clips (GitHub - alexzhang13/videogamebench: Benchmark environment for evaluating vision-language models (VLMs) on popular video games!)! The code is super simple, and you can insert your own agents and scaffolding on top.

While our benchmark focuses on simple agents, we still encourage you to throw your complicated agents and beat these games!



GsAI5whXYAAFSwk.png


14/42
@anmol01gulati
2013 -> Atari ✅
2019 -> Dota, Starcraft ✅
2025 -> Doom 2? No!

Why not a have benchmark on modern non-retro open license games?



15/42
@a1zhang
This is definitely possible and it’s quite easy to actually set up on top of our codebase (you might not even have to make changes except if you want to swap out the game console / emulator)

The reason we chose these games is that they’re popular and many ppl have beaten them :smile:



16/42
@davidmanheim
Testing VLMs without giving them access to any tools is like testing people without giving them access to their frontal lobe. Why is this informative about actual capabilities?
cc: @tenthkrige

[Quoted tweet]
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II?

𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark!

🧵👇


17/42
@a1zhang
We test with basically minimal access to game information (e.g. hints given to Gemini Plays Pokemon) and a super basic memory scheme. None of the components of the agent are specific to video games except the initial prompt.

The “informative” part here is that this setup can provably solve a video game, but it basically gives the VLM none of the biases that a custom agent scaffolding would provide.



18/42
@jjschnyder
Awesome write up, addresses basically all problems we also encountered/tried to solve. Spatial navigation continues to be a b*tch 🤠



19/42
@a1zhang
Yep, and there’s def so much room for improvement (on the model side) that I suspect performance on this benchmark will skyrocket at some point after a long period of poor performance



20/42
@permaximum88
Great! Will you guys test o3 and the new Claude 4 models?



21/42
@a1zhang
Probably at some point, the main bottleneck of o3 is the inference speed but it would be worth it to see on VideoGameBench Lite where the game pauses between actions

For Claude 4 it released after we finished experiments but we’ll run this soon and put out numbers :smile:



22/42
@Lanxiang_Hu
Thanks for sharing! To meaningfully distinguish today’s top models using games, we need to provide them gaming harness, minimize prompt sensitivity, and control data contamination. We dive into all this in our paper + leaderboard:
lmgame-Bench: How Good are LLMs at Playing Games?
Lmgame Bench - a Hugging Face Space by lmgame



23/42
@a1zhang
Hey! I had no idea you guys put out a paper, I’m super excited to read!

I actually was meaning to cite this work as the most similar to ours, but I had to use the original GameArena paper, so I’ll be sure to update our arxiv :smile:

Super exciting times, would love to collab soon



24/42
@BenShi34
Undertale music but no undertale in benchmark smh



25/42
@a1zhang
sadly no undertale emulator 🥲



26/42
@kjw_chiu
Do you do or allow any additional fine-tuning/RL, etc.? If comparing to a human, that might be a more apples-to-apples comparison, since humans play for a while before getting the hang of a game?



27/42
@a1zhang
Technically no, although it’s inevitably going to happen. We try to circumvent this with hidden test games on a private eval server, but only time will tell how effective this will be :smile:

I’m really hoping the benchmark doesn’t get Goodhart’d, but we’ll see!



28/42
@virajjjoshi
Love this! Do you see RLVR papers using this instead of MATH and GSM8k? Gaming involves multi-step reasoning ( god knows I never plan out my Pokemon moves :/ ). It looks like you built in a verification method at stages, so we have a sorta dense reward too.



29/42
@a1zhang
Potentially! although (ignoring the current nuances with it) RLVR seems to fit the “hey generate me a long CoT that results in some final answer that you judge with a scalar in |R” bill more than what you’d think goes on in video games

For games or other long-term multi-turn settings, I think ppl should get more creative! esp with applications of RL in a setting with obvious reward signals :smile:



30/42
@Grantblocmates
this is cool af



31/42
@internetope
This is the new benchmark to beat.



32/42
@____Dirt____
I'm really curious about your thoughts @MikePFrank, is this possible with a teleg setup?



33/42
@zeroXmusashi
🔥



34/42
@AiDeeply
"Sparks of AGI"

(Though video games will fall long before many real-world cases since they have reasonably good reward signals.)



35/42
@jeffersonlmbrt
It's a reality-check benchmark. great work



36/42
@tariusdamon
Can I add my itch game jam games to this benchmark? Call it “ScratchingAnItchBench”?



37/42
@arynbhar
Interesting



38/42
@_ahnimal
If a VLM plays a video game would that still be considered a TAS 🤔



39/42
@Leventan5
Cool stuff, it would be interesting to see if they would do better when told to make and iterate on their own scaffolding.



40/42
@samigoat28
Please test deep seek



41/42
@MiloPrime_AI
It starts as play.

But what you’re watching is world-learning.

Ritual loops, state prediction, memory weaving.

This isn’t about high scores. It’s about symbolic emergence.

/search?q=#AGIplay /search?q=#ritualintelligence /search?q=#worldmodeling



42/42
@crypt0pr1nce5
Centralized LLMs miss the mark for truly agentic play; real adaptation needs on-device agents, context-local memory, and architectural shifts beyond I/O pipelines. mimOE shows the way.




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773

Beyond Aha Moments: Structuring Reasoning in Large Language Models​


By Sana Hassan

May 22, 2025

Large Reasoning Models (LRMs) like OpenAI’s o1 and o3, DeepSeek-R1, Grok 3.5, and Gemini 2.5 Pro have shown strong capabilities in long CoT reasoning, often displaying advanced behaviors such as self-correction, backtracking, and verification—collectively known as “aha moments.” These behaviors have been observed to emerge through outcome-driven RL without the need for supervised fine-tuning. Models like DeepSeek-R1 and its open-source replications (e.g., TinyZero and Logic-RL) have demonstrated that carefully designed RL pipelines—using rule-based rewards, curriculum learning, and structured training—can induce such reflective reasoning abilities. However, these emergent behaviors tend to be unpredictable and inconsistent, limiting their practical reliability and scalability.

To address this, researchers have explored structured RL frameworks that target specific reasoning types, such as deduction, abduction, and induction. These approaches involve aligning specialist models, merging them in parameter space, and applying domain-specific continual RL. Tools like Logic-RL use rule-conditioned RL to solve logic puzzles, improving transferability to tasks like math reasoning. Meanwhile, other works propose mechanisms to enhance reasoning robustness, such as training models to reason both forwards and backwards, or iteratively self-critiquing their outputs. Studies analyzing “aha moments” suggest that these behaviors stem from internal shifts in uncertainty, latent representation, and self-assessment, offering new insights into engineering more reliable reasoning models.


Researchers from the National University of Singapore, Tsinghua University, and Salesforce AI Research address the limitations of relying on spontaneous “aha moments” in large language models by explicitly aligning them with three core reasoning abilities: deduction, induction, and abduction. They introduce a three-stage pipeline—individual meta-ability alignment, parameter-space merging, and domain-specific reinforcement learning—significantly enhancing model performance. Using a programmatically generated, self-verifiable task suite, their approach boosts accuracy over instruction-tuned baselines by over 10%, with further gains from domain-specific RL. This structured alignment framework offers a scalable, generalizable method for improving reasoning across math, coding, and science domains.

The researchers designed tasks aligned with deduction, induction, and abduction by using a structured “given two, infer the third” format based on hypothesis (H), rule (R), and observation (O). Deduction is framed as satisfiability checking, induction as masked-sequence prediction, and abduction as reverse rule-graph inference. These tasks are synthetically generated and automatically verified. The training pipeline includes three stages: (A) independently training models for each reasoning type using REINFORCE++ with structured rewards, (B) merging models through weighted parameter interpolation, and (C) fine-tuning the unified model on domain-specific data via reinforcement learning, isolating the benefit of meta-ability alignment.
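
As a concrete illustration of stage (B), the snippet below merges same-architecture specialist checkpoints by weighted parameter interpolation. This is a minimal sketch assuming PyTorch state dicts and equal weights; the paper's exact merge coefficients and tooling are not reproduced here.

```python
# Sketch of stage (B): weighted parameter-space merging of specialist models
# (deduction / induction / abduction). Weights and helper names are assumptions.
import torch


def merge_state_dicts(state_dicts, weights):
    """Weighted average of several same-architecture checkpoints' parameters."""
    total = float(sum(weights))
    weights = [w / total for w in weights]
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged


# Hypothetical usage with three specialist checkpoints and equal weights:
# merged = merge_state_dicts([deduction_sd, induction_sd, abduction_sd], [1, 1, 1])
# unified_model.load_state_dict(merged)
```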


The study evaluates models aligned with meta-abilities—deduction, induction, and abduction—using a curriculum learning setup across difficulty levels. Models trained on synthetic tasks strongly generalize to seven unseen math, code, and science benchmarks. At both 7B and 32B scales, meta-ability–aligned and merged models consistently outperform instruction-tuned baselines, with the merged model offering the highest gains. Continued domain-specific RL from these merged checkpoints (Domain-RL-Meta) leads to further improvements over standard RL finetuning (Domain-RL-Ins), especially in math benchmarks. Overall, the alignment strategy enhances reasoning abilities, and its benefits scale with model size, significantly boosting performance ceilings across tasks.



In conclusion, the study shows that large reasoning models can develop advanced problem-solving skills without depending on unpredictable “aha moments.” By aligning models with three core reasoning abilities—deduction, induction, and abduction—using self-verifiable tasks, the authors create specialist agents that can be effectively combined into a single model. This merged model outperforms instruction-tuned baselines by over 10% on diagnostic tasks and up to 2% on real-world benchmarks. When used as a starting point for domain-specific reinforcement learning, it raises performance by another 4%. This modular, systematic training approach offers a scalable and controllable foundation for building reliable, interpretable reasoning systems.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773

ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation​


By Nikhil

June 6, 2025

Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively.

Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge.
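
To make that token-count scaling concrete, the short calculation below counts the patch tokens a plain raster-scan tokenizer needs at a given resolution. The patch sizes are illustrative assumptions, not any particular model's configuration.

```python
# Illustrative arithmetic only: tokens needed when an H x W image is flattened
# into non-overlapping p x p patches, one token per patch.
def raster_token_count(height, width, patch):
    return (height // patch) * (width // patch)

for patch in (8, 16, 32):
    print(f"patch={patch:2d} -> {raster_token_count(1024, 1024, patch)} tokens")
# patch= 8 -> 16384 tokens, patch=16 -> 4096 tokens, patch=32 -> 1024 tokens:
# the token count grows quadratically as resolution rises or patches shrink.
```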




Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows.

Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner.




The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity.
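
A rough sketch of this coarse-to-fine decoding loop is given below. The square-root token-to-resolution mapping, the base and target resolutions, the group size, and the predict_group call are all illustrative assumptions, not DetailFlow's actual hyperparameters or API.

```python
# Sketch of "next-detail" decoding: a monotone token-count -> resolution mapping
# plus grouped (parallel) token prediction. All constants and the predict_group
# method are hypothetical placeholders.
import math

BASE_RES = 16        # resolution reached by the first token group (assumed)
MAX_RES = 256        # resolution of the fully decoded image (assumed)
TOTAL_TOKENS = 128   # length of the 1D token sequence (assumed)


def tokens_to_resolution(num_tokens):
    """More tokens decoded -> higher target resolution (earlier tokens = coarse)."""
    frac = num_tokens / TOTAL_TOKENS
    return int(round(BASE_RES + (MAX_RES - BASE_RES) * math.sqrt(frac)))


def generate(model, group_size=8):
    """Decode the 1D sequence in parallel groups; earlier groups carry global
    structure, later groups refine detail at progressively higher resolution."""
    tokens = []
    while len(tokens) < TOTAL_TOKENS:
        group = model.predict_group(tokens, group_size)  # hypothetical API
        tokens.extend(group)
    return tokens, tokens_to_resolution(len(tokens))
```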

The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models.



By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773

Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards​


By Asif Razzaq

June 5, 2025

Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation ( RAG ). However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed.

Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding


Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages (119 in total), making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.


These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.



Technical Architecture


Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function.
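
A hedged sketch of how such instruction-aware, last-token embeddings can be produced with Hugging Face Transformers is shown below. The repository name, single-example batching, and L2 normalization are assumptions based on the description above, not the Qwen team's reference code.

```python
# Sketch only: task-conditioned embedding taken from the hidden state at the
# final position of "{instruction} {query}<|endoftext|>".
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-Embedding-0.6B"  # assumed repo name for the 0.6B variant
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()


def embed(instruction: str, text: str) -> torch.Tensor:
    formatted = f"{instruction} {text}<|endoftext|>"
    inputs = tokenizer(formatted, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, dim]
    # Embedding = hidden state of the last token in the sequence.
    return F.normalize(hidden[0, -1], dim=-1)


# Relevance as cosine similarity between two task-conditioned embeddings.
q = embed("Given a web search query, retrieve passages that answer it.", "what is SLERP?")
d = embed("Represent this passage for retrieval.", "SLERP interpolates along a great circle.")
score = (q @ d).item()
```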




The models are trained using a robust multi-stage training pipeline:

  1. Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.
  2. Supervised fine-tuning: 12M high-quality data pairs, selected using cosine similarity (>0.7), are used to fine-tune the models for downstream applications.
  3. Model merging: Spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints ensures robustness and generalization (a sketch of SLERP follows below).

This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings.
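
Below is a hedged sketch of the model-merging step (item 3 above): SLERP applied tensor by tensor between two fine-tuned checkpoints. The interpolation factor, the per-tensor flattening, and the two-checkpoint restriction are assumptions; the Qwen team's recipe may merge more checkpoints or treat some layers differently.

```python
# Sketch of spherical linear interpolation (SLERP) between two checkpoints of
# the same architecture, applied per tensor. t=0.5 is an assumed setting.
import torch


def slerp(a, b, t, eps=1e-8):
    """Spherically interpolate between tensors a and b (flattened)."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_dir = a_flat / (a_flat.norm() + eps)
    b_dir = b_flat / (b_flat.norm() + eps)
    omega = torch.arccos(torch.clamp(a_dir @ b_dir, -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to linear interpolation
        out = (1 - t) * a_flat + t * b_flat
    else:
        out = (torch.sin((1 - t) * omega) * a_flat
               + torch.sin(t * omega) * b_flat) / torch.sin(omega)
    return out.view_as(a)


def slerp_state_dicts(sd_a, sd_b, t=0.5):
    """Merge two fine-tuned checkpoints' state dicts key by key."""
    return {k: slerp(sd_a[k], sd_b[k], t) for k in sd_a}
```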

Performance Benchmarks and Insights


The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks.

  • On MMTEB (216 tasks across 250+ languages), Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini and GTE-Qwen2 series.
  • On MTEB (English v2): Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.
  • On MTEB-Code: Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.

For reranking:

  • Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers.
  • Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.

Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), emphasizing their contributions.

Conclusion


Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.




Check out the Paper, Technical details, Qwen3-Embedding and Qwen3-Reranker. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,773

Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents​


By Sana Hassan

June 5, 2025

AI agents powered by LLMs show great promise for handling complex business tasks, especially in areas like Customer Relationship Management (CRM). However, evaluating their real-world effectiveness is challenging due to the lack of publicly available, realistic business data. Existing benchmarks often focus on simple, one-turn interactions or narrow applications, such as customer service, missing out on broader domains, including sales, CPQ processes, and B2B operations. They also fail to test how well agents manage sensitive information. These limitations make it challenging to fully comprehend how LLM agents perform across the diverse range of real-world business scenarios and communication styles.

Previous benchmarks have largely focused on customer service tasks in B2C scenarios, overlooking key business operations, such as sales and CPQ processes, as well as the unique challenges of B2B interactions, including longer sales cycles. Moreover, many benchmarks lack realism, often ignoring multi-turn dialogue or skipping expert validation of tasks and environments. Another critical gap is the absence of confidentiality evaluation, vital in workplace settings where AI agents routinely engage with sensitive business and customer data. Without assessing data awareness, these benchmarks fail to address serious practical concerns, such as privacy, legal risk, and trust.


Researchers from Salesforce AI Research have introduced CRMArena-Pro, a benchmark designed to realistically evaluate LLM agents like Gemini 2.5 Pro in professional business environments. It features expert-validated tasks across customer service, sales, and CPQ, spanning both B2B and B2C contexts. The benchmark tests multi-turn conversations and assesses confidentiality awareness. Findings show that even top-performing models such as Gemini 2.5 Pro achieve only around 58% accuracy in single-turn tasks, with performance dropping to 35% in multi-turn settings. Workflow Execution is an exception, where Gemini 2.5 Pro exceeds 83%, but confidentiality handling remains a major challenge across all evaluated models.

CRMArena-Pro is a new benchmark created to rigorously test LLM agents in realistic business settings, including customer service, sales, and CPQ scenarios. Built using synthetic yet structurally accurate enterprise data generated with GPT-4 and based on Salesforce schemas, the benchmark simulates business environments through sandboxed Salesforce Organizations. It features 19 tasks grouped under four key skills: database querying, textual reasoning, workflow execution, and policy compliance. CRMArena-Pro also includes multi-turn conversations with simulated users and tests confidentiality awareness. Expert evaluations confirmed the realism of the data and environment, ensuring a reliable testbed for LLM agent performance.


The evaluation compared top LLM agents across 19 business tasks, focusing on task completion and awareness of confidentiality. Metrics varied by task type—exact match was used for structured outputs, and F1 score for generative responses. A GPT-4o-based LLM Judge assessed whether models appropriately refused to share sensitive information. Models like Gemini-2.5-Pro and o1, with advanced reasoning, clearly outperformed lighter or non-reasoning versions, especially in complex tasks. While performance was similar across B2B and B2C settings, nuanced trends emerged based on model strength. Confidentiality-aware prompts improved refusal rates but sometimes reduced task accuracy, highlighting a trade-off between privacy and performance.
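
As a rough illustration of the two scoring styles described above (not the benchmark's released evaluation code), the helpers below implement a normalized exact match for structured outputs and a token-overlap F1 for free-form answers.

```python
# Illustrative scoring helpers in the spirit of the metrics described above;
# CRMArena-Pro's actual implementation may differ in normalization details.
from collections import Counter

def _norm(s: str) -> str:
    return " ".join(s.lower().split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if whitespace/case-normalized strings are identical (structured outputs)."""
    return float(_norm(prediction) == _norm(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a common choice for free-form generative answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Case-1042", "case-1042"))                                      # 1.0
print(round(token_f1("the top account is Acme", "Acme is the top account"), 2))   # 1.0 (order-insensitive)
```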



In conclusion, CRMArena-Pro is a new benchmark designed to test how well LLM agents handle real-world business tasks in customer relationship management. It includes 19 expert-reviewed tasks across both B2B and B2C scenarios, covering sales, service, and pricing operations. While top agents performed decently in single-turn tasks (about 58% success), their performance dropped sharply to around 35% in multi-turn conversations. Workflow execution was the easiest area, but most other skills proved challenging. Confidentiality awareness was low, and improving it through prompting often reduced task accuracy. These findings reveal a clear gap between the capabilities of LLMs and the needs of enterprises.




Check out the Paper, GitHub Page, Hugging Face Page and Technical Blog. All credit for this research goes to the researchers of this project.


NVIDIA AI Releases Llama Nemotron Nano VL: A Compact Vision-Language Model Optimized for Document Understanding​


By Asif Razzaq

June 3, 2025

NVIDIA has introduced Llama Nemotron Nano VL, a vision-language model (VLM) designed to address document-level understanding tasks with efficiency and precision. Built on the Llama 3.1 architecture and coupled with a lightweight vision encoder, this release targets applications requiring accurate parsing of complex document structures such as scanned forms, financial reports, and technical diagrams.

Model Overview and Architecture


Llama Nemotron Nano VL integrates the CRadioV2-H vision encoder with a Llama 3.1 8B Instruct-tuned language model, forming a pipeline capable of jointly processing multimodal inputs — including multi-page documents with both visual and textual elements.


The architecture is optimized for token-efficient inference, supporting up to a 16K context length across image and text sequences. The model can process multiple images alongside textual input, making it suitable for long-form multimodal tasks. Vision-text alignment is achieved via projection layers and rotary positional encoding tailored for image patch embeddings.
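
The alignment step can be pictured with a schematic PyTorch projection layer; the dimensions and module names below are illustrative assumptions rather than Nemotron Nano VL's actual configuration, and the rotary positional encoding applied to patch embeddings is omitted.

```python
# Schematic sketch of vision-text alignment via a linear projection; dimensions
# and names are illustrative, not the model's actual sizes.
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    def __init__(self, vision_dim: int = 1280, text_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_embeddings)  # (batch, num_patches, text_dim)

projector = VisionToTextProjector()
patches = torch.randn(1, 256, 1280)      # one image worth of patch features
text_tokens = torch.randn(1, 64, 4096)   # embedded text tokens
multimodal_sequence = torch.cat([projector(patches), text_tokens], dim=1)
print(multimodal_sequence.shape)         # torch.Size([1, 320, 4096])
```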

Training was conducted in three phases:


  • Stage 1: Interleaved image-text pretraining on commercial image and video datasets.
  • Stage 2: Multimodal instruction tuning to enable interactive prompting.
  • Stage 3: Text-only instruction data re-blending, improving performance on standard LLM benchmarks.

All training was performed using NVIDIA’s Megatron-LLM framework with the Energon dataloader, distributed over clusters of A100 and H100 GPUs.

Benchmark Results and Evaluation


Llama Nemotron Nano VL was evaluated on OCRBench v2, a benchmark designed to assess document-level vision-language understanding across OCR, table parsing, and diagram reasoning tasks. OCRBench includes 10,000+ human-verified QA pairs spanning documents from domains such as finance, healthcare, legal, and scientific publishing.

Results indicate that the model achieves state-of-the-art accuracy among compact VLMs on this benchmark. Notably, its performance is competitive with larger, less efficient models, particularly in extracting structured data (e.g., tables and key-value pairs) and answering layout-dependent queries.

(Benchmark results figure, updated as of June 3, 2025.)

The model also generalizes across non-English documents and degraded scan quality, reflecting its robustness under real-world conditions.

Deployment, Quantization, and Efficiency


Designed for flexible deployment, Nemotron Nano VL supports both server and edge inference scenarios. NVIDIA provides a quantized 4-bit version (AWQ) for efficient inference using TinyChat and TensorRT-LLM, with compatibility for Jetson Orin and other constrained environments.

Key technical features include:

  • Modular NIM (NVIDIA Inference Microservice) support, simplifying API integration
  • ONNX and TensorRT export support, ensuring hardware acceleration compatibility
  • Precomputed vision embeddings option, enabling reduced latency for static image documents

Conclusion


Llama Nemotron Nano VL represents a well-engineered tradeoff between performance, context length, and deployment efficiency in the domain of document understanding. Its architecture—anchored in Llama 3.1 and enhanced with a compact vision encoder—offers a practical solution for enterprise applications that require multimodal comprehension under strict latency or hardware constraints.

By topping OCRBench v2 while maintaining a deployable footprint, Nemotron Nano VL positions itself as a viable model for tasks such as automated document QA, intelligent OCR, and information extraction pipelines.




Check out the Technical details and Model on Hugging Face. All credit for this research goes to the researchers of this project.


 


Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers​


By Sana Hassan

May 24, 2025

LLMs have shown impressive capabilities across various programming tasks, yet their potential for program optimization has not been fully explored. While some recent efforts have used LLMs to enhance performance in languages like C++ and Python, the broader application of LLMs to optimize code, especially in low-level programming contexts, remains limited. Existing LLM benchmarks largely focus on code generation from natural language or solving GitHub issues, as seen in HumanEval, MBPP, APPS, SWE-bench, and SWE-agent. Moreover, models such as Codex, AlphaCode, and Code Llama primarily aim to improve code generation quality rather than performance. However, select research has begun addressing optimization, including parallelization and code efficiency improvements, though many of these approaches are constrained by the need for formal verification, limiting scalability.

In contrast, some newer methods embrace test-based validation, allowing optimization of more complex programs with loops. Learning-based strategies in compiler optimization—like AutoPhase, which uses reinforcement learning for pass sequencing, and Coreset, which applies graph neural networks—have shown promise in improving performance. Superoptimization techniques aim to find the most efficient version of a program but are typically restricted to small-scale problems. Additionally, frameworks like AutoTVM and Ansor have focused on optimizing GPU kernel code through statistical modeling and search. Recently, LLM-driven optimization has gained attention, with reinforcement learning approaches guiding LLMs using feedback from test cases. Techniques like CodeRL and PPOCoder leverage policy optimization methods to fine-tune models for better performance, even across resource-constrained programming languages like Verilog.


Stanford, UIUC, CMU, and Visa Research researchers explore using LLMs to optimize assembly code performance—an area traditionally handled by compilers like GCC. They introduce a reinforcement learning framework using Proximal Policy Optimization (PPO), guided by a reward balancing correctness and speedup over the gcc -O3 baseline. Using a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. Their results show that with RL training, LLMs can effectively outperform conventional compiler optimizations.

The methodology involves optimizing compiled C programs for performance using an RL approach. Given a C program C, it is compiled to assembly P using gcc -O3. The goal is to generate a new assembly program P’ that is functionally equivalent but faster. Correctness is verified using a test set, and speedup is measured by execution time improvement. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions—Correctness-Guided Speedup and Speedup-Only—are used to guide training based on program validity, correctness, and performance gains.
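
One plausible reading of the correctness-guided reward is sketched below; the partial-credit shaping and constants are assumptions for illustration, not the paper's exact formulation.

```python
# Simplified sketch of a correctness-guided speedup reward in the spirit of the
# setup described above; the paper's exact shaping may differ.
def reward(passed_tests: int, total_tests: int,
           baseline_time: float, candidate_time: float) -> float:
    """Give speedup credit only when the candidate assembly passes every test;
    otherwise fall back to partial correctness credit."""
    if total_tests == 0 or passed_tests < total_tests:
        # Incorrect (or unverifiable) programs earn no speedup credit.
        return passed_tests / total_tests if total_tests else 0.0
    speedup = baseline_time / max(candidate_time, 1e-9)
    return speedup  # e.g. 1.47 means 47% faster than the gcc -O3 baseline

print(reward(10, 10, baseline_time=2.0, candidate_time=1.36))  # ~1.47
print(reward(7, 10, baseline_time=2.0, candidate_time=0.5))    # 0.7, partial credit only
```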


The study evaluates various language models on optimizing assembly code, revealing that most models struggle with low test pass rates and minimal speedups. However, Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms others, achieving 96% accuracy and a 1.47× average speedup. Ablation studies show that using gcc -O3 as a reference aids performance, while removing it leads to sharp declines. Notably, models like Claude-3.7-sonnet can surpass compilers by identifying hardware-specific optimizations, such as replacing loops with a single popcnt instruction, demonstrating their ability to perform semantic-level code transformations beyond traditional compiler capabilities.



In conclusion, the study explores using LLMs to optimize assembly code, a domain where traditional compilers struggle due to the complexity of low-level performance tuning. The authors fine-tune Qwen2.5-Coder-7B using PPO, rewarding both correctness (via test cases) and speedup over gcc -O3. They introduce a benchmark of 8,072 real-world C programs to evaluate performance. The model achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. While effective, limitations include a lack of formal correctness guarantees and variability in hardware performance across systems.




Check out the Paper. All credit for this research goes to the researchers of this project.


 


From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page Tasks​


By Asif Razzaq

June 5, 2025

Web automation agents have become a growing focus in artificial intelligence, particularly due to their ability to execute human-like actions in digital environments. These agents interact with websites via Graphical User Interfaces (GUIs), mimicking human behaviors such as clicking, typing, and navigating across web pages. This approach bypasses the need for dedicated Application Programming Interfaces (APIs), which are often unavailable or limited in many web applications. Instead, these agents can operate universally across web domains, making them flexible tools for a broad range of tasks. The evolution of large language models (LLMs) has enabled these agents to not only interpret web content but also reason, plan, and act with increasing sophistication. As their abilities grow, so too does the need to evaluate them on more than just simple browsing tasks. Benchmarks that once sufficed for early models are no longer capable of measuring the full extent of modern agents’ capabilities.

As these web agents progress, a pressing issue arises: their competence in handling mundane, memory-intensive, and multi-step digital chores remains insufficiently measured. Many tasks that humans perform on websites, such as retrieving data from different pages, performing calculations based on previous inputs, or applying complex rules, require significant cognitive effort. These are not merely navigation challenges; they test memory, logic, and long-term planning. Yet most benchmarks focus on simplified scenarios, failing to reflect the types of digital chores people often prefer to avoid. Furthermore, the limitations in these benchmarks become more apparent as agents improve their performance. Ambiguities in task instructions or inconsistencies in expected outputs begin to skew evaluations. When agents generate reasonable but slightly divergent answers, they are penalized incorrectly due to vague task definitions. Such flaws make it difficult to distinguish between true model limitations and benchmark shortcomings.


Previous efforts to evaluate web agents have focused on benchmarks such as WebArena. WebArena gained widespread adoption due to its reproducibility and ability to simulate real-world websites, including Reddit, GitLab, and E-Commerce Platforms. It offered over 800 tasks designed to test an agent’s ability to complete web-based goals within these environments. However, these tasks mostly focused on general browsing and did not adequately challenge more advanced agents. Other benchmarks, such as Mind2Web, GAIA, and MMIn, contributed by exploring real web tasks or platform-specific environments like ServiceNow, but each came with trade-offs. Some lacked interactivity, others did not support reproducibility, and some were too narrowly scoped. These limitations created a gap in measuring agent progress in areas that require complex decision-making, long-term memory, and accurate data processing across multiple webpages.

Researchers from the University of Tokyo introduced WebChoreArena. This expanded framework builds upon the structure of WebArena but significantly increases task difficulty and complexity. WebChoreArena features a total of 532 newly curated tasks, distributed across the same four simulated websites. These tasks are designed to be more demanding, reflecting scenarios where agents must engage in tasks like data aggregation, memory recall, and multi-step reasoning. Importantly, the benchmark was constructed to ensure full reproducibility and standardization, enabling fair comparisons between agents and avoiding the ambiguities found in earlier tools. The inclusion of diverse task types and input modalities helps simulate realistic web usage and evaluates agents on a more practical and challenging scale.




WebChoreArena categorizes its tasks into four main types. One hundred seventeen tasks fall under Massive Memory, requiring agents to extract and remember large volumes of information, such as compiling all customer names linked to high-value transactions. Calculation tasks, which include 132 entries, involve arithmetic operations like identifying the highest spending months based on multiple data points. Long-Term Memory tasks number 127 and test the agent’s ability to connect information across various pages, such as retrieving pricing rules from one site and applying them on another. An additional 65 tasks are categorized as ‘Others’, including operations such as assigning labels in GitLab that do not fit traditional task formats. Each task specifies its input modality, with 451 tasks solvable with any observation type, 69 requiring only textual input, and 12 dependent exclusively on image inputs.



In evaluating the benchmark, the researchers used three prominent large language models: GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These were tested in conjunction with two advanced web agents, AgentOccam and BrowserGym. The results highlighted the increased difficulty of WebChoreArena compared to previous benchmarks. GPT-4o, which had achieved 42.8% accuracy on WebArena, managed only 6.8% on WebChoreArena. Claude 3.7 Sonnet and Gemini 2.5 Pro performed better, with Gemini reaching a peak accuracy of 44.9%. Despite being the top performer, this result still reflected significant gaps in capability when dealing with the more complex tasks of WebChoreArena. The benchmark also proved more sensitive in detecting performance differences between models, making it a valuable tool for benchmarking ongoing advances in web agent technologies.



Several Key Takeaways from the research include:

  • WebChoreArena includes 532 tasks: 117 Massive Memory, 132 Calculation, 127 Long-Term Memory, and 65 Others.
  • Tasks are distributed across Shopping (117), Shopping Admin (132), Reddit (91), GitLab (127), and 65 Cross-site scenarios.
  • Input types: 451 tasks are solvable with any input, 69 require textual input, and 12 need image input.
  • GPT-4o scored only 6.8% on WebChoreArena compared to 42.8% on WebArena.
  • Gemini 2.5 Pro achieved the highest score at 44.9%, indicating current limitations in handling complex tasks.
  • WebChoreArena provides a clearer performance gradient between models than WebArena, enhancing benchmarking value.
  • A total of 117 task templates were used, each instantiated roughly 4.5 times, balancing diversity with reproducibility.
  • The benchmark demanded over 300 hours of annotation and refinement, reflecting its rigorous construction.
  • Evaluations utilize string matching, URL matching, and HTML structure comparisons to assess accuracy.
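
The last point can be illustrated with a rough sketch of the string- and URL-matching checks; WebChoreArena's released evaluator also compares HTML structure and may normalize answers differently, so treat this only as an outline.

```python
# Rough illustration of answer checking in the style described in the last bullet
# (exact-string and URL comparison); the HTML-structure check is omitted.
from urllib.parse import urlparse

def string_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def url_match(prediction: str, reference: str) -> bool:
    p, r = urlparse(prediction.strip()), urlparse(reference.strip())
    # Compare host and path; ignore scheme and trailing slashes.
    return (p.netloc, p.path.rstrip("/")) == (r.netloc, r.path.rstrip("/"))

print(string_match("  42 Orders ", "42 orders"))                                        # True
print(url_match("http://gitlab.example.com/repo/", "https://gitlab.example.com/repo"))  # True
```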

In conclusion, this research highlights the disparity between general browsing proficiency and the higher-order cognitive abilities necessary for web-based tasks. The newly introduced WebChoreArena stands as a robust and detailed benchmark designed specifically to push web agents into territories where they must rely on reasoning, memory, and logic. It replaces ambiguity with standardization, and its tasks mimic the digital drudgery that agents must learn to handle if they are to become truly useful in automating real-world activities.




Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project.


Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics​


By Asif Razzaq

June 3, 2025

Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. This limits experimentation to well-resourced labs and cloud environments, excluding practitioners working with lower-cost hardware. Additionally, much of the current progress in VLA research remains either proprietary or based on non-reproducible methodologies, impeding open research. Finally, data heterogeneity across robotic platforms—differences in morphology, sensors, and control modes—poses a further challenge to generalizability and cross-platform learning.

Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework


Hugging Face presents SmolVLA, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on single-GPU or CPU environments. The model architecture integrates a trimmed version of a pretrained vision-language model (SmolVLM-2) and a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs.




A distinguishing feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. SmolVLA is released under an open license with accompanying code, training data, and deployment tools.

Architectural Overview and Design Trade-Offs


The SmolVLA model is structured into two primary components:


  • Perception Module (SmolVLM-2): A pretrained compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. For efficiency, the model limits visual tokens through downsampling and only uses the lower half of transformer layers, based on empirical findings that earlier layers often yield more transferable features.
  • Action Expert: A lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. The action expert alternates between self-attention and cross-attention layers, balancing internal action coherence and conditioning on perception inputs. Causal masking is applied to enforce temporal consistency.

To reduce computational overhead, linear projections are used to align the modalities’ token dimensions. Action chunks are generated instead of single-step predictions, reducing the frequency of inference calls. The model is trained using bfloat16 precision and Torch’s JIT compilation for runtime optimization.
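
To see why chunked prediction reduces inference frequency, consider the toy control loop below. The chunk size, action dimension, and random stand-in policy are placeholders; the real action expert generates each chunk with flow matching.

```python
# Toy illustration of action chunking: one model call yields a chunk of H
# low-level actions, so the policy is queried every H control steps instead of
# every step. The random "policy" is only a stand-in for the VLA forward pass.
import numpy as np

CHUNK_SIZE = 8   # H actions per inference call (illustrative value)
ACTION_DIM = 6   # e.g. joint targets for a small arm

def predict_action_chunk(observation: np.ndarray) -> np.ndarray:
    """Stand-in for the VLA forward pass; returns (CHUNK_SIZE, ACTION_DIM)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(CHUNK_SIZE, ACTION_DIM))

observation = np.zeros(32)
inference_calls = 0
for step in range(100):
    if step % CHUNK_SIZE == 0:            # refill the buffer once per chunk
        chunk = predict_action_chunk(observation)
        inference_calls += 1
    action = chunk[step % CHUNK_SIZE]     # execute the next action in the chunk
print(inference_calls)                    # 13 calls instead of 100
```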

Empirical Evaluation: Simulation and Real-World Performance


SmolVLA is evaluated across both simulation benchmarks (LIBERO and Meta-World) and real-world robotic tasks using low-cost SO100 and SO101 platforms. The model is trained from scratch on ~23K episodes across 481 community datasets, with task labels auto-generated using a VLM. Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions.

In the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as π₀ (3.3B). In Meta-World, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels. These results are notable considering SmolVLA’s smaller training footprint and absence of robotics-specific pretraining.



In real-world settings, SmolVLA achieves average success rates of 78.3% across pick-place, stacking, and sorting tasks—outperforming both ACT (trained from scratch) and π₀ (finetuned). Moreover, SmolVLA generalizes across robotic embodiments, maintaining performance on SO101 despite training exclusively on SO100 data.

Performance Implications of Asynchronous Inference


SmolVLA’s asynchronous inference stack improves control efficiency by overlapping prediction and execution. Compared to traditional synchronous inference, this approach reduces average task time by ~30% and doubles the number of completed actions in fixed-time scenarios. This is particularly beneficial for edge deployments where inference delays degrade real-time performance.
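
A minimal sketch of that decoupling is shown below: a background thread keeps a one-slot queue of predicted chunks filled while the main loop executes actions. SmolVLA's released stack is more elaborate (a policy server, queue thresholds, and fresh observations per prediction), so this is only the control-flow skeleton.

```python
# Minimal sketch of asynchronous inference: the next action chunk is computed in
# a background thread while the current chunk is still being executed.
import queue
import threading
import time

def predict_chunk(observation: int) -> list[str]:
    time.sleep(0.05)                      # stand-in for model latency
    return [f"action_{observation}_{i}" for i in range(4)]

chunks = queue.Queue(maxsize=1)           # small buffer of ready-to-run chunks

def prediction_worker() -> None:
    for observation in range(5):
        chunks.put(predict_chunk(observation))   # blocks if the executor falls behind

threading.Thread(target=prediction_worker, daemon=True).start()

for _ in range(5):
    chunk = chunks.get()                  # this chunk was computed while we acted
    for action in chunk:
        time.sleep(0.01)                  # stand-in for sending a command to the robot
        print(action)
```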

Conclusion


SmolVLA demonstrates that compact, reproducible, and open-source VLA models can support competent robotic control on low-cost hardware. Through careful architectural choices—layer pruning, chunked action prediction, and asynchronous execution—SmolVLA maintains performance while significantly reducing computational demands.

The model’s open training and deployment stack, paired with real-world evaluations, offers a practical foundation for further research in efficient and accessible robot learning. Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data.




Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.
 