Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control


By Jean-marc Mommessin

June 13, 2025

Key Takeaways:


  • Researchers from Google DeepMind, the University of Michigan, and Brown University have developed “Motion Prompting,” a new method for controlling video generation using specific motion trajectories.
  • The technique uses “motion prompts,” a flexible representation of movement that can be either sparse or dense, to guide a pre-trained video diffusion model.
  • A key innovation is “motion prompt expansion,” which translates high-level user requests, like mouse drags, into detailed motion instructions for the model.
  • This single, unified model can perform a wide array of tasks, including precise object and camera control, motion transfer from one video to another, and interactive image editing, without needing to be retrained for each specific capability.

As generative AI continues to evolve, gaining precise control over video creation is a critical hurdle for its widespread adoption in markets like advertising, filmmaking, and interactive entertainment. While text prompts have been the primary method of control, they often fall short in specifying the nuanced, dynamic movements that make video compelling. A new paper from Google DeepMind, the University of Michigan, and Brown University, presented and highlighted at CVPR 2025, introduces a solution called “Motion Prompting” that offers an unprecedented level of control by allowing users to direct the action in a video using motion trajectories.

This new approach moves beyond the limitations of text, which struggles to describe complex movements accurately. For instance, a prompt like “a bear quickly turns its head” is open to countless interpretations. How fast is “quickly”? What is the exact path of the head’s movement? Motion Prompting addresses this by allowing creators to define the motion itself, opening the door for more expressive and intentional video content.


(Note: the results are not real time; processing takes roughly 10 minutes.)

Introducing Motion Prompts


At the core of this research is the concept of a “motion prompt.” The researchers identified that spatio-temporally sparse or dense motion trajectories—essentially tracking the movement of points over time—are an ideal way to represent any kind of motion. This flexible format can capture anything from the subtle flutter of hair to complex camera movements.

To enable this, the team trained a ControlNet adapter on top of a powerful, pre-trained video diffusion model called Lumiere. The ControlNet was trained on a massive internal dataset of 2.2 million videos, each with detailed motion tracks extracted by an algorithm called BootsTAP. This diverse training allows the model to understand and generate a vast range of motions without specialized engineering for each task.
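To make the motion-prompt idea more concrete, here is a minimal sketch of how spatio-temporally sparse point tracks (the kind a tracker like BootsTAP produces) could be packed into a tensor and rasterized into per-frame conditioning maps for a ControlNet-style adapter. The shapes, displacement encoding, and helper names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def rasterize_tracks(tracks, visibility, height, width):
    """Rasterize sparse point tracks into per-frame conditioning maps.

    tracks:     float array of shape (T, N, 2) -- (x, y) pixel positions of N
                tracked points over T frames (a "motion prompt").
    visibility: bool array of shape (T, N) -- whether each point is visible.
    Returns a (T, 2, H, W) array whose two channels hold each visible point's
    displacement relative to its first-frame position.
    """
    T, N, _ = tracks.shape
    cond = np.zeros((T, 2, height, width), dtype=np.float32)
    origin = tracks[0]                      # (N, 2) positions in frame 0
    for t in range(T):
        for n in range(N):
            if not visibility[t, n]:
                continue
            x, y = tracks[t, n]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < width and 0 <= yi < height:
                # Encode the displacement from the first frame at the point's
                # current location; a real system would likely use a richer
                # embedding and splat with a small Gaussian footprint.
                cond[t, :, yi, xi] = tracks[t, n] - origin[n]
    return cond

# Example: a single point drifting to the right across 16 frames.
T, N, H, W = 16, 1, 64, 64
tracks = np.stack([np.stack([np.full(N, 10.0 + 2.0 * t), np.full(N, 32.0)], axis=-1)
                   for t in range(T)])
vis = np.ones((T, N), dtype=bool)
cond = rasterize_tracks(tracks, vis, H, W)
print(cond.shape)  # (16, 2, 64, 64) -- conditioning signal for the adapter
```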



From Simple Clicks to Complex Scenes: Motion Prompt Expansion


Because specifying every point of motion in a complex scene would be impractical for a user, the researchers developed a process they call “motion prompt expansion.” This system translates simple, high-level user inputs into the detailed, semi-dense motion prompts the model needs.

This allows for a variety of intuitive applications:

“Interacting” with an Image: A user can simply click and drag their mouse across an object in a still image to make it move. For example, a user could drag a parrot’s head to make it turn, or “play” with a person’s hair, and the model generates a realistic video of that action. Interestingly, this process revealed emergent behaviors, where the model would generate physically plausible motion, like sand realistically scattering when “pushed” by the cursor.


Object and Camera Control: By interpreting mouse movements as instructions to manipulate a geometric primitive (like an invisible sphere), users can achieve fine-grained control, such as precisely rotating a cat’s head. Similarly, the system can generate sophisticated camera movements, like orbiting a scene, by estimating the scene’s depth from the first frame and projecting a desired camera path onto it (a rough sketch of this follows after these examples). The model can even combine these prompts to control an object and the camera simultaneously.


Motion Transfer: This technique allows the motion from a source video to be applied to a completely different subject in a static image. For instance, the researchers demonstrated transferring the head movements of a person onto a macaque, effectively “puppeteering” the animal.
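As a rough illustration of the depth-based camera control described above, the sketch below unprojects first-frame pixels using an estimated depth map, rotates the scene about its centroid to emulate an orbiting camera, and re-projects the points to obtain pixel trajectories that could serve as a motion prompt. The pinhole intrinsics, sampling stride, orbit angle, and flat dummy depth are all illustrative assumptions rather than the paper's actual procedure.

```python
import numpy as np

def orbit_tracks(depth, fx, fy, cx, cy, num_frames=16, max_angle_deg=20.0, stride=16):
    """Turn a first-frame depth map into point tracks for a camera orbit.

    depth: (H, W) depth estimated for the first frame (any monocular depth
           estimator could supply this -- an assumption here).
    Returns tracks of shape (num_frames, N, 2) in pixel coordinates.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H:stride, 0:W:stride]
    z = depth[ys, xs].ravel()
    # Unproject sampled pixels to 3D camera coordinates (pinhole model).
    X = (xs.ravel() - cx) / fx * z
    Y = (ys.ravel() - cy) / fy * z
    pts = np.stack([X, Y, z], axis=-1)                      # (N, 3)
    center = pts.mean(axis=0)                               # orbit about the scene centroid

    tracks = []
    for t in range(num_frames):
        theta = np.deg2rad(max_angle_deg) * t / max(num_frames - 1, 1)
        R = np.array([[np.cos(theta), 0, np.sin(theta)],
                      [0, 1, 0],
                      [-np.sin(theta), 0, np.cos(theta)]])  # yaw rotation
        # Rotating the scene about its centroid is equivalent to orbiting the
        # camera around it in the opposite direction.
        p = (pts - center) @ R.T + center
        u = fx * p[:, 0] / p[:, 2] + cx                     # re-project to pixels
        v = fy * p[:, 1] / p[:, 2] + cy
        tracks.append(np.stack([u, v], axis=-1))
    return np.stack(tracks)                                  # (num_frames, N, 2)

depth = np.full((64, 64), 2.0)                 # flat dummy depth, illustration only
tracks = orbit_tracks(depth, fx=60, fy=60, cx=32, cy=32)
print(tracks.shape)                            # (16, 16, 2) -- a dense-ish motion prompt
```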


Putting it to the Test


The team conducted extensive quantitative evaluations and human studies to validate their approach, comparing it against recent models like Image Conductor and DragAnything. In nearly all metrics, including image quality (PSNR, SSIM) and motion accuracy (EPE), their model outperformed the baselines.


A human study further confirmed these results. When asked to choose between videos generated by Motion Prompting and other methods, participants consistently preferred the results from the new model, citing better adherence to the motion commands, more realistic motion, and higher overall visual quality.

Limitations and Future Directions


The researchers are transparent about the system’s current limitations. Sometimes the model produces unnatural results, such as an object stretching when parts of it are mistakenly “locked” to the background. However, they suggest that these very failures can be used as a valuable tool to probe the underlying video model and identify weaknesses in its “understanding” of the physical world.

This research represents a significant step toward creating truly interactive and controllable generative video models. By focusing on the fundamental element of motion, the team has unlocked a versatile and powerful tool that could one day become a standard for professionals and creatives looking to harness the full potential of AI in video production.




Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


 

Google AI Unveils a Hybrid AI-Physics Model for Accurate Regional Climate Risk Forecasts with Better Uncertainty Assessment


By Sana Hassan

June 12, 2025

Limitations of Traditional Climate Modeling


Earth system models are essential tools for forecasting environmental changes and helping us prepare for the future. However, their high computational demands make it difficult to run them at resolutions fine enough for detailed, local predictions. Currently, most models are limited to a resolution around 100 kilometers—roughly the size of Hawai’i—making it hard to generate accurate projections for specific regions. Yet, city-scale forecasts at approximately 10 kilometers are vital for real-world applications, such as agriculture, water resource planning, and disaster preparedness. Improving the resolution of these models is key to better protecting communities and supporting more effective local decision-making.

Introducing Dynamical-Generative Downscaling with AI


Researchers at Google have introduced a method that combines traditional physics-based climate modeling with generative AI to assess regional environmental risks. Published in PNAS, their approach—called dynamical-generative downscaling—utilizes diffusion models, a type of AI that learns complex patterns, to convert broad global climate projections into detailed, local predictions at a resolution of approximately 10 km. This method not only bridges the gap between large-scale models and real-world decision-making needs but also does so far more efficiently and affordably than current high-resolution techniques, making it feasible to apply across the growing volume of climate data now available.


To better understand local environmental changes at fine resolutions (around 10 km), scientists typically use a method called dynamical downscaling. This process takes broad data from global climate models and refines it using regional climate models, like zooming in on a worldwide map to see more detail. While this technique provides highly accurate local forecasts by factoring in terrain and regional weather patterns, it comes at a steep computational cost, making it too slow and expensive to apply broadly across many climate scenarios. Simpler statistical methods are faster but often fail to model extreme events or reliably adapt to new future conditions.

Improving Accuracy and Efficiency with R2D2


To overcome these challenges, researchers have introduced a more efficient method that merges the strengths of physics-based models with generative AI. This two-step process begins with a physics-based simulation that downscales global data to a mid-level resolution, ensuring consistency across different global models. Then, a generative AI model called R2D2 fills in the finer details—like small-scale weather features shaped by terrain—by learning from high-resolution examples. By focusing on the differences between medium and high resolutions, R2D2 improves accuracy and generalizes well to unseen scenarios. This combined approach enables faster, cost-effective, and realistic local climate projections across a wide range of future scenarios.
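A minimal sketch of the residual setup described above, under simplifying assumptions (nearest-neighbour regridding, synthetic fields, made-up resolutions): the generative model's training target is the difference between the high-resolution truth and the upsampled physics-based mid-resolution field, and the final prediction adds a sampled residual back onto that field. None of the variable names or numbers come from the paper.

```python
import numpy as np

def upsample_nearest(field, factor):
    """Nearest-neighbour upsampling of a 2-D field (stand-in for a proper regridder)."""
    return np.repeat(np.repeat(field, factor, axis=0), factor, axis=1)

# Illustrative fields: a mid-resolution temperature field from the physics-based
# downscaling step, and a high-resolution "truth" used during training.
rng = np.random.default_rng(0)
mid_res  = rng.normal(15.0, 3.0, size=(25, 25))                                   # coarser grid
high_res = upsample_nearest(mid_res, 5) + rng.normal(0.0, 1.0, size=(125, 125))   # ~5x finer grid

# Training target for the generative model (R2D2 in the paper): the residual
# between the high-resolution field and the upsampled mid-resolution field.
residual_target = high_res - upsample_nearest(mid_res, 5)

# At inference time, a trained diffusion model would sample a plausible residual;
# adding it back onto the upsampled physics field gives the fine-scale projection.
# Here we simply reuse the true residual to show the wiring.
sampled_residual = residual_target
prediction = upsample_nearest(mid_res, 5) + sampled_residual
print(prediction.shape)  # (125, 125)
```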


To test the new approach, researchers trained the model using one high-resolution climate projection from the Western U.S. and then evaluated it on seven others. Compared to traditional statistical methods, their AI-powered downscaling model significantly reduced errors by over 40% in predicting variables like temperature, humidity, and wind. It also more accurately captured complex weather patterns, like heatwaves combined with droughts or wildfire risks from strong winds. This method enhances both accuracy and efficiency, providing more accurate estimates of extreme weather and uncertainty while utilizing only a fraction of the computing power required by traditional high-resolution simulations.


In conclusion, the new AI-powered downscaling approach is a major leap forward in making detailed, regional climate forecasts more accessible and affordable. By combining traditional physics-based modeling with generative AI, the method delivers accurate, city-scale (~10 km) climate risk assessments while cutting computing costs by up to 85%. Unlike older methods, which are limited by scale and expense, this technique can efficiently handle large ensembles of climate projections. It captures uncertainties more comprehensively and supports smarter planning in agriculture, disaster preparedness, water management, and infrastructure. In short, it turns complex global data into actionable local insights—faster, cheaper, and more accurately than ever before.




Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.


 


Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation


By Nikhil

June 12, 2025

Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. The focus has moved from generating accurate outputs to understanding the process that leads to these answers. This shift has raised questions about how these models manage tasks with layered complexity and whether they truly possess reasoning abilities or are simply leveraging training patterns to guess outcomes.

Redefining Evaluation: Moving Beyond Final Answer Accuracy


A recurring problem with evaluating machine reasoning is that traditional benchmarks mostly assess the final answer without examining the steps involved in arriving at it. Final answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated with data that may have been seen during training. This creates a misleading picture of a model’s true capabilities. To explore actual reasoning, researchers require environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without such settings, it is hard to determine whether these models can generalize solutions or merely memorize patterns.



To evaluate reasoning more reliably, the research team at Apple designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checkers Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning abilities, such as constraint satisfaction and sequential planning. Importantly, these environments are free from typical data contamination, enabling thorough checks of both outcomes and the reasoning steps in between. This method ensures a detailed investigation of how models behave across varied task demands.
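To illustrate why such puzzle environments are useful, here is a minimal Tower of Hanoi example: difficulty scales cleanly with the number of disks (the optimal solution takes 2^n − 1 moves), and every intermediate move in a model's proposed solution can be checked mechanically. The evaluation interface shown is a simplified assumption, not Apple's actual harness.

```python
def hanoi_moves(n, src=0, aux=1, dst=2):
    """Optimal Tower of Hanoi solution: 2**n - 1 moves for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def check_solution(n, moves):
    """Replay a proposed move list and verify every intermediate step.

    Returns (solved, index_of_first_illegal_move_or_None) -- exactly the kind
    of step-level signal that final-answer benchmarks cannot provide.
    """
    pegs = [list(range(n, 0, -1)), [], []]       # disk n at the bottom of peg 0
    for i, (a, b) in enumerate(moves):
        if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
            return False, i                      # illegal move at step i
        pegs[b].append(pegs[a].pop())
    return pegs[2] == list(range(n, 0, -1)), None

n = 4
moves = hanoi_moves(n)
print(len(moves), check_solution(n, moves))      # 15 (True, None), i.e. 2**4 - 1 moves
```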

The study compared two model families, Claude 3.7 Sonnet and DeepSeek-R1, pitting their “thinking” variants against their standard LLM counterparts. The models were tested across the puzzles under identical token budgets to measure both accuracy and reasoning efficiency, revealing performance shifts across low-, medium-, and high-complexity tasks. A striking observation was the emergence of three performance regimes: on simple tasks, non-thinking models outperformed reasoning variants; at medium complexity, reasoning models gained an edge; and at high complexity, both types collapsed completely.



Comparative Insights: Thinking vs. Non-Thinking Models Under Stress


An in-depth analysis revealed that reasoning effort increased with task difficulty up to a certain point but then declined despite the availability of resources. For instance, in the Tower of Hanoi, Claude 3.7 Sonnet (thinking) maintained high accuracy until complexity reached a certain threshold, after which performance dropped to zero. Even when these models were supplied with explicit solution algorithms, they failed to execute steps beyond specific complexity levels. In one case, Claude 3.7 could manage around 100 steps correctly for the Tower of Hanoi but was unable to complete simpler River Crossing tasks requiring only 11 moves when N = 3. This inconsistency exposed serious limitations in symbolic manipulation and exact computation.

The performance breakdown also highlighted how LRMs handle their internal thought process. Models frequently engaged in “overthinking,” generating correct intermediate solutions early in the process but continuing to explore incorrect paths. This led to inefficient use of tokens. At medium complexity levels, models began to find correct answers later in their reasoning chains. However, at high levels of complexity, they failed to produce accurate solutions. Quantitative analysis confirmed that solution accuracy dropped to near zero as the problem complexity increased, and the number of reasoning tokens allocated began to decline unexpectedly.


Scaling Limits and the Collapse of Reasoning


This research presents a sobering assessment of how current Large Reasoning Models (LRMs) operate. Research from Apple makes it clear that, despite some progress, today’s reasoning models are still far from achieving generalized reasoning. The work identifies how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. Controlled puzzle environments have proven to be a powerful tool for uncovering hidden weaknesses in these systems and emphasizing the need for more robust designs in the future.




Check out the Paper. All credit for this research goes to the researchers of this project.


 


This AI Paper Introduces VLM-R³: A Multimodal Framework for Region Recognition, Reasoning, and Refinement in Visual-Linguistic Tasks


By Nikhil

June 12, 2025

Multimodal reasoning ability helps machines perform tasks such as solving math problems embedded in diagrams, reading signs from photographs, or interpreting scientific charts. The integration of both visual and linguistic information enables these systems to more closely mirror human thought processes, making them suitable for tasks that require visual interpretation combined with logical progression.

A major challenge in this area is the inability of current systems to revisit specific parts of an image while reasoning dynamically. Traditional models usually begin by analyzing an image once and then proceed with the rest of the reasoning in pure text. This approach limits accuracy in situations that require revisiting the image to confirm a detail or extract new visual cues during mid-reasoning. These shortcomings are particularly pronounced in tasks that require fine-grained spatial awareness, such as identifying small labels in scientific documents or resolving ambiguities in visually complex scenes.



Some tools and models have been introduced to address this gap, but they often treat visual grounding as a one-time operation. For example, existing systems like LLaVA-CoT or Qwen2.5-VL offer some visual-text integration. Still, they don’t let the model repeatedly and selectively query parts of an image based on the evolving reasoning process. The grounding, if performed, is generally static and lacks the flexibility to adapt based on intermediate reasoning steps. Moreover, these methods do not train models to determine the importance of specific image regions, leading to limitations in complex problem-solving.

Researchers from Peking University, Alibaba Group, and ZEEKR Intelligent Technology have introduced a model called VLM-R³. This model tackles the challenge by allowing a more interactive connection between vision and reasoning. It equips the model with the capacity to determine when visual clarification is needed, identify the exact image region for analysis, and re-integrate this visual content into the reasoning process. This approach mimics human problem-solving, where one might zoom into a chart or revisit a paragraph to verify a detail before making a decision. The model’s structure emphasizes refining its decisions iteratively by relying on visual evidence throughout the reasoning process.



To accomplish this, the researchers built a dataset named Visuo-Lingual Interleaved Rationale (VLIR), designed to train models in a stepwise interaction between images and text. VLM-R³ incorporates this dataset and operates using a method called Region-Conditioned Reinforcement Policy Optimization (R-GRPO). This training strategy encourages the model to selectively focus on informative parts of an image, perform transformations such as cropping or zooming, and incorporate those changes into subsequent logical steps. It simulates how humans shift their attention across different visual elements in response to their thoughts. The architecture integrates a pipeline that loops reasoning with visual inspection in real time, enhancing the system’s ability to interact with visual data during inference.
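The loop below sketches the interleaved "reason, request a region, look again" behavior described above. The `vlm_step` stub and the `<region>` tag format are hypothetical stand-ins for the real VLM-R³ model and its interface; only the control flow (crop the requested region and feed it back into the next reasoning step) is meant to be illustrative.

```python
import re
from PIL import Image

_calls = {"n": 0}

def vlm_step(image_crops, transcript):
    """Hypothetical VLM call (stub). A real system would run the VLM-R3 model here;
    this stub first requests a region, then answers, purely for illustration."""
    _calls["n"] += 1
    if _calls["n"] == 1:
        return "The label is too small to read. <region>10,10,60,60</region>"
    return "Final answer: the label reads 'A'."

def interleaved_reasoning(image, question, max_rounds=4):
    crops = [image]                          # the full image stays in context
    transcript = question
    for _ in range(max_rounds):
        out = vlm_step(crops, transcript)
        transcript += "\n" + out
        match = re.search(r"<region>(\d+),(\d+),(\d+),(\d+)</region>", out)
        if match is None:
            return out                        # the model committed to an answer
        # Crop the requested region and hand it back, so the next reasoning
        # step is conditioned on the zoomed-in visual evidence.
        left, top, right, bottom = map(int, match.groups())
        crops.append(image.crop((left, top, right, bottom)))
    return transcript

image = Image.new("RGB", (128, 128), color="white")
print(interleaved_reasoning(image, "What does the small label say?"))
```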

The results demonstrate a strong performance across multiple benchmarks. On MathVista, the model reached 70.4%, an increase from 68.2% in the baseline. For MathVision, the improvement was from 25.1% to 30.2%. On ScienceQA, it posted a 14.3% improvement, reaching 87.9% over the baseline’s 73.6%. On the hallucination test (HallusionBench), the model achieved 62.0%, outperforming others like Mulberry, which scored 54.1%. VLM-R³ also showed superior results on document understanding in DocVQA with a 96.8% score. Comparisons showed that even though it uses fewer parameters than closed-source models like Gemini-2 Flash or GPT-4o, it delivers competitive accuracy, particularly in tasks requiring detailed visual analysis and interleaved reasoning.


This work clearly outlines a problem that exists in how models handle vision during reasoning and presents a well-structured solution. By integrating a method for ongoing image analysis, researchers from the Alibaba Group, Peking University, and ZEEKR have advanced a powerful idea—models that look again, think, and refine. The proposed framework significantly improves accuracy in complex tasks and provides a blueprint for more robust, visually aware AI systems.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


 


Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning


By Asif Razzaq

June 12, 2025

Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield a modular foundation for intelligent physical agents.


Scalable Self-Supervised Pretraining from 1M Hours of Video


V-JEPA 2 is pretrained on over 1 million hours of internet-scale video combined with 1 million images. Using a visual mask denoising objective, the model learns to reconstruct masked spatiotemporal patches in a latent representation space. This approach avoids the inefficiencies of pixel-level prediction by focusing on predictable scene dynamics while disregarding irrelevant noise.
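A heavily simplified PyTorch sketch of a JEPA-style masked prediction objective in latent space: a predictor regresses the representations of masked spatiotemporal patches from the visible context, with targets produced by a separate target encoder, so no pixels are ever reconstructed. Module sizes, the masking scheme, and the L2 loss are illustrative assumptions; the actual V-JEPA 2 architecture and objective differ in many details.

```python
import torch
import torch.nn as nn

D = 256                                    # latent dimension (illustrative)
encoder        = nn.Sequential(nn.Linear(768, D), nn.GELU(), nn.Linear(D, D))
target_encoder = nn.Sequential(nn.Linear(768, D), nn.GELU(), nn.Linear(D, D))
predictor      = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
target_encoder.load_state_dict(encoder.state_dict())   # stand-in for an EMA/frozen target copy

patches = torch.randn(2, 128, 768)         # (batch, spatio-temporal patches, patch dim)
mask = torch.rand(2, 128) < 0.75           # mask ~75% of patches (assumption)

with torch.no_grad():
    targets = target_encoder(patches)      # latent targets for *all* patches

# Crude simplification: zero out masked patches; a real model drops masked
# tokens and uses positional queries for the predictor.
context = encoder(patches * (~mask).unsqueeze(-1))
pred = predictor(context)                  # predict latents at masked positions

# Loss is computed in representation space, only over masked positions --
# the model never reconstructs pixels.
loss = ((pred - targets) ** 2)[mask].mean()
loss.backward()
print(float(loss))
```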


To scale JEPA pretraining to this level, Meta researchers introduced four key techniques:

  • Data scaling: Constructed a 22M-sample dataset (VideoMix22M) from public sources like SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
  • Model scaling: Expanded the encoder capacity to over 1B parameters using ViT-g.
  • Training schedule: Adopted a progressive resolution strategy and extended pretraining to 252K iterations.
  • Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips, reaching 64 frames at 384×384 resolution.

These design choices led to an 88.2% average accuracy across six benchmark tasks—including SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet—surpassing previous baselines.


Understanding via Masked Representation Learning


V-JEPA 2 exhibits strong motion understanding capabilities. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, outperforming models like InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretraining models like DINOv2 and PEcoreG.

The encoder’s representations were evaluated using attentive probes, verifying that self-supervised learning alone can yield transferable and domain-agnostic visual features applicable across diverse classification tasks.

Temporal Reasoning via Video Question Answering


To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite lacking language supervision during pretraining, the model achieves:

  • 84.0% on PerceptionTest
  • 76.9% on TempCompass
  • 44.5% on MVP
  • 36.7% on TemporalBench
  • 40.3% on TOMATO

These results challenge the assumption that visual-language alignment requires co-training from the start, demonstrating that a pretrained video encoder can be aligned post hoc with strong generalization.

V-JEPA 2-AC: Learning Latent World Models for Robotic Planning


A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the Droid dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M parameter transformer with block-causal attention, trained using a teacher-forcing and rollout objective.

This allows zero-shot planning through model-predictive control. The model infers action sequences by minimizing the distance between imagined future states and visual goals using the Cross-Entropy Method (CEM). It achieves high success in tasks such as reaching, grasping, and pick-and-place on unseen robot arms in different labs—without any reward supervision or additional data collection.
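The following is a minimal sketch of planning with the Cross-Entropy Method over a learned latent world model, in the spirit of the procedure described above: sample action sequences, roll them out in imagination, score each by the distance between the predicted final state and the goal embedding, and refit the sampling distribution on the best candidates. The rollout interface, dimensions, and toy world model are assumptions for illustration only.

```python
import numpy as np

def rollout(world_model, state_emb, actions):
    """Predict the final latent state after applying an action sequence.
    `world_model(state, action) -> next_state` is a stand-in for V-JEPA 2-AC."""
    s = state_emb
    for a in actions:
        s = world_model(s, a)
    return s

def cem_plan(world_model, state_emb, goal_emb, horizon=5, action_dim=7,
             pop=128, elites=16, iters=6, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences and score them by how close the
        # imagined final state lands to the goal embedding.
        cands = rng.normal(mean, std, size=(pop, horizon, action_dim))
        costs = np.array([np.linalg.norm(rollout(world_model, state_emb, c) - goal_emb)
                          for c in cands])
        best = cands[np.argsort(costs)[:elites]]       # keep the elites
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-4
    return mean                                        # planned action sequence

# Toy stand-in world model: latent state drifts by a linear function of the action.
A = np.random.default_rng(1).normal(size=(7, 16)) * 0.1
toy_model = lambda s, a: s + a @ A
start, goal = np.zeros(16), np.ones(16)
plan = cem_plan(toy_model, start, goal)
print(plan.shape)                                      # (5, 7)
```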


Benchmarks: Robust Performance and Planning Efficiency


Compared to baselines like Octo (behavior cloning) and Cosmos (latent diffusion world models), V-JEPA 2-AC:

  • Executes plans in ~16 seconds per step (versus 4 minutes for Cosmos).
  • Reaches a 100% success rate on reach tasks.
  • Outperforms others in grasp and manipulation tasks across object types.


Notably, it operates using a monocular RGB camera without calibration or environment-specific fine-tuning, reinforcing the generalization capability of the learned world model.

Conclusion


Meta’s V-JEPA 2 represents a significant advancement in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and control in the real world.




Check out the Paper, Models on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.


 


How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge


By Sana Hassan

June 11, 2025

Unpacking Reasoning in Modern LLMs: Why Final Answers Aren’t Enough


Recent advancements in reasoning-focused LLMs like OpenAI’s o1/o3 and DeepSeek-R1 have led to notable improvements on complex tasks. However, the step-by-step reasoning behind these models remains unclear. Most evaluations focus on final-answer accuracy, which hides the reasoning process and doesn’t reveal how models combine knowledge and logic. Some earlier methods attempt to measure reasoning by comparing answers to the original question, but this approach is flawed since models often rely on prior deductions or internal knowledge. Domains such as math and medicine differ in their reasoning needs, highlighting the importance of developing better, domain-aware evaluation methods for building trustworthy AI.

The Shortcomings of Final-Answer Evaluations in Math and Medicine


Recent LLMs have made impressive strides in reasoning tasks, especially in math and medicine, thanks to better training data and reward strategies. However, most of this progress focuses on boosting final answer accuracy rather than understanding how the model reasons step-by-step. Past work has flagged factual errors in reasoning chains or measured similarity between reasoning steps and the original question. But such similarity doesn’t guarantee logical soundness or factual correctness, since LLMs often draw on internal knowledge or earlier reasoning.


A New Framework for Separating Knowledge and Logic in LLM Reasoning


Researchers from UC Santa Cruz, Stanford, and Tongji University go beyond final-answer evaluation by breaking down LLM reasoning into two key parts: factual knowledge and logical steps. They introduce a detailed framework that utilizes two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. Their analysis of Qwen models across math and medical tasks reveals that reasoning skills don’t easily transfer between domains. While supervised fine-tuning improves accuracy, it often harms reasoning depth. Reinforcement learning, however, helps refine reasoning by removing irrelevant information. This work highlights the importance of evaluating and training LLMs more thoughtfully.

Assessing Reasoning with Qwen2.5-7B and DeepSeek-R1 Models


The researchers evaluate reasoning in LLMs by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled version, trained with SFT and RL. Using tasks from both math and medical domains, they decompose responses into logical steps and assess them using two key metrics: Information Gain (how much uncertainty is reduced with each reasoning step) and Knowledge Index (how factually accurate each step is, verified against expert sources). While InfoGain tracks the informativeness of each step, KI checks whether the knowledge aligns with real-world facts. This approach reveals how models reason and where they may falter in accuracy or logic.
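One plausible way to read the two metrics, sketched below with made-up numbers: Information Gain as the reduction in the model's uncertainty over candidate answers after each reasoning step, and the Knowledge Index as the fraction of steps whose factual claims check out against expert sources. The paper's exact estimators may well differ; this is only meant to make the distinction concrete.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

# Model's probability over candidate answers after each reasoning step
# (illustrative numbers; in practice these would come from the LLM itself).
answer_dist_per_step = [
    [0.25, 0.25, 0.25, 0.25],   # before reasoning: maximally uncertain
    [0.40, 0.30, 0.20, 0.10],   # step 1 narrows things down
    [0.70, 0.15, 0.10, 0.05],   # step 2 narrows further
    [0.90, 0.05, 0.03, 0.02],   # step 3 nearly commits
]
# Whether each step's factual claim checked out against an expert source
# (again illustrative; the paper verifies steps against ground-truth knowledge).
step_is_factual = [True, True, False]

info_gain = [entropy(answer_dist_per_step[t]) - entropy(answer_dist_per_step[t + 1])
             for t in range(len(answer_dist_per_step) - 1)]
knowledge_index = np.mean(step_is_factual)

print([round(g, 3) for g in info_gain])   # per-step uncertainty reduction
print(knowledge_index)                    # fraction of factually correct steps
```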


Supervised Fine-Tuning vs. Reinforcement Learning in Domain-Specific Tasks


The study evaluates two variants of Qwen2.5-7B (Qwen-Base and the distilled Qwen-R1) on medical tasks. Results show that Qwen-Base consistently outperforms Qwen-R1 in accuracy, knowledge retention, and reasoning, especially after SFT and RL. The distilled model likely struggles due to prior training focused on math and code, resulting in a domain mismatch. Interestingly, SFT enhances medical knowledge more effectively than RL, although it may slightly compromise reasoning efficiency. RL, on the other hand, improves both reasoning and knowledge when applied post-SFT. Medical benchmarks tend to rely more on factual knowledge than abstract reasoning, unlike math-focused tasks.

Conclusion: Toward More Interpretable and Trustworthy LLMs


In conclusion, the study introduces a framework that separates knowledge from reasoning to better evaluate how LLMs think, particularly in high-stakes areas like medicine and math. Using Qwen models trained with SFT and RL, the researchers found that while SFT improves factual accuracy, essential in medicine, it often weakens reasoning. RL, however, enhances reasoning by trimming out incorrect information. The framework could be extended to fields such as law or finance, where structured thinking is crucial. Overall, this approach helps clarify how LLMs make decisions and suggests ways to tailor their training for specific domains.




Check out the Paper, Code, and Project Page. All credit for this research goes to the researchers of this project.


 


ether0: A 24B LLM Trained with Reinforcement Learning (RL) for Advanced Chemical Reasoning Tasks


By Sajjad Ansari

June 10, 2025

LLMs have primarily improved accuracy by scaling pre-training data and compute. Because that data is finite, attention has shifted toward alternative forms of scaling, such as test-time training and inference-time compute. Reasoning models improve performance by emitting a thought process before the answer, initially through CoT prompting and, more recently, through reinforcement learning (RL) post-training. Scientific domains present ideal opportunities for reasoning models because they involve “inverse problems,” where assessing a solution’s quality is straightforward but generating the solution is hard. Despite this conceptual alignment between structured scientific reasoning and model capabilities, current methods lack detailed approaches for scientific reasoning beyond multiple-choice benchmarks.

Technical Evolution of Reasoning Architectures


Reasoning models have evolved from early prompt-based methods such as CoT, zero-shot CoT, and Tree of Thought to more complex RL approaches like Group Relative Policy Optimization (GRPO) and inference-time scaling. In chemistry, however, existing reasoning models focus on knowledge-based benchmarks rather than complex reasoning tasks such as retrosynthesis or molecular design. Datasets such as GPQA-D and MMLU assess chemical knowledge but fail to evaluate complex chemical reasoning capabilities. Current scientific reasoning efforts also remain fragmented: limited attempts include OmniScience for general science, Med-R1 for medical vision-language tasks, and BioReason for genomic reasoning, yet no comprehensive framework exists for training large-scale chemical reasoning models.


ether0 Architecture and Design Principles


Researchers from FutureHouse have proposed ether0, a novel model that reasons in natural language and outputs molecular structures as SMILES strings, demonstrating the efficacy of reasoning models on chemical tasks. It outperforms frontier LLMs, human experts, and general chemistry models. The training approach applies several optimizations over vanilla RL, including distillation of reasoning behavior, a dynamic curriculum, and expert model initialization, to improve efficiency and effectiveness. The researchers also analyze data efficiency, failure modes, and reasoning behavior, giving a clearer picture of how reasoning helps in solving chemistry problems.


Training Pipeline: Distillation and GRPO Integration


The model employs a multi-stage training procedure that alternates between distillation and GRPO phases. The architecture introduces four special tokens that demarcate reasoning and answer boundaries. Training begins with SFT on long CoT sequences generated by DeepSeek-R1, filtered for valid SMILES format and reasoning quality. Specialist RL then optimizes task-specific policies for different problem categories using GRPO. Next, distillation merges the specialist models into a generalist through SFT on correct responses collected throughout training. The final phase applies generalist GRPO to the merged model, with continuous quality filtering to remove low-quality reasoning and undesirable molecular substructures.
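Chemistry suits RL post-training because answers can be verified programmatically. The snippet below sketches one such verifiable reward using RDKit: parse the answer as SMILES, reject invalid molecules, and optionally require a substructure. The `<answer>` tag format and the reward values are assumptions for illustration and do not reflect ether0's actual special tokens or reward functions.

```python
import re
from rdkit import Chem

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)  # tag format is an assumption

def smiles_reward(completion, require_substructure=None):
    """Reward = 0.0 for unparseable output, 0.5 for a valid molecule,
    1.0 if it also contains a required substructure (given as SMARTS)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    mol = Chem.MolFromSmiles(match.group(1).strip())
    if mol is None:
        return 0.0                      # not a valid molecule
    if require_substructure is None:
        return 1.0
    patt = Chem.MolFromSmarts(require_substructure)
    return 1.0 if mol.HasSubstructMatch(patt) else 0.5

print(smiles_reward("<answer>c1ccccc1O</answer>", require_substructure="[OX2H]"))  # 1.0 (phenol has an -OH)
print(smiles_reward("<answer>not a molecule</answer>"))                            # 0.0
```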


Performance Evaluation and Comparative Benchmarks


Ether0 demonstrates superior performance against both general-purpose LLMs, such as Claude and o1, and chemistry-specific models, including ChemDFM and TxGemma. It achieves the highest accuracy across all open-answer categories while remaining competitive on multiple-choice questions. It is also far more data-efficient than traditional molecular transformer models: trained on only 60,000 reactions rather than the full USPTO dataset, ether0 reaches 70% accuracy after seeing 46,000 training examples, whereas molecular transformers achieved 64.1% on the complete dataset. Under one-shot prompting conditions, ether0 surpasses all evaluated frontier models, and its safety alignment procedures filter 80% of unsafe questions without degrading performance on core chemistry tasks.


Conclusion: Implications for Future Scientific LLMs


In conclusion, researchers introduced ether0, a 24B-parameter model trained on ten challenging molecular tasks that significantly outperforms frontier LLMs, domain experts, and specialized models, thanks to its interleaved RL and behavior-distillation pipeline. The model exhibits exceptional data efficiency and reasoning capability, excelling at open-answer chemistry tasks involving molecular design, completion, modification, and synthesis. Limitations include potential generalization challenges beyond organic chemistry, some loss of general instruction-following, and the absence of tool-calling integration. The release of model weights, benchmark data, and reward functions establishes a foundation for advancing scientific reasoning models across diverse domains.




Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.


 