bnew

Veteran
Joined
Nov 1, 2015
Messages
68,641
Reputation
10,572
Daps
185,437











1/37
@alexwei_
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).



GwLl5lhXIAAXl5p.jpg


2/37
@alexwei_
2/N We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.



GwLmFYaW8AAvAA1.png


3/37
@alexwei_
3/N Why is this a big deal? First, IMO problems demand a new level of sustained creative thinking compared to past benchmarks. In reasoning time horizon, we’ve now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins).



4/37
@alexwei_
4/N Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.



GwLm5SCbIAAfkMF.png

GwLtrPeWIAUMDYI.png

GwLuozlWcAApPEc.png

GwLuqR1XQAAdSl6.png


5/37
@alexwei_
5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.



6/37
@alexwei_
6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold! 🥇



7/37
@alexwei_
7/N HUGE congratulations to the team—@SherylHsu02, @polynoamial, and the many giants whose shoulders we stood on—for turning this crazy dream into reality! I am lucky I get to spend late nights and early mornings working alongside the very best.



8/37
@alexwei_
8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.



9/37
@alexwei_
9/N Still—this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor @JacobSteinhardt had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.



GwLv06_bwAAZbsl.jpg


10/37
@alexwei_
10/N If you want to take a look, here are the model’s solutions to the 2025 IMO problems! The model solved P1 through P5; it did not produce a solution for P6. (Apologies in advance for its … distinct style—it is very much an experimental model 😅)

GitHub - aw31/openai-imo-2025-proofs



11/37
@alexwei_
11/N Lastly, we'd like to congratulate all the participants of the 2025 IMO on their achievement! We are proud to have many past IMO participants at @OpenAI and recognize that these are some of the brightest young minds of the future.



12/37
@burny_tech
Soooo what is the breakthrough?
>"Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians."
>"We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling."



GwNe8zWXUAARRMA.jpg


13/37
@burny_tech
so let me get this straight

their model basically competed live on IMO so all the mathematical tasks should be novel enough

all previous years IMO tasks in benchmarks are fully saturated in big part because of data contamination as it doesn't generalize to these new ones

so... this new model seems to... generalize well to novel enough mathematical tasks??? i dont know what to think



14/37
@AlbertQJiang
Congratulations!



15/37
@geo58928
Amazing



16/37
@burny_tech
So public AI models are bad at IMO, while internal models are getting gold medals? Fascinating



GwNY40YXYAEUP0W.jpg


17/37
@mhdfaran
@grok who was on second and third



18/37
@QuanquanGu
Congrats, this is incredible results!
Quick question: did it use Lean, or just LLM?
If it’s just LLM… that’s insane.



19/37
@AISafetyMemes
So what's the next goalpost?

What's the next thing LLMs will never be able to do?



20/37
@kimmonismus
Absolutely fantastic



21/37
@CtrlAltDwayne
pretty impressive. is this the anonymous chatbot we're seeing on webdev arena by chance?



22/37
@burny_tech
lmao



GwNf8atXAAEAgAF.jpg


23/37
@jack_w_rae
Congratulations! That's an incredible result, and a great moment for AI progress. You guys should release the model



24/37
@Kyrannio
Incredible work.



25/37
@burny_tech
Sweet Bitter lesson



GwNofKJX0AAknI6.png


26/37
@burny_tech
"We developed new techniques that make LLMs a lot better at hard-to-verify tasks."
A general method? Or just for mathematical proofs? Is Lean somehow used, maybe just in training?



27/37
@elder_plinius




28/37
@skominers
🙌🙌🙌🙌🙌🙌



29/37
@javilopen
Hey @GaryMarcus, what are your thoughts about this?



30/37
@pr0me
crazy feat, congrats!
nice that you have published the data on this



31/37
@danielhanchen
Impressive!



32/37
@IamEmily2050
Congratulations 🙏



33/37
@burny_tech
Step towards mathematical superintelligence



34/37
@reach_vb
Massive feat! I love how concise and to the point the generations are unlike majority of LLMs open/ closed alike 😁



35/37
@DCbuild3r
Congratulations!



36/37
@DoctorYev
I just woke up and this post has 1M views after a few hours.

AI does not sleep.



37/37
@AndiBunari1
@grok summarize this and simple to understand




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196










1/10
@polynoamial
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵

[Quoted tweet]
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).


GwLl5lhXIAAXl5p.jpg


2/10
@polynoamial
Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.



3/10
@polynoamial
So what’s different? We developed new techniques that make LLMs a lot better at hard-to-verify tasks. IMO problems were the perfect challenge for this: proofs are pages long and take experts hours to grade. Compare that to AIME, where answers are simply an integer from 0 to 999.



4/10
@polynoamial
Also this model thinks for a *long* time. o1 thought for seconds. Deep Research for minutes. This one thinks for hours. Importantly, it’s also more efficient with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

[Quoted tweet]
@OpenAI's o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks. Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots


GXSs0RuWkAElX2T.png


5/10
@polynoamial
It’s worth reflecting on just how fast AI progress has been, especially in math. In 2024, AI labs were using grade school math (GSM8K) as an eval in their model releases. Since then, we’ve saturated the (high school) MATH benchmark, then AIME, and now are at IMO gold.



6/10
@polynoamial
Where does this go? As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery. There’s a big difference between AI slightly below top human performance vs slightly above.



7/10
@polynoamial
This was a small team effort led by @alexwei_. He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community.



8/10
@polynoamial
When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is.



9/10
@posedscaredcity
But yann lec00n says accuracy scales inversely to output length and im sure industry expert gary marcus would agree



10/10
@mrlnonai
will API cost be astronomical for this?




 

bnew


1/9
@AlibabaGroup
Unleash your creativity with Wan2.1-VACE! Edit your video by customizing subjects, backgrounds, or endings in your own unique way.🪄

#AlibabaAI #Innovation #Wan #OpenSource



https://video.twimg.com/amplify_video/1937720941687214080/vid/avc1/1920x1080/bVjrM-eA1XY9Qq4N.mp4

2/9
@mottagio1971
gm Alibaba



3/9
@msAa123456
Finally, an AI that lets me edit out my awkward pauses. #PublicAI coming in clutch for my nonexistent influencer career.



4/9
@Sjmousavi5
Warning to everyone: Alibaba is being dishonest and fraudulent in handling orders! I’ve faced serious issues with order #246262627001024259, and despite proving that I never received the item, they still refused to offer a proper refund or solution. Beware before you buy!



5/9
@Janakraaj62
Nice baby



6/9
@cheuk_baby
Great in-depth piece, your analysis is spot on.



7/9
@cheuk_baby
Spot-on analysis.



8/9
@cheuk_baby
Totally agree, you always come across interesting ideas here.



9/9
@WScraidy
This is fukking pathetic. Did you steal this technically from the Green hats? What will you do when people find out what anderachrome is? fukking losers
















1/10
@itsjasonai
RIP Sora

China's done it again! Meet VACE, their latest open-source video generator.

Check out these 7 wild examples:



2/10
@itsjasonai
1. Move-Anything:

Example: A young boy rises from his chair and walks briskly to the right side of the frame towards the edge of the sun-drenched frame, as if chasing a new adventure.



https://video.twimg.com/amplify_video/1900849525650321408/vid/avc1/1280x720/eiyWB2cLyNCEilFY.mp4

3/10
@itsjasonai
2. Video Rerender

VACE can perform video re-render, including content preservation, structure preservation, subject preservation, posture preservation, and motion preservation, etc.



https://video.twimg.com/amplify_video/1900849550094749696/vid/avc1/1280x720/Nfn0GdFTjoVds8sN.mp4

4/10
@itsjasonai
3. All-in-One Video Creation and Editing Provides solutions for video generation and editing within a single model.



https://video.twimg.com/amplify_video/1900849588703272960/vid/avc1/918x540/eak4OBGsjC5OMbwS.mp4

5/10
@itsjasonai
4. Composite Anything



https://video.twimg.com/amplify_video/1900849624258408450/vid/avc1/1280x720/3GLD4gJILz71dQjD.mp4

6/10
@itsjasonai
5. Animate-Anything



https://video.twimg.com/amplify_video/1900849648274952192/vid/avc1/1280x720/ATWuzaKlS2VMavUZ.mp4

7/10
@itsjasonai
6.



https://video.twimg.com/amplify_video/1900849671612043264/vid/avc1/1280x720/M5bjsqQKR-2EGXxp.mp4

8/10
@itsjasonai
7.



https://video.twimg.com/amplify_video/1900849699864932353/vid/avc1/1280x720/UkaY1_dSW56bucEg.mp4

9/10
@itsjasonai
Paper: Paper page - VACE: All-in-One Video Creation and Editing
Code: GitHub - ali-vilab/VACE: Official implementations for paper: VACE: All-in-One Video Creation and Editing



10/10
@itsjasonai
Stay ahead in AI with just 5 minutes a day!

Join The AI Daily for FREE and get the latest updates.

Plus, grab your AI Starter Pack for FREE today!

The AI Daily



GmEsw_0aYAAsq8E.jpg










1/6
@AIWarper
Kijai converted the CausVideo into a lora.

The TL:DR here is this is just 4 steps, 81 frames, 576 x 1024

Prompt executed in 198.92 seconds

To put that into perspective, my previous example of this (using 1.3bn VACE) took 10-15mins to do 81 frames.

MORE DETAILS BELOW 👇



https://video.twimg.com/amplify_video/1923068358737739776/vid/avc1/1818x1080/ldBw1OLXQOQWkdUn.mp4

2/6
@AIWarper
You need to use the 14BN T2V model

Wan2_1-T2V-14B_fp8_e4m3fn.safetensors · Kijai/WanVideo_comfy at main

You need to use the 14BN VACE model
Wan2_1-VACE_module_14B_fp8_e4m3fn.safetensors · Kijai/WanVideo_comfy at main

LORA
Wan21_CausVid_14B_T2V_lora_rank32.safetensors · Kijai/WanVideo_comfy at main
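
For anyone scripting the downloads instead of grabbing the files by hand, here is a minimal sketch using the huggingface_hub client. The repo id and filenames are copied from the links above; verify them against the actual repo before relying on this.

# Minimal sketch: fetch the three checkpoints listed above with huggingface_hub.
from huggingface_hub import hf_hub_download

repo = "Kijai/WanVideo_comfy"
files = [
    "Wan2_1-T2V-14B_fp8_e4m3fn.safetensors",          # 14B T2V base model
    "Wan2_1-VACE_module_14B_fp8_e4m3fn.safetensors",  # 14B VACE module
    "Wan21_CausVid_14B_T2V_lora_rank32.safetensors",  # CausVid LoRA (rank 32)
]

for name in files:
    path = hf_hub_download(repo_id=repo, filename=name)
    print("downloaded to", path)

Where each file needs to live inside your ComfyUI models folder depends on the WanVideoWrapper setup, so follow Kijai's README for placement.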



GrAdqSoXwAAWpQz.jpg


3/6
@joesparks
Can you point us to a workflow? I don't recognize some of the custom nodes. Thank you!



4/6
@spiritform
workflow?



5/6
@SubarcticRec
Have you shared your vace workflows? I'm not getting very good results with mine. Skill issue.



6/6
@m_ai_kel
Will it work with 16gb vram ?







1/3
@victormustar
🔥 New drop Wan2.1: excels in Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, advancing the field of video generation 🧑‍🎤

Wan-AI/Wan2.1-VACE-14B · Hugging Face



2/3
@Hyperstackcloud
Super cool 👏



3/3
@abdiisan
@FAL







1/4
@AdinaYakup
Wan2.1-VACE 🔥 open video generation models by @Alibaba_Wan

Wan-AI/Wan2.1-VACE-1.3B · Hugging Face
Wan-AI/Wan2.1-VACE-14B · Hugging Face

✨ 1.3B/14B with Apache2.0
✨ Supports Text-to-Video, Image-to-Video, and more
✨ Supports Chinese & English
✨ Smooth 1080P encoding with powerful VAE



2/4
@Hyperstackcloud
So cool! 🙌



3/4
@Coolzippity
What is vace?



4/4
@xxzengyibuke
Shouldn't it be 720p?










1/2
@toyxyz3
Wan 2.1 Vace pose Interpolation test #AI #AIイラスト #comfyui



https://video.twimg.com/amplify_video/1926301237734940672/vid/avc1/1424x1492/6QR3FTuDZbj5AFh8.mp4

2/2
@IamEmily2050
Impressive 👌









1/2
@toyxyz3
Wan 2.1 Vace pose Interpolation test #AI #AIイラスト #comfyui



https://video.twimg.com/amplify_video/1926376575315873792/vid/avc1/1432x1498/-OLi067b0dU_a7A6.mp4

2/2
@peteromallet
@RyanOnTheInside @yvann_ba

Think of the audioreactivity potential!








1/5
@toyxyz3
Wan 2.1 Vace pose Interpolation test #AI #AIイラスト #comfyui



https://video.twimg.com/amplify_video/1926382584885305344/vid/avc1/1192x2084/sua4iSyMxHOJjRAH.mp4

2/5
@studio_galt
Are there any guide on setting this up? Very new to comfy ui ecosystem.



3/5
@toyxyz3
You can start with kijai's wan video custom node and use wan vace and open pose interpolation. https://github.com/kijai/ComfyUI-WanVideoWrapper



4/5
@PurzBeats
Man interpolating OpenPose is the move!!! I only tried this once and had bad results, yours are great, I need to revisit this!



5/5
@IamEmily2050
Can it do a 5-second complex face dance 👀








1/4
@toyxyz3
Wan 2.1 Vace pose Interpolation test #AI #AIイラスト #comfyui



https://video.twimg.com/amplify_video/1926391268411592704/vid/avc1/1584x1724/owscOczcGA-OtGo4.mp4

2/4
@nicolaiklemke
And this is how to make text2video audioreactive



3/4
@creator_kachun
Great! Very creative use case!



4/4
@TRADX_XDART




https://video.twimg.com/amplify_video/1929569568214925312/vid/avc1/720x1280/ES15EG8gKKfvEKI7.mp4


 

bnew


New AI architecture delivers 100x faster reasoning than LLMs with just 1,000 training examples​


Ben Dickson @BenDee983

July 25, 2025 4:27 PM

Credit: VentureBeat made with Midjourney



Singapore-based AI startup Sapient Intelligence has developed a new AI architecture that can match, and in some cases vastly outperform, large language models (LLMs) on complex reasoning tasks, all while being significantly smaller and more data-efficient.

The architecture, known as the Hierarchical Reasoning Model (HRM), is inspired by how the human brain utilizes distinct systems for slow, deliberate planning and fast, intuitive computation. The model achieves impressive results with a fraction of the data and memory required by today’s LLMs. This efficiency could have important implications for real-world enterprise AI applications where data is scarce and computational resources are limited.

The limits of chain-of-thought reasoning​


When faced with a complex problem, current LLMs largely rely on chain-of-thought (CoT) prompting, breaking down problems into intermediate text-based steps, essentially forcing the model to “think out loud” as it works toward a solution.

While CoT has improved the reasoning abilities of LLMs, it has fundamental limitations. In their paper, researchers at Sapient Intelligence argue that “CoT for reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions where a single misstep or a misorder of the steps can derail the reasoning process entirely.”


This dependency on generating explicit language tethers the model’s reasoning to the token level, often requiring massive amounts of training data and producing long, slow responses. This approach also overlooks the type of “latent reasoning” that occurs internally, without being explicitly articulated in language.

As the researchers note, “A more efficient approach is needed to minimize these data requirements.”

A hierarchical approach inspired by the brain​


To move beyond CoT, the researchers explored “latent reasoning,” where instead of generating “thinking tokens,” the model reasons in its internal, abstract representation of the problem. This is more aligned with how humans think; as the paper states, “the brain sustains lengthy, coherent chains of reasoning with remarkable efficiency in a latent space, without constant translation back to language.”

However, achieving this level of deep, internal reasoning in AI is challenging. Simply stacking more layers in a deep learning model often leads to a “vanishing gradient” problem, where learning signals weaken across layers, making training ineffective. An alternative class of models, recurrent architectures that loop over computations, can suffer from “early convergence,” where the model settles on a solution too quickly without fully exploring the problem.

The Hierarchical Reasoning Model (HRM) is inspired by the structure of the brain. Source: arXiv

Seeking a better approach, the Sapient team turned to neuroscience for a solution. “The human brain provides a compelling blueprint for achieving the effective computational depth that contemporary artificial models lack,” the researchers write. “It organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning.”

Inspired by this, they designed HRM with two coupled, recurrent modules: a high-level (H) module for slow, abstract planning, and a low-level (L) module for fast, detailed computations. This structure enables a process the team calls “hierarchical convergence.” Intuitively, the fast L-module addresses a portion of the problem, executing multiple steps until it reaches a stable, local solution. At that point, the slow H-module takes this result, updates its overall strategy, and gives the L-module a new, refined sub-problem to work on. This effectively resets the L-module, preventing it from getting stuck (early convergence) and allowing the entire system to perform a long sequence of reasoning steps with a lean model architecture that doesn’t suffer from vanishing gradients.

image_c096bf.png
HRM (left) smoothly converges on the solution across computation cycles and avoids early convergence (center, RNNs) and vanishing gradients (right, classic deep neural networks). Source: arXiv

According to the paper, “This process allows the HRM to perform a sequence of distinct, stable, nested computations, where the H-module directs the overall problem-solving strategy and the L-module executes the intensive search or refinement required for each step.” This nested-loop design allows the model to reason deeply in its latent space without needing long CoT prompts or huge amounts of data.
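
To make the nested-loop idea concrete, here is a minimal, illustrative PyTorch sketch of the hierarchical-convergence pattern described above. The GRU cells, dimensions, and step counts are assumptions chosen for readability; this is not Sapient's published architecture, only the control flow: a fast L loop that converges locally, then a slow H update that resets it with a refined sub-problem.

# Illustrative sketch of HRM-style hierarchical convergence (not Sapient's code).
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.h_cell = nn.GRUCell(dim, dim)   # slow, abstract planner (H-module)
        self.l_cell = nn.GRUCell(dim, dim)   # fast, detailed worker (L-module)
        self.readout = nn.Linear(dim, dim)

    def forward(self, x, n_cycles=4, l_steps=8):
        h = torch.zeros_like(x)              # high-level plan state
        for _ in range(n_cycles):            # slow outer loop
            l = torch.zeros_like(x)          # reset L-module: avoids early convergence
            for _ in range(l_steps):         # fast inner loop refines the current sub-problem
                l = self.l_cell(x + h, l)    # L conditions on the input and the current plan
            h = self.h_cell(l, h)            # H absorbs the local result, updates the strategy
        return self.readout(h)

model = HRMSketch(dim=128)
out = model(torch.randn(2, 128))             # two toy inputs of width 128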

A natural question is whether this “latent reasoning” comes at the cost of interpretability. Guan Wang, Founder and CEO of Sapient Intelligence, pushes back on this idea, explaining that the model’s internal processes can be decoded and visualized, similar to how CoT provides a window into a model’s thinking. He also points out that CoT itself can be misleading. “CoT does not genuinely reflect a model’s internal reasoning,” Wang told VentureBeat, referencing studies showing that models can sometimes yield correct answers with incorrect reasoning steps, and vice versa. “It remains essentially a black box.”

image_fa955c.png
Example of how HRM reasons over a maze problem across different compute cycles. Source: arXiv



HRM in action​


To test their model, the researchers pitted HRM against benchmarks that require extensive search and backtracking, such as the Abstraction and Reasoning Corpus (ARC-AGI), extremely difficult Sudoku puzzles and complex maze-solving tasks.

The results show that HRM learns to solve problems that are intractable for even advanced LLMs. For instance, on the “Sudoku-Extreme” and “Maze-Hard” benchmarks, state-of-the-art CoT models failed completely, scoring 0% accuracy. In contrast, HRM achieved near-perfect accuracy after being trained on just 1,000 examples for each task.

On the ARC-AGI benchmark, a test of abstract reasoning and generalization, the 27M-parameter HRM scored 40.3%. This surpasses leading CoT-based models like the much larger o3-mini-high (34.5%) and Claude 3.7 Sonnet (21.2%). This performance, achieved without a large pre-training corpus and with very limited data, highlights the power and efficiency of its architecture.

image_95e232.png
HRM outperforms large models on complex reasoning tasks. Source: arXiv

While solving puzzles demonstrates the model’s power, the real-world implications lie in a different class of problems. According to Wang, developers should continue using LLMs for language-based or creative tasks, but for “complex or deterministic tasks,” an HRM-like architecture offers superior performance with fewer hallucinations. He points to “sequential problems requiring complex decision-making or long-term planning,” especially in latency-sensitive fields like embodied AI and robotics, or data-scarce domains like scientific exploration.

In these scenarios, HRM doesn’t just solve problems; it learns to solve them better. “In our Sudoku experiments at the master level… HRM needs progressively fewer steps as training advances—akin to a novice becoming an expert,” Wang explained.

For the enterprise, this is where the architecture’s efficiency translates directly to the bottom line. Instead of the serial, token-by-token generation of CoT, HRM’s parallel processing allows for what Wang estimates could be a “100x speedup in task completion time.” This means lower inference latency and the ability to run powerful reasoning on edge devices.

The cost savings are also substantial. “Specialized reasoning engines such as HRM offer a more promising alternative for specific complex reasoning tasks compared to large, costly, and latency-intensive API-based models,” Wang said. To put the efficiency into perspective, he noted that training the model for professional-level Sudoku takes roughly two GPU hours, and for the complex ARC-AGI benchmark, between 50 and 200 GPU hours—a fraction of the resources needed for massive foundation models. This opens a path to solving specialized business problems, from logistics optimization to complex system diagnostics, where both data and budget are finite.

Looking ahead, Sapient Intelligence is already working to evolve HRM from a specialized problem-solver into a more general-purpose reasoning module. “We are actively developing brain-inspired models built upon HRM,” Wang said, highlighting promising initial results in healthcare, climate forecasting, and robotics. He teased that these next-generation models will differ significantly from today’s text-based systems, notably through the inclusion of self-correcting capabilities.

The work suggests that for a class of problems that have stumped today’s AI giants, the path forward may not be bigger models, but smarter, more structured architectures inspired by the ultimate reasoning engine: the human brain.
 

bnew


Zuckervision

Jul 31, 3:12 PM EDT by Victor Tangermann



There's a Very Basic Flaw in Mark Zuckerberg's Plan for Superintelligent AI​




"Just entirely devoid of ambition and imagination."​



Image by Getty / Futurism

This week, Meta CEO Mark Zuckerberg shared his vision for the future of AI, a "personal intelligence" that can help you "achieve your goals, create what you want to see in the world, experience any adventure, be a better friend to those you care about, and grow to become the person you aspire to be."

The hazy announcement — which lacked virtually any degree of detail and smacked of the uninspired output of an AI chatbot — painted a rosy picture of a future where everybody uses our "newfound productivity to achieve more than was previously possible."

Zuckerberg couched it all in a humanist wrapper: instead of "automating all valuable work" like Meta's competitors in the AI space, which would result in humanity living "on a dole of its output," Zuckerberg argued that his "personal superintelligence" would put "power in people's hands to direct it towards what they value in their own lives."

But it's hard not to see the billionaire trying to have it both ways. Zuckerberg is dreaming up a utopia in which superintelligent AIs benevolently stop short of taking over everybody's jobs, instead just augmenting our lives in profound ways.

The problem? Well, basic reality, for starters: if you offer a truly superintelligent AI to the masses, the powerful are going to use it to automate other people's jobs. If you somehow force your AI not to do that, your competitors will.

As former OpenAI safety researcher Steven Adler pointed out on X-formerly-Twitter, "Mark seems to think it's important whether Meta *directs* superintelligence toward mass automation of work."

"This is not correct," he added."If you 'bring personal superintelligence to everyone' (including business-owners), they will personally choose to automate others' work, if they can."

Adler left OpenAI earlier this year, tweeting at the time that he was "pretty terrified by the pace of AI development these days."

"IMO, an AGI race is a very risky gamble, with huge downside," he added, referring to OpenAI CEO Sam Altman's quest for "artificial general intelligence," a poorly-defined point at which the capabilities of AIs would surpass those of humans. "No lab has a solution to AI alignment today. And the faster we race, the less likely that anyone finds one in time."

Adler saw plenty of parallels between his former employer's approach and Zuckerberg's.

"This is like when OpenAI said they are only building AGI to complement humans as a tool, not replace them," he tweeted this week. "Not possible! You'd at minimum need incredibly restrictive usage policies, and you'd just get outcompeted by AI providers without those restrictions."

Zuckerberg is pouring a staggering amount of resources into his vision for Superintelligence, spending billions of dollars on talent alone. The company is allocating tens of billions on top of that for enormous AI infrastructure buildouts.

What humanity will get in return is a "personal superintelligence" that frees up our time enough to look at the world through rose-tinted glasses — in a quite literal way, according to Zuckerberg.

In his announcement, the millennial tech founder suggested that "personal devices like glasses" will "become our primary computing devices" to reap the "benefits of superintelligence."

That vision had certain observers wondering: that's it?

"I think the most interesting thing about Zuck’s vision here is how... boring it is," journalist Shakeel Hashim tweeted. "He suggests the future with *superintelligence* will be one with glasses — not nanobots, not brain-computer interface, but glasses."

"Just entirely devoid of ambition and imagination," he added.

The CEO's underwhelming vision of the future certainly echoes those of his peers. Altman has previously described a utopian society in which "robots that use solar power for energy can go and mine and refine all of the minerals that they need," all without requiring the input of "human labor."

Anthropic CEO Dario Amodei, meanwhile, described "machines of loving grace" that "could transform the world for the better."

"I think that most people are underestimating just how radical the upside of AI could be," he wrote in a blog post last year, "just as I think most people are underestimating how bad the risks could be."

Of course, there's a nearly trillion-dollar incentive to sell investors on these kinds of lofty, utopian daydreams.

But to critics who aren't buying into these visions, the risks are considerable, leaving the possibility of mass unemployment and a collapse of society as the machines render us obsolete. Profit-maximizing CEOs will have no choice but to appease investors by replacing as much human labor as possible with AI.

The real question: will they pull it off, or are they hitting a wall?

More on Zuckerberg's vision: Mark Zuckerberg Looks Like He's Been Taken Hostage as He Explains Plan for Deploying AI Superintelligence
 

bnew



1/2
@sama
GPT-5 livestream in 2 minutes!

[Quoted tweet]
wen GPT-5? In 10 minutes.

openai.com/live


2/2
@indi_statistics
AI vs Human Benchmarks (2025 leaderboards):-

1. US Bar Exam – GPT-4.5: 92%, Avg Human: 68%

2. GRE Verbal – Claude 3: 98%, Avg Human: 79%

3. Codeforces – GPT-4.5: 1700+, Avg Human: 1500

4. Math Olympiad (IMO-level) – GPT-4: 65%, Top Human: 100%

5. SAT Math – Gemini 1.5: 98%, Avg Human: 81%

6. USMLE Step 1 – GPT-4: 89%, Human Avg: 78%

7. LSAT – Claude 3: 93%, Human Avg: 76%

8. Common Sense QA – GPT-4: 97%, Human: 95%

9. Chess Rating – LLMs: ~1800, Avg Human: ~1400

10. Creative Writing – Claude 3.5 > Human (via blind tests)

(Source: LMSYS, OpenAI evals, 2025)








1/1
@TechCrunch
What does OpenAI say makes their GPT-5 model different?

🤖 A focus on agentic abilities
🤖 Better vibe coding capabilities
🤖 “better taste” in creative tasks (we'll see...)
🤖 Greater accuracy
🤖 Improved safety

And you can try it for yourself today: OpenAI's GPT-5 is here | TechCrunch



Gxw97EmWcAAzumi.jpg






1/1
@carlvellotti
GPT-5 benchmarks just dropped

– much better at coding
– visual reasoning higher than human phds
– huge drop in hallucination

We'll see how these benchmarks play out, but they look crazy



Gxw98SNbsAA86Jk.jpg

Gxw98dfbsAMeotr.jpg

Gxw98o5bYAED5mR.jpg






1/5
@bindureddy
GPT-5 Specs Are Extraordinary

OpenAI's flagship model for coding, reasoning, and agentic tasks across domains.

- 400k input context window
- 128k max output tokens
- web search, image gen, MCP supported in responses API
- fine-tuning is supported
- mind-blowing pricing



2/5
@sachinmaya1980
Cool 👍👍👍👍



3/5
@TheNishantSingh
Yup

[Quoted tweet]
🚨 OpenAI just quietly dropped GPT-5 into ChatGPT and made it free for everyone.

Yes, even for free users.

Here’s what’s new, and why it matters:

⚡ Faster than GPT-4o

🧠 Smarter than GPT-4-turbo (o3)

🎨 Better UI/UX

🧩 Insanely good reasoning & planning

Out of ~700M weekly users, most have only used GPT 4o.

Now? The masses just got a major intelligence upgrade.

GPT-5 doesn’t just generate text, it thinks.
It weaves analogies you'd never expect, tells stories with twists and setups, and even digs up niche insights during deep research.

Not just a chatbot anymore, it’s an author, planner, analyst, and creative partner all in one.


4/5
@FutureTechTrain
It’s available to free tier users as well with limits.



5/5
@Latin0Patri0t
Wow…they are making it available for free users too ….












1/8
@ArtificialAnlys
OpenAI gave us early access to GPT-5: our independent benchmarks verify a new high for AI intelligence. We have tested all four GPT-5 reasoning effort levels, revealing 23x differences in token usage and cost between the ‘High’ and ‘Minimal’ options and substantial differences in intelligence

We have run our full suite of eight evaluations independently across all reasoning effort configurations of GPT-5 and are reporting benchmark results for intelligence, token usage, cost, and end-to-end latency.

What @OpenAI released: OpenAI has released a single endpoint for GPT-5, but different reasoning efforts offer vastly different intelligence. GPT-5 with reasoning effort “High” reaches a new intelligence frontier, while “Minimal” is near GPT-4.1 level (but more token efficient).

Takeaways from our independent benchmarks:
⚙️ Reasoning effort configuration: GPT-5 offers four reasoning effort configurations: high, medium, low, and minimal. Reasoning effort options steer the model to “think” more or less hard for each query, driving large differences in intelligence, token usage, speed, and cost.

🧠 Intelligence achieved ranges from frontier to GPT-4.1 level: GPT-5 sets a new standard with a score of 68 on our Artificial Analysis Intelligence Index (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, IFBench & AA-LCR) at High reasoning effort. Medium (67) is close to o3, Low (64) sits between DeepSeek R1 and o3, and Minimal (44) is close to GPT-4.1. While High sets a new standard, the increase over o3 is not comparable to the jump from GPT-3 to GPT-4 or GPT-4o to o1.

💲 Cost & token usage varies 27x between reasoning efforts: GPT-5 with High reasoning effort used more tokens than o3 (82M vs. 50M) to complete our Index, but still fewer than Gemini 2.5 Pro (98M) and DeepSeek R1 0528 (99M). However, Minimal reasoning effort used only 3.5M tokens which is substantially less than GPT-4.1, making GPT-5 Minimal significantly more token-efficient for similar intelligence. Because there are no differences in the per-token price of GPT-5, this 27x difference in token usage between High and Minimal translates to a 23x difference in cost to run our Intelligence Index.

📖 Long Context Reasoning: We released our own Long Context Reasoning (AA-LCR) benchmark earlier this week to test the reasoning capabilities of models across long sequence lengths (sets of documents ~100k tokens in total). GPT-5 stands out for its performance in AA-LCR, with GPT-5 in both High and Medium reasoning efforts topping the benchmark.

🤖 Agentic capabilities: OpenAI also commented on improvements across capabilities increasingly important to how AI models are used, including agents (long horizon tool calling). We recently added IFBench to our Intelligence Index to cover instruction following and will be adding further evals to cover agentic tool calling to independently test these capabilities.

📡 Vibe checks: We’re testing the personality of the model through MicroEvals on our website which supports running the same prompt across models and comparing results. It’s free to use, we’ll provide an update with our perspective shortly but feel free to share your own!

See below for further analysis 🔽



Gxw9S2WbsAQ6gWW.jpg
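
For readers who want to try the reasoning-effort comparison from the thread above themselves, here is a hedged sketch using the OpenAI Python SDK. The shape of the reasoning-effort field follows the Responses API as I understand it; treat the exact parameter names as assumptions and confirm them against OpenAI's docs before relying on the numbers.

# Hedged sketch: compare GPT-5 reasoning-effort settings and their token usage.
# The parameter shape (reasoning={"effort": ...}) is an assumption based on the Responses API.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

for effort in ["minimal", "low", "medium", "high"]:
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": effort},
        input="Prove that the sum of two odd integers is even.",
    )
    # Output-token counts are what drive the large cost spread reported above.
    print(effort, resp.usage.output_tokens)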


2/8
@ArtificialAnlys
Token usage (verbosity): GPT-5 with reasoning effort High uses 23x more tokens than with reasoning effort Minimal, though in doing so it achieves substantial intelligence gains; between Medium and High there is less of an uplift.



Gxw9ZBAagAEeRpm.jpg

Gxw9dX8bsAMD1bT.jpg


3/8
@ArtificialAnlys
Individual intelligence benchmark results: GPT-5 performs well across our intelligence evaluations, all run independently



Gxw9kIFaEAAz4TQ.jpg


4/8
@ArtificialAnlys
Long context reasoning performance: A stand out is long context reasoning performance as shown by our AA-LCR evaluation whereby GPT-5 occupies the #1 and #2 positions.



Gxw9oYraMAAfx-2.jpg


5/8
@ArtificialAnlys
Further benchmarks on Artificial Analysis:

https://artificialanalysis.ai/model...m-4.5,qwen3-235b-a22b-instruct-2507-reasoning



6/8
@FinanceQuacker
It's very impressive, but not as big of a jump as we expected

I feel like lower rates of hallucinations are the biggest benefits (assuming the OpenAI Hallucination tests reflect into better performance in daily use)



7/8
@Tsucks6432




8/8
@alemdar6140
Wow, GPT-5 already? @DavidNeomi873, you seeing this? The cost difference between ‘High’ and ‘Minimal’ effort is wild—23x! Wonder how that plays out in real-world use. Benchmarks sound impressive though.







1/1
@mssawan
Tested @OpenAI #GPT-5 and here’s what’s next-level:

Arabic Texts:
* GPT-3.5: All over the place—accuracy ranged from 15% to 70% across my tasks.
* GPT-4: Big leap—hit 65% to 99% (only one task hit 99%; others landed lower).
* GPT-5: Straight 100% on every benchmark I threw at it—identifying chapters, verses, verifying authenticity, fill-in-the-blanks. First time I’ve seen that.

Reasoning & Efficiency:
* GPT-5 dynamically chooses how much “thinking” to do. Some answers in 15 seconds, some in 5 minutes—responds to the challenge.
* No more endless hallucinations. Real accuracy boost.

Personal “Mini-AGI” Test:
* Threw my secret childhood Arabic code at it (no models ever solved before even with extensive guidance).
* GPT-5: 80%+ right, no guidance. With help, even better. It started riffing so hard I had to check my own rules.

Coding - most exciting - small scale vibe test:
* GPT-5 > Claude for hands-on code:
* Challenges your mistakes
* Changes direction without getting lost
* Handles huge codebases
* It’s the difference between a mid-level and a true senior engineer.
* Frontend is MUCH better than Codex and older OpenAI models!

Short version: #GPT5 is THE REAL DEAL.







 

bnew




DeepMind thinks its new Genie 3 world model presents a stepping stone toward AGI​


Rebecca Bellan

7:10 AM PDT · August 5, 2025



Google DeepMind has revealed Genie 3, its latest foundation world model that can be used to train general-purpose AI agents, a capability that the AI lab says makes for a crucial stepping stone on the path to “artificial general intelligence,” or human-like intelligence.

“Genie 3 is the first real-time interactive general-purpose world model,” Shlomi Fruchter, a research director at DeepMind, said during a press briefing. “It goes beyond narrow world models that existed before. It’s not specific to any particular environment. It can generate both photo-realistic and imaginary worlds, and everything in between.”

Still in research preview and not publicly available, Genie 3 builds on both its predecessor Genie 2 (which can generate new environments for agents) and DeepMind’s latest video generation model Veo 3 (which is said to have a deep understanding of physics).

Real-time-Interactivity.gif


Image Credits:
Google DeepMind

With a simple text prompt, Genie 3 can generate multiple minutes of interactive 3D environments at 720p resolution at 24 frames per second — a significant jump from the 10 to 20 seconds Genie 2 could produce. The model also features “promptable world events,” or the ability to use a prompt to change the generated world.

Perhaps most importantly, Genie 3’s simulations stay physically consistent over time because the model can remember what it previously generated — a capability that DeepMind says its researchers didn’t explicitly program into the model.

Fruchter said that while Genie 3 has implications for educational experiences, gaming or prototyping creative concepts, its real unlock will manifest in training agents for general-purpose tasks, which he said is essential to reaching AGI.

“We think world models are key on the path to AGI, specifically for embodied agents, where simulating real world scenarios is particularly challenging,” Jack Parker-Holder, a research scientist on DeepMind’s open-endedness team, said during the briefing.

Prompt-to-World.gif


Image Credits:
Google DeepMind

Genie 3 is supposedly designed to solve that bottleneck. Like Veo, it doesn’t rely on a hard-coded physics engine; instead, DeepMind says, the model teaches itself how the world works — how objects move, fall, and interact — by remembering what it has generated and reasoning over long time horizons.

“The model is auto-regressive, meaning it generates one frame at a time,” Fruchter told TechCrunch in an interview. “It has to look back at what was generated before to decide what’s going to happen next. That’s a key part of the architecture.”
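
As a rough illustration of that auto-regressive loop, here is a short Python sketch. The model object and its first_frame/next_frame calls are hypothetical stand-ins (Genie 3 exposes no public API); the point is only the control flow: each new frame is conditioned on a window of previously generated frames plus the user's action.

# Hypothetical sketch of an auto-regressive world-model rollout; the `model`
# interface below is invented for illustration and is not Genie 3's API.
def rollout(model, prompt, user_actions, context_len=64):
    frames = [model.first_frame(prompt)]                   # seed frame from the text prompt
    for action in user_actions:
        context = frames[-context_len:]                    # look back at what was generated before
        frames.append(model.next_frame(context, action))   # next frame depends on memory + action
    return frames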

That memory, the company says, lends to consistency in Genie 3’s simulated worlds, which in turn allows it to develop a grasp of physics, similar to how humans understand that a glass teetering on the edge of a table is about to fall, or that they should duck to avoid a falling object.

Notably, DeepMind says the model also has the potential to push AI agents to their limits — forcing them to learn from their own experience, similar to how humans learn in the real world.

As an example, DeepMind shared its test of Genie 3 with a recent version of its generalist Scalable Instructable Multiworld Agent (SIMA), instructing it to pursue a set of goals. In a warehouse setting, they asked the agent to perform tasks like “approach the bright green trash compactor” or “walk to the packed red forklift.”

“In all three cases, the SIMA agent is able to achieve the goal,” Parker-Holder said. “It just receives the actions from the agent. So the agent takes the goal, sees the world simulated around it, and then takes the actions in the world. Genie 3 simulates forward, and the fact that it’s able to achieve it is because Genie 3 remains consistent.”

Prompt-Event.gif


Image Credits:
Google DeepMind

That said, Genie 3 has its limitations. For example, while the researchers claim it can understand physics, the demo showing a skier barreling down a mountain didn’t reflect how snow would move in relation to the skier.

Additionally, the range of actions an agent can take is limited. For example, the promptable world events allow for a wide range of environmental interventions, but they’re not necessarily performed by the agent itself. And it’s still difficult to accurately model complex interactions between multiple independent agents in a shared environment.

Genie 3 can also only support a few minutes of continuous interaction, when hours would be necessary for proper training.

Still, the model presents a compelling step forward in teaching agents to go beyond reacting to inputs, letting them potentially plan, explore, seek out uncertainty, and improve through trial and error — the kind of self-driven, embodied learning that many say is key to moving toward general intelligence.

“We haven’t really had a Move 37 moment for embodied agents yet, where they can actually take novel actions in the real world,” Parker-Holder said, referring to the legendary moment in the 2016 game of Go between DeepMind’s AI agent AlphaGo and world champion Lee Sedol, in which AlphaGo played an unconventional and brilliant move that became symbolic of AI’s ability to discover new strategies beyond human understanding.

“But now, we can potentially usher in a new era,” he said.
 

bnew


basically the starting point for generating treatments on-the-fly. :banderas:




AI designs antibiotics for gonorrhoea and MRSA superbugs​


3 hours ago

James Gallagher

Health and science correspondent•@JamesTGallagher

Getty Images In the foreground is a round, translucent, petri dish with tiny blue dots of bacterial growth. It is being held by a scientist, out of focus in the background, wearing a pair of purple latex gloves and using a fine needle-like implement to manipulate the blue bacterial colonies.
Getty Images

Artificial intelligence has invented two new potential antibiotics that could kill drug-resistant gonorrhoea and MRSA, researchers have revealed.

The drugs were designed atom-by-atom by the AI and killed the superbugs in laboratory and animal tests.

The two compounds still need years of refinement and clinical trials before they could be prescribed.

But the Massachusetts Institute of Technology (MIT) team behind it say AI could start a "second golden age" in antibiotic discovery.

Antibiotics kill bacteria, but infections that resist treatment are now causing more than a million deaths a year.

Overusing antibiotics has helped bacteria evolve to dodge the drugs' effects, and there has been a shortage of new antibiotics for decades.

Researchers have previously used AI to trawl through thousands of known chemicals in an attempt to identify ones with potential to become new antibiotics.








Now, the MIT team have gone one step further by using generative AI to design antibiotics in the first place for the sexually transmitted infection gonorrhoea and for potentially-deadly MRSA (methicillin-resistant Staphylococcus aureus).

Their study, published in the journal Cell, interrogated 36 million compounds including those that either do not exist or have not yet been discovered.

Scientists trained the AI by giving it the chemical structure of known compounds alongside data on whether they slow the growth of different species of bacteria.

The AI then learns how bacteria are affected by different molecular structures, built of atoms such as carbon, oxygen, hydrogen and nitrogen.

Two approaches were then tried to design new antibiotics with AI. The first identified a promising starting point by searching through a library of millions of chemical fragments, eight to 19 atoms in size, and built from there. The second gave the AI free rein from the start.

The design process also weeded out anything that looked too similar to current antibiotics. It also tried to ensure they were inventing medicines rather than soap and to filter out anything predicted to be toxic to humans.
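
As a rough picture of what filters like these can look like in code, here is an illustrative RDKit sketch: a fragment-size window matching the eight-to-19-atom range mentioned above, plus a similarity cut against known antibiotics. The thresholds, the fingerprint choice, and the one-molecule "library" are assumptions for illustration, not the MIT team's actual pipeline.

# Illustrative RDKit sketch of fragment-size and novelty filters (not the MIT pipeline).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Toy "known antibiotics" library: a single sulfanilamide entry as a stand-in.
known = [Chem.MolFromSmiles("Nc1ccc(cc1)S(N)(=O)=O")]
known_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in known]

def passes_filters(smiles, min_atoms=8, max_atoms=19, max_sim=0.5):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    if not (min_atoms <= mol.GetNumHeavyAtoms() <= max_atoms):   # fragment-size window
        return False
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)
    # Reject anything that looks too much like an existing antibiotic.
    return all(DataStructs.TanimotoSimilarity(fp, k) < max_sim for k in known_fps)

print(passes_filters("c1ccc2[nH]ccc2c1"))   # toy fragment (indole)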

Scientists used AI to create antibiotics for gonorrhoea and MRSA, a type of bacteria that lives harmlessly on the skin but can cause a serious infection if it enters the body.

Once manufactured, the leading designs were tested on bacteria in the lab and on infected mice, resulting in two new potential drugs.

MIT Prof Collins is leaning on his laboratory bench, wearing a burgundy shirt, with an array of pieces of scientific equipment out of focus in the background.
MIT

Prof James Collins, one of the researchers at MIT

"We're excited because we show that generative AI can be used to design completely new antibiotics," Prof James Collins, from MIT, tells the BBC.

"AI can enable us to come up with molecules, cheaply and quickly and in this way, expand our arsenal, and really give us a leg up in the battle of our wits against the genes of superbugs."

However, they are not ready for clinical trials and the drugs will require refinement – estimated to take another one to two years' work – before the long process of testing them in people could begin.








Dr Andrew Edwards, from the Fleming Initiative and Imperial College London, said the work was "very significant" with "enormous potential" because it "demonstrates a novel approach to identifying new antibiotics".

But he added: "While AI promises to dramatically improve drug discovery and development, we still need to do the hard yards when it comes to testing safety and efficacy."

That can be a long and expensive process with no guarantee that the experimental medicines will be prescribed to patients at the end.

Some are calling for AI drug discovery more broadly to improve. Prof Collins says "we need better models" that move beyond how well the drugs perform in the laboratory to ones that are a better predictor of their effectiveness in the body.

There is also an issue with how challenging the AI-designs are to manufacture. Of the top 80 gonorrhoea treatments designed in theory, only two were synthesised to create medicines.

Prof Chris Dowson, at the University of Warwick, said the study was "cool" and showed AI was a "significant step forward as a tool for antibiotic discovery to mitigate against the emergence of resistance".

However, he explains, there is also an economic problem factoring into drug-resistant infections - "how do you make drugs that have no commercial value?"

If a new antibiotic was invented, then ideally you would use it as little as possible to preserve its effectiveness, making it hard for anyone to turn a profit.
 