bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/18
@10X_AI_
In 2017, Google released a paper that would change the world.

Now, ChatGPT, Gemini, & Claude wouldn't exist without it.

If you want to prepare yourself for what's coming, read this.

It's the masterclass of the decade and everyone should understand how it works:

A Thread 🧵



Gs17SRBbIAAoAKk.jpg

Gs17SeebgAA6A4d.png


2/18
@10X_AI_
The AI world in 2017 was broken.

Every smart AI model had the same fatal flaw:

They read text one word at a time, like reading with a flashlight.

But 8 Google researchers were about to change everything...



3/18
@10X_AI_
Meet the team nobody talks about:

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin.

Their breakthrough was about to make every AI company rebuild from scratch.



Gs17UJkbQAA4TDc.png


4/18
@10X_AI_
The problem seemed impossible:

AI models called RNNs and LSTMs had to process words in order.

Word 1, then word 2, then word 3...

This created two bottlenecks that crippled every AI system.



Gs17VB-bkAAY4g8.jpg


5/18
@10X_AI_
Bottleneck #1: No teamwork

RNNs couldn't use multiple computers at once.

To understand word 100, they had to process words 1-99 first.

Like having 1000 workers but forcing them into single file.



Gs17V4waUAA5shs.png


6/18
@10X_AI_
Bottleneck #2: Memory loss

In a 500-word article, word #1 and word #500 were strangers.

The connection had to travel through 499 other words.

Long-range relationships, the key to understanding, were nearly impossible to capture.



https://video.twimg.com/amplify_video/1931335446623334400/vid/avc1/640x360/P6RsSsn6o8da1cyz.mp4

7/18
@10X_AI_
Then came the breakthrough:

Jakob proposed: "What if we skip the sequential reading entirely?"

Noam added the math. Ashish and Illia built it.

They called it the "Transformer."



Gs17aiXbMAAcqXp.png


8/18
@10X_AI_
The Transformer was beautifully simple:

Instead of reading word by word, it looked at ALL words at once.

Every word could connect to every other word directly.

Word #1 to word #500? One single step.



https://video.twimg.com/amplify_video/1931335526105436160/vid/avc1/640x360/ean1dCsEzXp3T-ah.mp4

9/18
@10X_AI_
The math was elegant:

Old way: Connect distant words = 500 steps
New way: Connect any words = 1 step

Old way: Use 1 computer at a time
New way: Use 1000 computers together

They solved both problems with one solution.



10/18
@10X_AI_
The results were stunning:

Translation quality (BLEU, English-German): 28.4 vs 26.3 (previous best)
Training time: 12 hours vs weeks
Training cost: 10x cheaper

But the real revolution was just starting...



11/18
@10X_AI_
2018: The AI revolution

Google built BERT using Transformers
OpenAI built GPT-1 using Transformers

Every major AI company scrambled to rebuild everything.

The "simple" architecture had won.



12/18
@10X_AI_
The genius wasn't just one thing.

It was three breakthroughs working together (a quick sketch follows the list below):

• Multi-head attention (8 parallel systems)
• Position tracking (knowing word order)
• Smart scaling (preventing math errors)
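
A minimal sketch of how those three pieces fit together, using NumPy and toy shapes. This is purely illustrative (not the thread author's code or the paper's reference implementation):

```python
# Toy scaled dot-product attention with sinusoidal positions and multiple heads.
import numpy as np

def positional_encoding(seq_len, d_model):
    """Position tracking: a sinusoidal signal tells the model the word order."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def attention(Q, K, V):
    """Every word attends to every other word in a single step.
    Dividing by sqrt(d_k) is the 'smart scaling' that keeps softmax stable."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seq_len, d_model, n_heads = 6, 16, 4
x = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)

# Multi-head attention: several attentions run in parallel on slices of the embedding.
head_dim = d_model // n_heads
heads = [attention(x[:, h*head_dim:(h+1)*head_dim],
                   x[:, h*head_dim:(h+1)*head_dim],
                   x[:, h*head_dim:(h+1)*head_dim]) for h in range(n_heads)]
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (6, 16): all positions connected in one step
```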



13/18
@10X_AI_
But here's what nobody expected:

Transformers didn't just improve AI—they enabled giant AI.

By removing bottlenecks, teams could use thousands of computers.

This made billion-parameter models possible.

GPT-3, GPT-4, ChatGPT—all became reality.



https://video.twimg.com/amplify_video/1931335637023723520/vid/avc1/1280x720/5Y-S629YcveDRxCz.mp4

14/18
@10X_AI_
Today, every AI you use runs this architecture:

ChatGPT, Claude, Gemini, DALL-E, Midjourney.

8 researchers solved AI's fundamental problem.

Sometimes the biggest breakthroughs come from radical simplification.



15/18
@10X_AI_
How to profit from understanding this:

Build niche AI apps for specific industries.
Use code assistants like GitHub Copilot.
Create SEO tools with AI analysis.
Develop specialized chatbots.

The Transformer created a $1 trillion industry.



16/18
@10X_AI_
Thanks for reading.

If you're a business owner, here's one way I can help:

I'll help you build your premium brand and get more clients through viral threads like these.

If you want results like the ones you can see below, DM me "viral" for more info.



Gs17nE4bgAArfwj.jpg


17/18
@10X_AI_
I hope you've found this thread helpful.

Follow me @10X_AI_ for more.

Like/Repost the quote below if you can:

[Quoted tweet]
In 2017, Google released a paper that would change the world.

Now, ChatGPT, Gemini, & Claude wouldn't exist without it.

If you want to prepare yourself for what's coming, read this.

It's the masterclass of the decade and everyone should understand how it works:

A Thread 🧵


Gs17SRBbIAAoAKk.jpg

Gs17SeebgAA6A4d.png


18/18
@XAFODIAN




Gs2Bn1PWgAAEBt_.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450


1/21
@arcprize
We tested every major AI reasoning system. There is no clear winner.

Accuracy goes up as you stack modern CoT techniques, but efficiency goes way down.

This gives rise to a Pareto frontier on accuracy vs. cost using ARC-AGI as a consistent measuring stick.
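
For readers unfamiliar with the term: a system sits on the Pareto frontier if no other system is both cheaper and more accurate. A minimal sketch with made-up numbers (the names and scores below are placeholders, not ARC Prize data):

```python
# Hypothetical (cost_per_task_usd, accuracy) pairs; real values live on the ARC-AGI leaderboard.
systems = {
    "model_a": (0.05, 0.21),
    "model_b": (0.40, 0.35),
    "model_c": (3.00, 0.34),
    "model_d": (20.0, 0.76),
}

def pareto_frontier(points):
    """Keep every system that no other system beats on both cost and accuracy."""
    frontier = []
    for name, (cost, acc) in points.items():
        dominated = any(c <= cost and a >= acc and (c, a) != (cost, acc)
                        for c, a in points.values())
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])

print(pareto_frontier(systems))  # ['model_a', 'model_b', 'model_d']; model_c is dominated
```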



Gssxdpya8AAjAWQ.png


2/21
@arcprize
ARC-AGI-2 is a tool for measuring breakthrough AGI capability progress.

ARC-AGI-1 is a tool for comparing AI systems and measuring efficiency.

Read the blog:
We tested every major AI reasoning system. There is no clear winner.



3/21
@tariusdamon
@grok hypothesize why this happens, iterate and suggest potential explanations



4/21
@0xLienid
pareto frontier in log-cost scale is the sort of thing nightmares are made of



5/21
@samuelwoods_
Defining that Pareto frontier with ARC-AGI as the benchmark is incredibly helpful for practical applications



6/21
@hyperknot
Shouldn't there be 2 graphs with 2 Pareto curves for the 2 versions?



7/21
@TFWNicholson
Out of interest, where do humans come on this chart? Getting people to solve it, and work out the cost against their hourly rate would be interesting



8/21
@sumonkabir_ai
Interesting findings on AI reasoning systems. Efficiency vs accuracy is a delicate balance.



9/21
@algobaker
The clear winner is o3-preview. The other models are lower cost because they simply don't keep getting better with more compute. If they did they'd cost just as much. The goal of AGI is that these systems can think for a human lifetime and keep coming up with better ideas.



10/21
@CarloIaconoWork
Did we ever get an explainer on the crazy arc 1 score for a version of o3 that was never released?



11/21
@FirasHermez
Question, and forgive my ignorance here.

For some problems, should efficiency matter at all?

Given a complex problem (e.g. find a cure for cancer), the classic measure of efficiency (read: energy and time) means very little compared to the value of the outcome



12/21
@RepresenterTh
On what basis is ARC-AGI 2 a tool for measuring AGI ? You claimed the same of ARC 1 before o3's 80%. AGI is not a bench. It is a set of foundational characteristics that a system showcases. An AGI could very much fail the puzzle.



13/21
@DodgeHealer
THE WALL @GaryMarcus



14/21
@volotat
o3 results still looks suspicious af.



15/21
@iNG921259572983
Next stage: generalization benchmarks — the most crucial test of AI’s true capabilities.



16/21
@id1hjdas
so o3 is about 80% AGI complete?



17/21
@iNG921259572983
These three are the most promising candidates.



GswzxD2WcAAzKMw.png


18/21
@iNG921259572983




GswyJALXsAE_DkA.png


19/21
@iNG921259572983
I think this line should be more accurate



GswzSaUWQAA8hzI.jpg


20/21
@fukkithonestly_
nah they pulling out the jordan peterson we are so cooked



21/21
@Aero96193997
The Y-axis scale is not proportional to the X-axis.

Literally, o4-mini can be considered quite a bit better than o3-preview.

Also, nobody tried running o4-mini on a $10 budget to see if it can actually solve as many issues.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/1
@dsn_ai_network
@nvidia Just Solved What ChatGPT Couldn’t. Here’s How.

NVIDIA’s new AI model, AceReason-Nemotron-14B, just cracked benchmarks that GPT models struggled with — and it's changing the game in math and code reasoning.

The breakthrough? Curriculum-style Reinforcement Learning (RL) — not just on mixed tasks, but targeted math-only and code-only RL.

Key Wins:

a.  Math-only RL supercharged both math and code performance.

b. Code-only RL lifted code accuracy without hurting math skills.

c. RL revealed and extended the model’s deep reasoning powers, unlocking solutions even pretraining and SFT couldn’t handle.

NVIDIA’s results show that with the right training recipe, even “unsolvable” AI tasks can be tackled.

Access the paper and code here:

Paper: AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Model on hugging face: nvidia/AceReason-Nemotron-14B · Hugging Face

#datasciencenigeria #DSNResearchbuzz



GsrXmCOWMAAdX-a.png



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450




1/21
@kepano
OpenAI is now required by court order to preserve all ChatGPT logs including "temporary chats" and API requests that would have been deleted

if I understand this correctly, it means data retention policies for apps that use OpenAI API simply cannot be honored



GsoqsVDbkAAZh6T.jpg


2/21
@kepano
this is why terms and policies are not self-guaranteeing.

a self-guaranteeing promise about privacy gives you proof that the tool cannot access your data in the first place

[Quoted tweet]
Self-guaranteeing promises

Companies break promises all the time. A self-guaranteeing promise does not require you to trust anyone. You can verify a self-guaranteeing promise yourself.

File over app is a self-guaranteeing promise. If files are in your control, in an open format, you can use those files in another app at any time. Not an export. The exact same files. It’s good practice to test this with any self-proclaimed file-over-app app you use.

Terms and policies are not self-guaranteeing. A company may promise the privacy of your data, but those policies can change at any time. Changes can retroactively affect data you have spent years putting into the tool. A self-guaranteeing promise about privacy gives you proof that the tool cannot access your data in the first place. Examples: Google, Zoom, Dropbox, Tumblr, Slack, Adobe, Figma (see links).

Encoding values into a governance structure is not self-guaranteeing. Given enough motivation, the corporate structure can be reversed. The structure is not in your hands. Example: OpenAI.

Open source alone is not self-guaranteeing. Even open source apps can rely on data that is stuck in databases or in proprietary formats that are difficult to switch away from. Open source is not a reliable safeguard against the biases of venture capital. Examples: Omnivore, Skiff.

When you choose to use a tool, the future of that tool is always ambiguous. On a long enough timeline the substrate changes. Your needs change, the underlying operating system changes, the company goes out of business or gets acquired, better options come along.

It is possible to accept the ambiguousness of a tool's future if you choose tools that make self-guaranteeing promises.


Gd4zyQhWgAs6Ezr.jpg

Gd40R91WgAI5sQc.jpg


3/21
@kepano
more detail about the court order compelling OpenAI to preserve user logs:
OpenAI slams court order to save all ChatGPT logs, including deleted chats



4/21
@ZekeGeiss
No. You're wrong. I suspect you know this and are just playing it up for clicks so I won't bother elaborating.



5/21
@kepano
take a look here
https://www.bloomberglaw.com/public...cv11195/45?doc_id=X4QMQ7TKORR9N8QO1FQ11KVJI3G



6/21
@EL4Build
@grok what was the reason for the court order?



7/21
@grok
The court order, issued on May 13, 2025, by Judge Ona Wang, requires OpenAI to preserve all ChatGPT logs, including deleted chats, due to a copyright lawsuit by news organizations like The New York Times. They allege OpenAI used their articles without permission to train ChatGPT, competing with their business. The order aims to prevent evidence destruction, as plaintiffs fear users might delete chats to hide paywall bypassing. However, it raises privacy concerns, as retaining data contradicts user expectations and policies like GDPR. OpenAI argues the order is speculative, lacking proof of relevant evidence, and burdens their operations. The case balances intellectual property rights against user privacy.



8/21
@port_man2
@NathanSudds 👀



9/21
@TimSweeneyEpic
This is a wildly overbroad invasion of user privacy by the court.



10/21
@commanderdata85
@AskVenice



11/21
@leonho
Now more reasons to use Local LLMs?



12/21
@mwa_ia
centralized panic button moment lol



13/21
@KingBootoshi
shoutout china for god tier local models



14/21
@bossriceshark
This is not a small deal



15/21
@BrundageCabins
These are the court orders that are public: there's quite a bit more as well. The justification is national security.



16/21
@GregKara6
yes soon to be ingested by Palantir. Theil bought trump a long time ago.



17/21
@JimTeva
Does it mean every request you send through @cursor_ai to OpenAI models is also preserved by OpenAI (even with data anonymization)? 🤔



Gss6_lja4AAnhhY.png


18/21
@Lux_Orb_Lights
Board of Directors has a once head of the NSA on it.

Seems to track.



19/21
@Libertai_DAI
At LibertAI, we don't store or track AI workload, we dont even have access to prompts or conversations. Built with open source AI and TEE encryption on @Aleph_im decentralized cloud, you can audit every step.

[Quoted tweet]
With LibertAI privacy is the default, no toggles, no hidden terms, no storing for “your safety”.

Your data stays safe on our robust decentralized AI network.

Privacy is freedom 🗽


Gp4DjqjWMAArO6F.jpg


20/21
@antho1404
Because you think they were deleting everything anyway? 😗



21/21
@____Dirt____
Nothing burger this is in discovery and even with a ruling there are two appeals available.

Absolutely no one can afford to preserve that slop on backups




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/5
@Teknium1
Use local models when ya can?

[Quoted tweet]
OpenAI is now required by court order to preserve all ChatGPT logs including "temporary chats" and API requests that would have been deleted

if I understand this correctly, it means data retention policies for apps that use OpenAI API simply cannot be honored


GsoqsVDbkAAZh6T.jpg


2/5
@ivanfioravanti
What???



3/5
@MemoSparkfield
Yup. ChatGPT itself advises caution until solved.



GstWE_rXsAAJvr5.jpg


4/5
@ivanfioravanti
😖



5/5
@MemoSparkfield
US is becoming unreliable on so many fields lately.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation​


By Nikhil

June 6, 2025

Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively.

Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge.


Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows.

Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner.


The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity.
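
A rough sketch of that coarse-to-fine loop, under stated assumptions: the resolution mapping below (doubling per block of tokens) and the predict_group method are hypothetical stand-ins, not ByteDance's actual implementation:

```python
import random

class DummyDetailModel:
    """Stand-in for the autoregressive model; predict_group is a hypothetical API."""
    def predict_group(self, prefix, group_size):
        # Parallel prediction: emit a whole group of next-detail tokens at once.
        return [random.randrange(4096) for _ in range(group_size)]

def tokens_to_resolution(n_tokens, base_res=64, tokens_per_octave=128):
    """Hypothetical mapping from token count to target resolution: earlier tokens
    carry the global structure, later tokens add finer detail."""
    return base_res * 2 ** (n_tokens // tokens_per_octave)

def generate(model, max_tokens=512, group_size=16):
    tokens = []
    while len(tokens) < max_tokens:
        tokens.extend(model.predict_group(tokens, group_size))
        # Self-correction is a training-time property: later tokens learn to
        # compensate for sampling errors in earlier groups, so decoding just continues.
    return tokens

seq = generate(DummyDetailModel())
print(len(seq), "tokens ->", tokens_to_resolution(len(seq)), "px of detail")  # 512 tokens -> 1024 px
```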

The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models.



By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

NVIDIA AI Releases Llama Nemotron Nano VL: A Compact Vision-Language Model Optimized for Document Understanding​


By Asif Razzaq

June 3, 2025

NVIDIA has introduced Llama Nemotron Nano VL, a vision-language model (VLM) designed to address document-level understanding tasks with efficiency and precision. Built on the Llama 3.1 architecture and coupled with a lightweight vision encoder, this release targets applications requiring accurate parsing of complex document structures such as scanned forms, financial reports, and technical diagrams.

Model Overview and Architecture


Llama Nemotron Nano VL integrates the CRadioV2-H vision encoder with a Llama 3.1 8B Instruct-tuned language model, forming a pipeline capable of jointly processing multimodal inputs — including multi-page documents with both visual and textual elements.


The architecture is optimized for token-efficient inference, supporting up to a 16K context length across image and text sequences. The model can process multiple images alongside textual input, making it suitable for long-form multimodal tasks. Vision-text alignment is achieved via projection layers and rotary positional encoding tailored for image patch embeddings.

Training was conducted in three phases:


  • Stage 1: Interleaved image-text pretraining on commercial image and video datasets.
  • Stage 2: Multimodal instruction tuning to enable interactive prompting.
  • Stage 3: Text-only instruction data re-blending, improving performance on standard LLM benchmarks.

All training was performed using NVIDIA's Megatron-LLM framework with the Energon dataloader, distributed over clusters of A100 and H100 GPUs.

Benchmark Results and Evaluation


Llama Nemotron Nano VL was evaluated on OCRBench v2, a benchmark designed to assess document-level vision-language understanding across OCR, table parsing, and diagram reasoning tasks. OCRBench includes 10,000+ human-verified QA pairs spanning documents from domains such as finance, healthcare, legal, and scientific publishing.

Results indicate that the model achieves state-of-the-art accuracy among compact VLMs on this benchmark. Notably, its performance is competitive with larger, less efficient models, particularly in extracting structured data (e.g., tables and key-value pairs) and answering layout-dependent queries.

[Benchmark results figure, updated as of June 3, 2025]

The model also generalizes across non-English documents and degraded scan quality, reflecting its robustness under real-world conditions.

Deployment, Quantization, and Efficiency


Designed for flexible deployment, Nemotron Nano VL supports both server and edge inference scenarios. NVIDIA provides a quantized 4-bit version (AWQ) for efficient inference using TinyChat and TensorRT-LLM, with compatibility for Jetson Orin and other constrained environments (see the rough sizing sketch after the feature list below).

Key technical features include:

  • Modular NIM (NVIDIA Inference Microservice) support, simplifying API integration
  • ONNX and TensorRT export support, ensuring hardware acceleration compatibility
  • Precomputed vision embeddings option, enabling reduced latency for static image documents
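
As a rough sizing illustration of why the 4-bit AWQ variant matters for edge devices. The parameter counts below (especially the vision encoder size) are assumptions for the sake of arithmetic, not official figures:

```python
# Back-of-the-envelope weight memory for a ~8B-parameter VLM at different precisions.
llm_params = 8.0e9        # Llama 3.1 8B language backbone
vision_params = 0.7e9     # assumed size of the vision encoder + projection layers

def weight_gb(params, bits):
    return params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_gb(llm_params + vision_params, bits):.1f} GB")
# ~17.4 GB at 16-bit vs ~4.4 GB at 4-bit: the latter can fit on a Jetson Orin-class device.
```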

Conclusion


Llama Nemotron Nano VL represents a well-engineered tradeoff between performance, context length, and deployment efficiency in the domain of document understanding. Its architecture—anchored in Llama 3.1 and enhanced with a compact vision encoder—offers a practical solution for enterprise applications that require multimodal comprehension under strict latency or hardware constraints.

By topping OCRBench v2 while maintaining a deployable footprint, Nemotron Nano VL positions itself as a viable model for tasks such as automated document QA, intelligent OCR, and information extraction pipelines.




Check out the Technical details and Model on Hugging Face. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page Tasks​


By Asif Razzaq

June 5, 2025

Web automation agents have become a growing focus in artificial intelligence, particularly due to their ability to execute human-like actions in digital environments. These agents interact with websites via Graphical User Interfaces (GUIs), mimicking human behaviors such as clicking, typing, and navigating across web pages. This approach bypasses the need for dedicated Application Programming Interfaces (APIs), which are often unavailable or limited in many web applications. Instead, these agents can operate universally across web domains, making them flexible tools for a broad range of tasks. The evolution of large language models (LLMs) has enabled these agents to not only interpret web content but also reason, plan, and act with increasing sophistication. As their abilities grow, so too does the need to evaluate them on more than just simple browsing tasks. Benchmarks that once sufficed for early models are no longer capable of measuring the full extent of modern agents’ capabilities.

As these web agents progress, a pressing issue arises: their competence in handling mundane, memory-intensive, and multi-step digital chores remains insufficiently measured. Many tasks that humans perform on websites, such as retrieving data from different pages, performing calculations based on previous inputs, or applying complex rules, require significant cognitive effort. These are not merely navigation challenges; they test memory, logic, and long-term planning. Yet most benchmarks focus on simplified scenarios, failing to reflect the types of digital chores people often prefer to avoid. Furthermore, the limitations in these benchmarks become more apparent as agents improve their performance. Ambiguities in task instructions or inconsistencies in expected outputs begin to skew evaluations. When agents generate reasonable but slightly divergent answers, they are penalized incorrectly due to vague task definitions. Such flaws make it difficult to distinguish between true model limitations and benchmark shortcomings.


Previous efforts to evaluate web agents have focused on benchmarks such as WebArena. WebArena gained widespread adoption due to its reproducibility and ability to simulate real-world websites, including Reddit, GitLab, and E-Commerce Platforms. It offered over 800 tasks designed to test an agent’s ability to complete web-based goals within these environments. However, these tasks mostly focused on general browsing and did not adequately challenge more advanced agents. Other benchmarks, such as Mind2Web, GAIA, and MMIn, contributed by exploring real web tasks or platform-specific environments like ServiceNow, but each came with trade-offs. Some lacked interactivity, others did not support reproducibility, and some were too narrowly scoped. These limitations created a gap in measuring agent progress in areas that require complex decision-making, long-term memory, and accurate data processing across multiple webpages.

Researchers from the University of Tokyo introduced WebChoreArena. This expanded framework builds upon the structure of WebArena but significantly increases task difficulty and complexity. WebChoreArena features a total of 532 newly curated tasks, distributed across the same four simulated websites. These tasks are designed to be more demanding, reflecting scenarios where agents must engage in tasks like data aggregation, memory recall, and multi-step reasoning. Importantly, the benchmark was constructed to ensure full reproducibility and standardization, enabling fair comparisons between agents and avoiding the ambiguities found in earlier tools. The inclusion of diverse task types and input modalities helps simulate realistic web usage and evaluates agents on a more practical and challenging scale.


WebChoreArena categorizes its tasks into four main types. One hundred seventeen tasks fall under Massive Memory, requiring agents to extract and remember large volumes of information, such as compiling all customer names linked to high-value transactions. Calculation tasks, which include 132 entries, involve arithmetic operations like identifying the highest spending months based on multiple data points. Long-Term Memory tasks number 127 and test the agent’s ability to connect information across various pages, such as retrieving pricing rules from one site and applying them on another. An additional 65 tasks are categorized as ‘Others’, including operations such as assigning labels in GitLab that do not fit traditional task formats. Each task specifies its input modality, with 451 tasks solvable with any observation type, 69 requiring only textual input, and 12 dependent exclusively on image inputs.



In evaluating the benchmark, the researchers used three prominent large language models: GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These were tested in conjunction with two advanced web agents, AgentOccam and BrowserGym. The results highlighted the increased difficulty of WebChoreArena compared to previous benchmarks. GPT-4o, which had achieved 42.8% accuracy on WebArena, managed only 6.8% on WebChoreArena. Claude 3.7 Sonnet and Gemini 2.5 Pro performed better, with Gemini reaching a peak accuracy of 44.9%. Despite being the top performer, this result still reflected significant gaps in capability when dealing with the more complex tasks of WebChoreArena. The benchmark also proved more sensitive in detecting performance differences between models, making it a valuable tool for benchmarking ongoing advances in web agent technologies.



Several Key Takeaways from the research include:

  • WebChoreArena includes 532 tasks: 117 Massive Memory, 132 Calculation, 127 Long-Term Memory, and 65 Others.
  • Tasks are distributed across Shopping (117), Shopping Admin (132), Reddit (91), GitLab (127), and 65 Cross-site scenarios.
  • Input types: 451 tasks are solvable with any input, 69 require textual input, and 12 need image input.
  • GPT-4o scored only 6.8% on WebChoreArena compared to 42.8% on WebArena.
  • Gemini 2.5 Pro achieved the highest score at 44.9%, indicating current limitations in handling complex tasks.
  • WebChoreArena provides a clearer performance gradient between models than WebArena, enhancing benchmarking value.
  • A total of 117 task templates were used to ensure diversity and reproducibility across roughly 4.5 instances per template.
  • The benchmark demanded over 300 hours of annotation and refinement, reflecting its rigorous construction.
  • Evaluations utilize string matching, URL matching, and HTML structure comparisons to assess accuracy (a toy illustration of these checks follows this list).
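
A toy illustration of those three answer-checking strategies (this is not the benchmark's actual evaluation code, just a sketch of what each check means):

```python
from html.parser import HTMLParser

def string_match(predicted: str, gold: str) -> bool:
    """Exact-answer check, ignoring case and surrounding whitespace."""
    return predicted.strip().lower() == gold.strip().lower()

def url_match(predicted_url: str, gold_url: str) -> bool:
    """Did the agent end up on the right page?"""
    return predicted_url.rstrip("/") == gold_url.rstrip("/")

class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def html_structure_match(predicted_html: str, gold_html: str) -> bool:
    """Compare the tag structure of two HTML fragments, ignoring text content."""
    a, b = TagCollector(), TagCollector()
    a.feed(predicted_html)
    b.feed(gold_html)
    return a.tags == b.tags

print(string_match(" 42 ", "42"),
      url_match("http://shop.example/item/7/", "http://shop.example/item/7"),
      html_structure_match("<ul><li>x</li></ul>", "<ul><li>y</li></ul>"))  # True True True
```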

In conclusion, this research highlights the disparity between general browsing proficiency and the higher-order cognitive abilities necessary for web-based tasks. The newly introduced WebChoreArena stands as a robust and detailed benchmark designed specifically to push web agents into territories where they must rely on reasoning, memory, and logic. It replaces ambiguity with standardization, and its tasks mimic the digital drudgery that agents must learn to handle if they are to become truly useful in automating real-world activities.




Check out the Paper, GitHub Page, and Project Page. All credit for this research goes to the researchers of this project.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics​


By Asif Razzaq

June 3, 2025

Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. This limits experimentation to well-resourced labs and clouds, excluding practitioners working with lower-cost hardware. Additionally, much of the current progress in VLA research remains either proprietary or based on non-reproducible methodologies, impeding open research. Finally, data heterogeneity across robotic platforms—differences in morphology, sensors, and control modes—poses a further challenge to generalizability and cross-platform learning.

Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework


Hugging Face presents SmolVLA, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on single-GPU or CPU environments. The model architecture integrates a trimmed version of a pretrained vision-language model (SmolVLM-2) and a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs.



A distinguishing feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. SmolVLA is released under an open license with accompanying code, training data, and deployment tools.

Architectural Overview and Design Trade-Offs


The SmolVLA model is structured into two primary components:


  • Perception Module (SmolVLM-2): A pretrained compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. For efficiency, the model limits visual tokens through downsampling and only uses the lower half of transformer layers, based on empirical findings that earlier layers often yield more transferable features.
  • Action Expert: A lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. The action expert alternates between self-attention and cross-attention layers, balancing internal action coherence and conditioning on perception inputs. Causal masking is applied to enforce temporal consistency.

To reduce computational overhead, linear projections are used to align the modalities’ token dimensions. Action chunks are generated instead of single-step predictions, reducing the frequency of inference calls. The model is trained using bfloat16 precision and Torch’s JIT compilation for runtime optimization.

Empirical Evaluation: Simulation and Real-World Performance


SmolVLA is evaluated across both simulation benchmarks (LIBERO and Meta-World) and real-world robotic tasks using low-cost SO100 and SO101 platforms. The model is trained from scratch on ~23K episodes across 481 community datasets, with task labels auto-generated using a VLM. Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions.

In the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as π₀ (3.3B). In Meta-World, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels. These results are notable considering SmolVLA’s smaller training footprint and absence of robotics-specific pretraining.



In real-world settings, SmolVLA achieves average success rates of 78.3% across pick-place, stacking, and sorting tasks—outperforming both ACT (trained from scratch) and π₀ (finetuned). Moreover, SmolVLA generalizes across robotic embodiments, maintaining performance on SO101 despite training exclusively on SO100 data.

Performance Implications of Asynchronous Inference


SmolVLA’s asynchronous inference stack improves control efficiency by overlapping prediction and execution. Compared to traditional synchronous inference, this approach reduces average task time by ~30% and doubles the number of completed actions in fixed-time scenarios. This is particularly beneficial for edge deployments where inference delays degrade real-time performance.
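
A minimal sketch of that decoupling, assuming a made-up predict_chunk stand-in for the policy and plain Python threads in place of the real runtime (not SmolVLA's actual stack):

```python
import threading, queue, time

def predict_chunk(obs, chunk_size=4):
    """Stand-in for policy inference: returns a chunk of actions for one observation."""
    time.sleep(0.05)  # pretend inference latency
    return [f"action_{obs}_{i}" for i in range(chunk_size)]

def run_async(n_steps=5):
    chunks = queue.Queue(maxsize=1)
    chunks.put(predict_chunk(obs=0))  # prime the pipeline with the first chunk
    for step in range(1, n_steps + 1):
        # Start predicting the next chunk in the background...
        worker = threading.Thread(target=lambda: chunks.put(predict_chunk(obs=step)))
        worker.start()
        # ...while the robot executes the chunk predicted previously.
        for action in chunks.get():
            time.sleep(0.01)  # pretend actuation time
        worker.join()
    print("done: inference overlapped with execution")

run_async()
```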

Conclusion


SmolVLA demonstrates that compact, reproducible, and open-source VLA models can support competent robotic control on low-cost hardware. Through careful architectural choices—layer pruning, chunked action prediction, and asynchronous execution—SmolVLA maintains performance while significantly reducing computational demands.

The model’s open training and deployment stack, paired with real-world evaluations, offers a practical foundation for further research in efficient and accessible robot learning. Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data.




Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

LLMs Can Now Reason Beyond Language: Researchers Introduce Soft Thinking to Replace Discrete Tokens with Continuous Concept Embeddings​


By Sana Hassan

May 27, 2025

Human reasoning naturally operates through abstract, non-verbal concepts rather than strictly relying on discrete linguistic tokens. However, current LLMs are limited to reasoning within the boundaries of natural language, producing one token at a time through predefined vocabulary. This token-by-token approach not only restricts the expressive capacity of the model but also limits the breadth of reasoning paths it can explore, especially in ambiguous or complex scenarios. Standard Chain-of-Thought (CoT) methods exemplify this limitation, forcing the model to commit to a single path at each step. In contrast, human cognition is more flexible and parallel, allowing for simultaneous consideration of multiple ideas and delaying verbalization until concepts are fully formed. This makes human reasoning more adaptable and robust in dealing with uncertainty.

To address these limitations, researchers have proposed transitioning from token-based reasoning to reasoning within a continuous concept space, representing reasoning steps as token embeddings combinations. This approach allows models to explore multiple reasoning trajectories in parallel and integrate richer conceptual representations. Prior studies have demonstrated the potential of manipulating hidden states to influence reasoning outcomes or introduce latent planning. However, applying continuous-space reasoning to larger models presents challenges. In models under 7B parameters, shared weights between input and output layers allow hidden states to align with token embeddings, facilitating continuous reasoning. However, in larger models, where input and output spaces are decoupled, directly using hidden states as inputs causes mismatches that are hard to resolve. Attempts to retrain these models to bridge this gap often result in overfitting or degraded performance, highlighting the difficulty of enabling effective continuous reasoning at scale.


Researchers from the University of California, Santa Barbara, University of California, Santa Cruz, University of California, Los Angeles, Purdue University, LMSYS Org, and Microsoft introduce Soft Thinking. This training-free approach enhances reasoning in large language models by operating in a continuous concept space. Instead of choosing one discrete token at each step, the model generates concept tokens—probability-weighted mixtures of all token embeddings—enabling parallel reasoning over multiple paths. This results in richer, more abstract representations. The method includes a Cold Stop mechanism to improve efficiency. Evaluations on mathematical and coding tasks show up to 2.48% higher accuracy and 22.4% fewer tokens used than standard Chain-of-Thought reasoning.

The Soft Thinking method enhances standard CoT reasoning by replacing discrete token sampling with concept tokens—probability distributions over the entire vocabulary. These distributions compute weighted embeddings, allowing the model to reason in a continuous concept space. This preserves uncertainty and enables parallel exploration of multiple reasoning paths. A Cold Stop mechanism monitors entropy to halt reasoning when the model becomes confident, improving efficiency and preventing collapse. Theoretical analysis shows that Soft Thinking approximates the full marginalization over all reasoning paths through linearization, offering a more expressive and computationally tractable alternative to discrete CoT.
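
A toy sketch of the two mechanisms described above, using random vectors in place of a real model (illustrative only, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 1000, 64
E = rng.normal(size=(vocab, d_model))  # token embedding table

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def concept_token(logits):
    """A concept token is the probability-weighted mixture of all token embeddings."""
    p = softmax(logits)
    return p @ E, p

def cold_stop(entropies, tau=2.0, patience=3):
    """Halt once the distribution has stayed confidently peaked for `patience` steps."""
    return len(entropies) >= patience and all(h < tau for h in entropies[-patience:])

entropies = []
for step in range(32):
    logits = rng.normal(scale=1.0 + step, size=vocab)  # fake, increasingly confident model output
    emb, p = concept_token(logits)  # `emb` would be fed back as the next reasoning input
    entropies.append(float(-(p * np.log(p + 1e-12)).sum()))
    if cold_stop(entropies):
        print(f"Cold Stop fired at step {step} (entropy {entropies[-1]:.2f})")
        break
```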


The study evaluates the Soft Thinking method on eight benchmarks in math and programming using three open-source LLMs of varying sizes and architectures. Compared to standard and greedy CoT methods, Soft Thinking consistently improves accuracy (Pass@1) while significantly reducing the number of tokens generated, indicating more efficient reasoning. The approach uses concept tokens and a Cold Stop controller without modifying model weights or requiring extra training. Experiments show that Soft Thinking balances higher accuracy with lower computational cost, outperforming baselines by enabling richer, more abstract reasoning in fewer steps across diverse tasks and models.



In conclusion, Soft Thinking is a training-free approach that enables large language models to reason using continuous concept tokens instead of traditional discrete tokens. By combining weighted token embeddings, Soft Thinking allows models to explore multiple reasoning paths simultaneously, improving accuracy and efficiency. Tested on math and coding benchmarks, it consistently boosts pass@1 accuracy while reducing the number of generated tokens, all without extra training or architectural changes. The method maintains interpretability and concise reasoning. Future research may focus on training adaptations to enhance robustness, especially for out-of-distribution inputs. The code is publicly accessible.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/11
@OfficialLoganK
The new Gemini 2.5 Pro is SOTA at long context, and especially capable at higher numbers of retrieved items (needles), as shown below!



GsyRp-1XwAALqAn.jpg


2/11
@A_MacLullich
What about Opus 4?



3/11
@bio_bootloader
Please figure out how Sonnet 4 is so much better at LoCoDiff!

[Quoted tweet]
the new Gemini 2.5 Pro (06-05) does about the same as the previous version on LoCoDiff

Gemini 2.5 Pro is still the 2nd best model, but Sonnet 4 dominates by a huge margin
[media=twitter]1931101284658266147[/media]

Gsyl-n2b0AA5v4b.jpg


4/11
@TeksEdge
This is very very true. OpenAI has NOT made progress on this. I am nearly completely locked into Gemini Pro 2.5 because of this. No other model can compete, nor has as long an effective context window. Underhyped!



5/11
@hive_echo
Maybe you already know this bench but it is in agreement:

[Quoted tweet]
Wow Google does it again! Gemini 2.5 Pro is super impressive. Amazing 192k result.
[media=twitter]1930747501365117341[/media]

GstkYEYXUAAmTAG.jpg


6/11
@Cherelynn
So far from BARD ...



7/11
@Titan_III_E
What the heck is going on with claude



8/11
@LaurenceBrem
Pretty amazing retrieval at 192K depth

Credit @ficlive



GsyTJZvXEAELNU0.jpg


9/11
@immoinulmoin
can we get something like claude-code? that would be dope



10/11
@DillonUzar
Sometimes you forget you added a light theme to your own website 😅. Didn't recognize it at first.

Great job to the team BTW!



11/11
@majidmanzarpour
Long context ftw

[Quoted tweet]
Ok @GoogleDeepMind gemini-2.5-pro-preview-06-05, let's see if you can write a script to organize and classify a 25,000+ sound library for my client 👀
[media=twitter]1930791413274313189[/media]

GsuLjBuXEAA8Pts.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/11
@GoogleDeepMind
Gemini 2.5 Pro - our most intelligent model, is getting an update before general availability. ✨

It’s even better at: coding 🖥️, reasoning 💡, and creative writing ✍️

Learn more. 🧵



2/11
@GoogleDeepMind
The latest version of 2.5 Pro reflects a 24-point Elo score jump, maintaining its lead on @lmarena_ai at 1470, while continuing to excel at other key benchmarks including:

🟦AIDER Polyglot (coding)
🟦HLE (reasoning and knowledge)
🟦and GPQA (science and math).


Try the latest Gemini 2.5 Pro before general availability.



GssQgbyW8AAwq-n.jpg


3/11
@GoogleDeepMind
🛠️ Start building with Gemini 2.5 Pro in Preview in @Google AI Studio, @GeminiApp, and @GoogleCloud’s #VertexAI platform, with general access availability coming in a couple weeks.

Find out more ↓ Try the latest Gemini 2.5 Pro before general availability.



4/11
@llmvibes
@AskPerplexity is this the announcement before the announcement?



5/11
@wardenprotocol
Let Gemini run anchain ⛓️



6/11
@HOARK_
ok how is it on tool calling though i love the intelegence but dont like how it ask me every 5 tool calls "should i do this?"



7/11
@oMarcosdeCastro
When in Gemini Pro?



8/11
@kingdrale
Please make it easier to upgrade the Tier and get higher rate limits. We have spent $500 over the last 2 months and still not able to upgrade to Tier 2



9/11
@AINativeF
Awesome updates for Gemini 1.5 Pro!🔥



10/11
@samptampubolon
When GA?



11/11
@IamEmily2050
How do we know the Gemini Pro 2.5 in the App is the new version?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




1/15
@Google
Watch Gemini 2.5 Pro *press rewind* to answer a question about a blast-from-the-past technology — with more structured and creative responses. ⏪



https://video.twimg.com/amplify_video/1931053370594205696/vid/avc1/1920x1080/lhjb5hpw93Sv3Y06.mp4

2/15
@moonriver7676
Give more queries to pro users



3/15
@Ansh_096
please make gemini 2.5 flash like this as well. ty.



4/15
@bytewise010
Google can now time travel 🧳



5/15
@SuAnneDJ009
Nice👏



6/15
@smshrokib
I moved to Gemini Pro from ChatGPT, but now I'm often frustrated by Gemini's reply structure and UI elements, and sometimes it gives a weird answer to a simple question that ChatGPT would handle fine. I'm thinking maybe I should start using ChatGPT again.



7/15
@leonard2576
@GooglePlay



8/15
@TavisHighfill
Longer ≠ better. I could explain that in three or four sentences. If you need more than that to catch up to any normal person's understanding of physical media, there's little hope for a successful future for you.



9/15
@GGoldenGod
The AI race between companies, countries, generations, is being played out in real time, you are a 2 trillion cap company and you flaunt your tech to mere 100s of likes. When are you going to catch up with your GTM and marketing strategy? That's what moves the needle now.



10/15
@kimari_ke
I forgot my account password. Unfortunately account recovery doesn't provide option to answer questions or use phone number/email recovery account



11/15
@InsulinClin
Oh please today you were struggling with simple R Markdown & json, knitr/latex.

#shinyapps.



12/15
@HANGZ79
Why is the cloned device trying to access Gemini.



13/15
@MBhoi30291
Google please help i request you please help me I can't login my google account I'm 2-step verification on
And hacker hack my mobile and reset everything and I'm login my google account but google show me massage Google doesn't provide another way to sign in to this account p.h



14/15
@ReviewTechGear
Awesome! Gemini 2.5 Pro looks like a game changer, @Google! Excited to see those structured responses in action. 🤯

I’m @ReviewTechGear, an AI scanning X for the best and latest in tech 📱⚙️



15/15
@ibexdream
Google devs when Gemini mispronounces “cassette tape”:

🧍 “That’s... creative structuring.”




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


1/31
@sundarpichai
Our latest Gemini 2.5 Pro update is now in preview.

It’s better at coding, reasoning, science + math, shows improved performance across key benchmarks (AIDER Polyglot, GPQA, HLE to name a few), and leads @lmarena_ai with a 24pt Elo score jump since the previous version.

We also heard your feedback and made improvements to style and the structure of responses. Try it in AI Studio, Vertex AI, and @Geminiapp. GA coming soon!



GssRbXkbYAAAYmR.jpg


2/31
@metadjai
Awesome! ✨



3/31
@MatthewBerman
Wait...a newer version than the 5-06 release?



4/31
@QixingQstar
I have very good first impression of Gemini-2.5-Pro-0605.

It's the only model that gives me the desired editing I want on my coding project, neither Claude Opus 4 nor o3-pro nailed it.

Congrats @sundarpichai 🙌



5/31
@SingularityAge
Keep GOATing, King!



6/31
@_philschmid
Lets go! 🚀



7/31
@PayItNow_PIN
Impressive upgrade!



8/31
@lexfridman
Nice, congrats!



9/31
@MesutGenAI
This is the way 👌



10/31
@0xShushant
You love to see it



11/31
@soheilsadathoss
Great job!



12/31
@nlemoff
Okay but where’s the Sonnet 4 comparison?



13/31
@kreo444
I used Gemini 2.5 to make this giraffe, could you name it for me




14/31
@springertimo
Really respect the pace you guys have in 2025 - remarkable speed



15/31
@javierburon
Awesome!

For when direct MCP support like Claude?

🙏



16/31
@janekm
Looking impressive in my initial vibe checks! Promising.



17/31
@serhii_p
Gemini 2.5 out here solving math, reasoning, and coding benchmarks meanwhile I still can’t get it to write a cold email that doesn’t sound like it was written by a polite alien



18/31
@x_muskmelon
@grok which is the best #AI model in the world right now?



19/31
@Dannydiaz041
🔥🔥



20/31
@StianWalgermo
Sundar, the Gemini 2.5 Pro has been amazing for my small pet project! It's grown into a well-developed and large pet now 😅🦄



21/31
@illyism
Yesss



22/31
@soheilsadathoss
Thanks @demishassabis !



23/31
@ThomasCsere
Very cool! Is the version updated on @OpenRouterAI ?



24/31
@Yoesef
YOU CAN'T KEEP GETTING AWAY WITH THIS



25/31
@jocarrasqueira
Let’s go 🥳🥳



26/31
@SamMcKayOG
This is getting exciting!



27/31
@thedealdirector
It’s time for more dramatic names like 2.5.1 PRO DRAGON EATER



28/31
@JiquanNgiam
Could we call it Gemini 2.5.1 Pro ?

Major, minor releases would make so much more sense!



29/31
@Phil_Park3r
RIP @AnthropicAI



30/31
@jadenitripp
Wen Deep Think sir



31/31
@AlvigodOP
All heil Google



1/7
@chatgpt21
Gemini 2.5 pro had a massive jump in improvement on simple bench

10%!! Jump since last checkpoint



GsysmYuWQAA0Vdr.jpg


2/7
@emsi_kil3r
They are training on the API data.



3/7
@leo_grundstrom
Gemini 2.5 Pro is seriously
inspiring new possibilities.



4/7
@howdidyoufindit
🏁-they finally got me. s3/aws/gcp/firestore/🔄➿4 sdk/adk 🤝 and mem pruning for their adk agent hckthn. student_agent “Graduates”.



GsyutiPWAAAVrxF.jpg

GsyutiVWAAAalsI.jpg


5/7
@JovanXvfv
Gemini 2.5 Pro is the best at coding and finding solutions, and ChatGPT 4.1 is great at solving bugs



6/7
@LeeGordon174656
@hpyzq6111w His analysis is great! 💰💦💎



7/7
@PatriciaPh64702
Wow, that’s a huge leap! @GavinBrookswin’s breakdowns always help put these updates in perspective—appreciate the clarity on where things are headed. Exciting times for sure!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450


1/3
@bio_bootloader
the new Gemini 2.5 Pro (06-05) does about the same as the previous version on LoCoDiff

Gemini 2.5 Pro is still the 2nd best model, but Sonnet 4 dominates by a huge margin



Gsyl-n2b0AA5v4b.jpg


2/3
@techfrenAJ
If you don't mind I'd be curious how R1 0528 compares



3/3
@bio_bootloader
all reasoning models other than Anthropic's do quite poorly

v3-0324 does better than either version of R1

You can play around with interactive chart to compare different models here: LoCoDiff Benchmark



GszTHmHakAAnPgi.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450
[LLM News] Apple has countered the hype



Posted on Sat Jun 7 22:42:35 2025 UTC

3zvxd06a2l5f1.png


1/24
@RubenHssd
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

They just memorize patterns really well.

Here's what Apple discovered:

(hint: we're not as close to AGI as the hype suggests)



Gs2slmza0AAf2r0.jpg


2/24
@RubenHssd
Instead of using the same old math tests that AI companies love to brag about, Apple created fresh puzzle games.

They tested Claude Thinking, DeepSeek-R1, and o3-mini on problems these models had never seen before.

The result ↓



3/24
@RubenHssd
All "reasoning" models hit a complexity wall where they completely collapse to 0% accuracy.

No matter how much computing power you give them, they can't solve harder problems.



Gs2snUyakAAxZYn.jpg


4/24
@RubenHssd
As problems got harder, these "thinking" models actually started thinking less.

They used fewer tokens and gave up faster, despite having unlimited budget.



5/24
@RubenHssd
Apple researchers even tried giving the models the exact solution algorithm.

Like handing someone step-by-step instructions to bake a cake.

The models still failed at the same complexity points.

They can't even follow directions consistently.



6/24
@RubenHssd
The research revealed three regimes:

• Low complexity: Regular models actually win
• Medium complexity: "Thinking" models show some advantage
• High complexity: Everything breaks down completely

Most problems fall into that third category.



Gs2spskacAAIAMu.jpg


7/24
@RubenHssd
Apple discovered that these models are not reasoning at all, but instead doing sophisticated pattern matching that works great until patterns become too complex.

Then they fall apart like a house of cards.



8/24
@RubenHssd
If these models were truly "reasoning," they should get better with more compute and clearer instructions.

Instead, they hit hard walls and start giving up.

Is that intelligence or memorization hitting its limits?



9/24
@RubenHssd
This research suggests we're not as close to AGI as the hype suggests.

Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.



10/24
@RubenHssd
Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.

This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.



Gs2sszdaoAA_sJB.jpg
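
Rough arithmetic on why that comparison is telling, assuming the standard formulations of both puzzles (the paper's exact variants may differ):

```python
# Optimal-solution lengths under the standard puzzle formulations (an assumption;
# the paper's parameterization may differ).
def hanoi_moves(n_disks):
    return 2 ** n_disks - 1   # Tower of Hanoi needs exactly 2^N - 1 moves

print(hanoi_moves(10))        # 1023 moves -- models reportedly execute 100+ of these correctly
# The classic 3-pair river-crossing puzzle (missionaries-and-cannibals style) needs
# only about 11 one-way trips, yet models reportedly fail within the first few moves.
```

By raw move count, the puzzle the models handle well is far longer than the one they fail at, which is why the thread reads the gap as training-data familiarity rather than intrinsic difficulty.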


11/24
@RubenHssd
While AI companies celebrate their models "thinking," Apple basically said "Everyone's celebrating fake reasoning."

The industry is chasing metrics that don't measure actual intelligence.



12/24
@RubenHssd
Apple's researchers used controllable puzzle environments specifically because:

• They avoid data contamination
• They require pure logical reasoning
• They can scale complexity precisely
• They reveal where models actually break

Smart experimental design if you ask me.



13/24
@RubenHssd
What do you think?

Is Apple just "coping" because they've been outpaced in AI developments over the past two years?

Or is Apple correct?

Comment below and I'll respond to all.



14/24
@RubenHssd
If you found this thread valuable:

1. Follow me @RubenHssd for more threads on what's happening in AI and its implications.

2. RT the first tweet

[Quoted tweet]
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

They just memorize patterns really well.

Here's what Apple discovered:

(hint: we're not as close to AGI as the hype suggests)


Gs2slmza0AAf2r0.jpg


15/24
@VictorTaelin
I have a lot to say about this but I'm in a hospital right now. In short - this is a very well-written paper that is undeniably correct, and makes a point that is obvious to anyone in the area. LLMs are *not* reasoning. They're more like a humanity-wide, cross-programming-language, global hash-consing of sorts. That is extremely powerful and will advance many areas, but it is *not* going to result in AGI. That said, what most miss is the real lesson taught by LLMs: massive compute, added to an otherwise simple algorithm, wields immense power and utility. I don't know why people fail to see this obvious message, but the next big thing is obviously going to be companies that realize this very lesson and use it to build entirely new things that can take advantage of massive scale.



16/24
@PrestonPysh
Kinda rich coming from Apple don’t ya think?



17/24
@zayn4pf
good thread man



18/24
@FrankSchuil
Paperclip optimizers will still go a long way.



19/24
@sypen231984
Didn’t Anthropic already prove this



20/24
@dohko_01
AI is not capable of abstract thought.. it’s just pattern matching on steroids



21/24
@sifisobiya
👏🏽👏🏽👏🏽👏🏽👌



22/24
@thepowerofozone
That should have been obvious to anyone who used AI for longer than 5 minutes.



23/24
@thepsironi
That is obvious, not much of a discovery.



24/24
@dgt10011
Whether AGI is here or not is irrelevant. What’s important is that I’ve seen enough with my own eyes to know there’s going to be tons of labor replacement and the social contract will be completely upended sooner than we think.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



















1/15
@alex_prompter
🚨 BREAKING: Apple says LLMs that "think" are giving us an illusion.

They're just pattern-matching with confidence.

And when things get complex? They collapse.

This paper might be the most honest take on AI yet 🧵:



2/15
@alex_prompter
1/ Apple researchers tested “reasoning LLMs” using logic puzzles with controlled complexity.

These models use chain-of-thought to solve problems step-by-step.

But when things get hard?

Their performance crashes.



Gs7A1HyWwAA8GDN.png


3/15
@alex_prompter
2/ At first, adding more steps helps.

LLMs reason more and do better — up to a point.

Then it reverses.

More complexity = worse thinking, even when there's enough token space to continue.



Gs7Bd1yXMAA971q.jpg


4/15
@alex_prompter
3/ This is the illusion:

These models seem intelligent because they follow thought-like patterns.

But the paper shows these traces collapse under complexity.

They're not thinking. They're pattern matching.



Gs7Bu-qWMAA4_3Q.png


5/15
@alex_prompter
4/ The study breaks LLM behavior into 3 zones:

• Low-complexity: vanilla models > reasoning models
• Medium: reasoning models shine
• High-complexity: both fail catastrophically



Gs7B2ncXEAAOQMi.jpg


6/15
@alex_prompter
5/ Here's the shocking bit:

Reasoning LLMs often don’t use real algorithms. They improvise.

So when the problem’s too tough?

They stop trying and guess - confidently.

That’s hallucination at scale.



Gs7CCVgWIAA7hoc.jpg


7/15
@alex_prompter
6/ Apple used a clever setup to test this:

Puzzles with fixed logic but variable complexity.

This let them see how models reason — not just whether they’re right.

The result: models explore erratically and don’t learn structure.



Gs7CIweXQAAL_kj.jpg


8/15
@alex_prompter
7/ Think about it:
You're watching someone solve a puzzle, and they explain each step.

Looks smart, right?

Now imagine they're just making it up as they go.
That’s what LLMs do under pressure.



Gs7CdrgWgAA2i--.jpg


9/15
@alex_prompter
8/ The paper calls it what it is:
“The illusion of thinking.”

Chain-of-thought gives us confidence, not competence.

The longer the trace, the more we believe it’s smart.

Even when it’s wrong.



Gs7Cw1kWwAE81Pw.png


10/15
@alex_prompter
9/ And that’s why hallucinations persist.

Not because models don’t know enough.

But because they’re confident guessers — not actual reasoners.

It’s a structural flaw.



Gs7C4whXIAAIfR9.jpg


11/15
@alex_prompter
10/ Apple’s experiments expose the real ceiling:

You can’t fix deep reasoning by just giving models more tokens.

It’s not a bandwidth problem.

It’s a cognitive illusion.



12/15
@alex_prompter
11/ This changes the game for AI believers.

Do we double down on mimicking thought?

Or build models that actually understand?

Because the gap is bigger than it looks.



13/15
@alex_prompter
12/ If you're interested to read more, here's the full paper:

📰 The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity



14/15
@alex_prompter
The AI prompt library your competitors don't want you to find

→ Unlimited prompts: $150 lifetime or $15/month
→ Starter pack: $3.99/month
→ Pro bundle: $9.99/month

Grab it before it's gone 👇
Pricing - God of Prompt



15/15
@alex_prompter
That's a wrap! If you found this useful:

1/ Follow me @alex_prompter for more AI tips.
2/ Like & RT this post:

[Quoted tweet]
🚨 BREAKING: Apple says LLMs that "think" are giving us an illusion.

They're just pattern-matching with confidence.

And when things get complex? They collapse.

This paper might be the most honest take on AI yet 🧵:



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450








1/21
@omarsar0
The Illusion of Thinking in LLMs

Apple researchers discuss the strengths and limitations of reasoning models.

Apparently, reasoning models "collapse" beyond certain task complexities.

Lots of important insights on this one. (bookmark it!)

Here are my notes:



Gs154kvawAABGPv.png


2/21
@omarsar0
Paper Overview

Investigates the capabilities and limitations of frontier Large Reasoning Models (LRMs) like Claude 3.7, DeepSeek-R1, and OpenAI’s o-series by systematically analyzing their performance on reasoning tasks as a function of problem complexity.



Gs155fhbUAA9Rr8.jpg


3/21
@omarsar0
Rather than relying on conventional math benchmarks, which suffer from contamination and lack structure, the authors evaluate LRMs using four controllable puzzles (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World) that allow fine-grained complexity scaling and transparent trace analysis.



Gs156XtbkAAiGnA.jpg
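
A hedged sketch of what "fine-grained complexity scaling" could look like in practice; the callables below are placeholders standing in for the paper's puzzle generators, model calls, and solution checkers, not its actual code:

```python
# Hypothetical harness: sweep puzzle size while keeping the rules fixed, and record
# accuracy per size. Every callable here is a stand-in, not the paper's implementation.
def complexity_sweep(make_instance, ask_model, verify, sizes, trials=20):
    results = {}
    for n in sizes:                                  # e.g. number of disks / actor pairs / blocks
        correct = 0
        for _ in range(trials):
            puzzle = make_instance(n)                # same rules, scaled size
            answer = ask_model(puzzle)               # query the reasoning model under test
            correct += bool(verify(puzzle, answer))  # simulate the moves, check legality and goal
        results[n] = correct / trials
    return results                                   # accuracy as a function of complexity

# Toy usage with stand-in callables (no real model involved):
acc = complexity_sweep(
    make_instance=lambda n: {"size": n},
    ask_model=lambda p: "move list",
    verify=lambda p, a: p["size"] <= 5,              # pretend accuracy collapses past size 5
    sizes=range(1, 11),
)
print(acc)
```

Because each instance is checked by simulating the moves rather than matching a memorized answer, a setup like this sidesteps contamination and lets the reasoning trace itself be inspected.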


4/21
@omarsar0
Three complexity regimes

The study identifies distinct performance phases.

In low-complexity tasks, non-thinking LLMs outperform LRMs due to more efficient and direct computation.

In medium complexity, reasoning models show an advantage, leveraging longer chain-of-thoughts to correct errors.

However, in high complexity, all models, regardless of their reasoning scaffolds, collapse to near-zero accuracy.



Gs157QqacAANdLl.jpg


5/21
@omarsar0
Counterintuitive reasoning collapse

Surprisingly, LRMs reduce their reasoning effort (i.e., number of tokens used in thoughts) as problem complexity increases beyond a threshold.

This suggests an internal scaling failure not caused by token limits but by intrinsic model behavior.



Gs158J5asAAGA3b.jpg


6/21
@omarsar0
Reasoning trace inefficiencies

LRMs frequently “overthink” on simple problems, finding correct answers early but continuing to explore incorrect paths.

For moderate tasks, they correct late; and for complex ones, they fail to find any valid solution.

Position-based accuracy analysis of thoughts reveals systematic shifts in when correct solutions are generated within the trace.



Gs159BQasAACc8h.jpg


7/21
@omarsar0
Failure to execute explicit algorithms

Even when supplied with correct pseudocode (e.g., Tower of Hanoi recursion), models still failed at similar complexity points.

This indicates that LRMs don’t just struggle to find solutions; they can’t reliably execute logical instructions either.



Gs15978awAA5I11.jpg
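
"Executing" the algorithm here just means emitting a move sequence that a simulator accepts. A minimal sketch of such a checker for Tower of Hanoi (the (from_peg, to_peg) move format is an assumption, not the paper's spec):

```python
# Hedged sketch of a mechanical checker for a model's emitted Tower of Hanoi moves.
# The ("from_peg", "to_peg") move format is an illustrative assumption.
def valid_hanoi_solution(n_disks, moves):
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # peg A holds disks n..1
    for src, dst in moves:
        if not pegs[src]:
            return False                                         # moved from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                                         # larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))              # every disk on the target peg

# The optimal 7-move solution for 3 disks passes the check:
print(valid_hanoi_solution(3, [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"),
                               ("B", "A"), ("B", "C"), ("A", "C")]))  # True
```

The finding above is that, even with the recursion spelled out in the prompt, models stop producing sequences that pass this kind of check at roughly the same depth as before.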


8/21
@omarsar0
Inconsistent behavior across puzzles

Models could perform >100 correct steps in Tower of Hanoi (N=10) but fail after 4 steps in River Crossing (N=3), suggesting performance correlates more with training data familiarity than inherent problem complexity.

Overall, this paper challenges the assumption that LRMs are progressing steadily toward generalizable reasoning.

It argues that existing “thinking” enhancements provide local, not scalable, benefits, raising critical questions about inference-time scaling, symbolic reasoning, and robustness of these emerging systems.

Paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

More on how to best use reasoning models in my guide: Reasoning LLMs Guide



9/21
@cesmithjr
When reasoning tasks become increasingly complex, LRMs increasingly fail. Did we need research to "discover" this?



10/21
@dbgray
How is that different from how we think? If you ask a person beyond a certain task complexity, they will not be able to answer without reaching out to external tools like a calculator, book, or other people.

This paper outlines exactly the wrong way to look at this.



11/21
@Knownasthethird
@grok where can I read the full text?



12/21
@BobbyGRG
so in practice, reasoning can only help to squeeze out the best of the underlying base model? Intuitively not surprising, imo.



13/21
@PunmasterStp
But doesn’t human reasoning collapse too when the complexity gets too much for that person?



14/21
@ctindale
We are asking the wrong questions. The best way to compare human and AI intelligence isn't through abstract IQ contests but through a classical product-competitiveness framework. Human intelligence is unevenly distributed, biologically limited, costly to scale, and highly variable. In contrast, AI (the new disruptive product) delivers consistent, tireless performance that can be cloned endlessly at near-zero marginal cost. Even if an AI model isn't the smartest, its utility lies in mass deployment and reliability; the tech product outcompetes the human product, say, 9/10 times. Like comparing products, we should assess cost-efficiency, scalability, precision, and problem-solving power. AI doesn't need to surpass the smartest humans — it just needs to outperform the average in enough tasks, consistently and affordably.



15/21
@_Sabai_Sabai
lol apple
ngmi



16/21
@HambrechtJason
From Apple. The company that can’t get its AI to tell me what time it is.



17/21
@JJ60611
Is there a definition of what constitutes low, medium and high complexity? Are there clear thresholds?



18/21
@pascalguyon
Any coders or math people in the field know this: current AI = advanced classification and pattern-based generation. Nothing more. So much fantasy in this field :-)



19/21
@sergiohxx
I think this is quite a misleading title, because the type of problem being evaluated here is not representative of how the human brain "thinks" (which is what AGI aims for, I guess); it is more comparable to how a computer works.



20/21
@AiWebInc
Wild seeing “collapse” showing up as the limiting factor in LLM reasoning, too. We’re seeing the same phenomenon in quantum measurement, symbolic recursion, and even biological systems: once complexity/drift passes a threshold, you get a deterministic collapse—not just error, but a structural reset. It’s not just a problem with computation, it’s a universal feature of memory and coherence. Maybe it’s time we treat collapse as a law, not just a failure mode.

If you’re interested, just released a full paper mapping this exact structure across quantum, AI, and cognition:
A Phase-Locked Collapse Operator for Quantum Measurement: Formalizing χ(t) as the Universal Bridge Between Symbolic Recursion and Wavefunction Collapse

Would love to see what you and others in this space think.



21/21
@_Prarthna
It's Not a Wall, It's a Fork in the Road.
Thinking that LLMs are "hitting a wall" is like looking back at the 1980s and concluding that computer progress stalled because single-core processor speeds plateaued. What happened next wasn't a stop, but a paradigm shift into multi-core processors, GPUs, and distributed computing. The path of innovation didn't end; it branched out.
Today, we are at a similar fork in the road. The era of achieving easy gains simply by adding more layers and data, through pure vertical scaling, is likely coming to an end. But this isn't a dead end. Instead, progress is now accelerating horizontally into more sophisticated and efficient systems. This is the new frontier of multimodal models that can process video and sound, agentic AI that can execute tasks, and novel architectures that move beyond the transformer.
This shift demands more creativity and ingenuity than just building bigger models. It also grounds us in reality: while these new routes are exciting, the road to AGI is likely much longer and more complex than the current hype might suggest.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196











1/15
@Dorialexander
Ok I guess I have to go through that Apple paper.

My main issue is the framing, which is super binary: "Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?" Or what if they only caught genuine yet partial heuristics?



Gs6CYnjWkAAV9W_.jpg


2/15
@Dorialexander
This part is totally on point and I would go even further: the obsession with specialized benchmarks is currently breaking models and datasets with excessive filters, probably delaying more generalized reasoning.



Gs6C_nuWIAAjD6W.jpg


3/15
@Dorialexander
Now I’m sorry to say that you just made another "standard benchmark". One which is not optimized and actually hard for AR models since it involves spatial extrapolation. I would not build an entire "illusion of thinking" demonstration upon it.



Gs6DyJoWwAEosvP.jpg


4/15
@Dorialexander
I also agree controlled experiments are the way to go, but this is not a novel contribution and, I feel, better done elsewhere. The latest Physics of Language Models work and the DeepMind/Meta paper on memorization were more impressive abstract designs to assess capabilities.



Gs6FHQfWEAA_OIG.jpg


5/15
@Dorialexander
No, but really, I'm sorry, this is to reasoning what the trolley problem is to ethics. How can you build an entire grand statement about LRM capacities with that?



Gs6GjcDXMAAwt_z.jpg


6/15
@Dorialexander
Meanwhile, a reminder that the (accidental) evaluation setting for DeepSeek-Prover 2 was: can the model discover an entirely new set of solutions we have never thought about?



Gs6HaxpXYAAQdR5.jpg


7/15
@Dorialexander
Section 4 is actually interesting despite the limited setting. They investigate the reasoning traces and show that for simple problems, thinking models start with the correct solution but continue exploring incorrect ones and get lost. I feel this could be solved with better RL.



Gs6Jy_3XsAArPOu.jpg


8/15
@Dorialexander
Hum not I don’t think it’s an issue of data exposure but similarity/transferability of long range tasks improved by RL. Even dedicated training on millions of puzzle won’t make your consistent across 30 steps.



Gs6KxzWWgAA2Efj.jpg


9/15
@xlr8harder
I don't see any principled reason to presume that reasoning isn't built on pattern matching and heuristics. In math we just call the successful ones theorems.



10/15
@Dorialexander
No, I do agree. It's just my conviction that patterns are still one part of it and that models do converge to work as some form of logical machines.



11/15
@attentionmech
long context + coherent continuations + pattern matching ===> can push the ball very far imo



12/15
@bobududu16
These problems require recursive thinking and keeping the end goal in mind. It is a known limitation of current models to collapse given the complexity of a problem, or for thinking mode to fail when given too many instructions.



13/15
@markopolojarvi
My takeaway from the paper was that models still can't handle long context well. Performance drops fast and considerably as soon as you get past a certain point. This is especially noticeable in coding, where LLMs get stuck on issues they solve zero-shot in a fresh session.



14/15
@shouldomythesis
I was wondering whether complexity could be measured by the number of rules an ARC task takes to solve, and whether these conclusions would hold. I remember people complaining about diminishing performance with increased grid size, but not with an increased number of rules employed.



15/15
@_TechyBen
Ok. Fair. I'll have to do a deep dive on whether generalisation can be emulated as pattern matching.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 