1/14
@rohanpaul_ai

The next paper, proving again..
"Frontier LLMs see very sharp performance degradation as problem complexity increases"
LLMs shine on practice problems but freeze when the questions need fresh thinking.
OMEGA builds a controlled test bench that shows exactly where the freeze happens.
OMEGA consists of programmatically generated training–test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods.
The most shocking findings:

A) Collapse under modest complexity: All four top models keep near-perfect accuracy at level 1, but even a two-step jump in difficulty drives accuracy toward 0%.
This cliff appears while each problem still fits comfortably inside the context window, showing that memory limits are not the cause.
B) Self-sabotage through overthinking: Failures stem from overthinking, not knowledge gaps. Models hit the right answer early in CoT, then overcorrect and ditch it. This upends the idea that more CoT yields better results. Self-correction can backfire.
Roughly one-third of wrong answers begin life as correct solutions, then the model talks itself into a mistake during extra verification.
Another two-thirds stay locked in an incorrect reasoning spiral from the first step and never recover.

A thread
2/14
@rohanpaul_ai

The Core Problem
Recent top models post strong Olympiad math scores by reusing the same few tricks, yet they still collapse on puzzles that need a new line of attack.
3/14
@rohanpaul_ai
The picture maps how OMEGA checks three kinds of mental stretch in a language model.
Explorative generalization keeps the same idea but makes the task bigger, like asking for rectangles in a 12-sided shape after training on an 8-sided shape.
Compositional generalization joins two separate skills into one solution, such as finding a greatest common divisor and then feeding that result into a polynomial-root problem.
Transformative generalization forces a fresh viewpoint, turning a brute-force listing job into a smarter counting shortcut.
Each training–test pair isolates one leap so researchers know exactly which step breaks.
This layout shows whether a model can scale, fuse, or rethink its learned tricks instead of only copying patterns it has seen.
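To make the geometry example concrete, here is a minimal sketch (my own illustration, not the paper's generator) of the rectangle-counting task: a brute-force enumeration that works at training scale, next to the counting shortcut C(n/2, 2) that a transformative leap would discover.

```python
from itertools import combinations
from math import comb

def count_rectangles_bruteforce(n):
    """Count rectangles whose 4 corners are vertices of a regular
    n-gon (n even) by checking every 4-vertex subset.

    A quadrilateral inscribed in a circle is a rectangle exactly
    when its diagonals are diameters, i.e. when the four vertices
    split into two antipodal pairs (k and k + n/2)."""
    count = 0
    for quad in combinations(range(n), 4):
        s = set(quad)
        if all((v + n // 2) % n in s for v in quad):
            count += 1
    return count

def count_rectangles_closed_form(n):
    """The shortcut: choose any 2 of the n/2 diameters; each pair
    of diameters spans exactly one rectangle."""
    return comb(n // 2, 2)

# Octagon (seen in training) vs. dodecagon (the scaled-up test):
assert count_rectangles_bruteforce(8) == count_rectangles_closed_form(8) == 6
assert count_rectangles_bruteforce(12) == count_rectangles_closed_form(12) == 15
```

The brute force scales as O(n^4) while the closed form is a one-liner — exactly the kind of "smarter counting shortcut" the transformative axis probes for.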
4/14
@rohanpaul_ai

Benchmark Design
OMEGA supplies program-generated train–test pairs across arithmetic, algebra, combinatorics, number theory, geometry, and logic.
Every problem template carries an adjustable complexity flag, so researchers can turn the difficulty knob without changing the underlying skill being measured.
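A hypothetical sketch of what such a templated generator with a difficulty knob can look like (function and parameter names are mine, not the paper's): the same skill at every level, with only the operand size changing, and the answer verified programmatically.

```python
import math
import random

def make_gcd_problem(level, seed=0):
    """Toy template in the OMEGA spirit: one fixed skill (computing
    a GCD) with an adjustable complexity level, here the digit count
    of the operands. Returns (question, verified_answer)."""
    rng = random.Random(seed)
    lo, hi = 10 ** level, 10 ** (level + 1) - 1  # level 1 -> 2-digit operands
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    question = f"What is gcd({a}, {b})?"
    return question, math.gcd(a, b)  # ground truth computed, not guessed

q1, ans1 = make_gcd_problem(level=1)
q5, ans5 = make_gcd_problem(level=5)  # same skill, much bigger numbers
```

Seeding makes each train–test pair reproducible, and because the answer is computed by the generator itself, grading needs no human in the loop.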
5/14
@rohanpaul_ai
The picture shows how the benchmark builds tasks that force a model to drop an old habit and pick a fresh strategy instead.
Training questions teach one narrow trick, such as listing a few special cases or tracing one polygon.
The linked test question then demands the same answer type but at a scale or format where the old trick breaks, so the model needs a new counting rule or a different geometric view.
This design isolates “transformative generalization,” the hardest leap in the suite, and makes it clear when the model truly thinks outside its training box.
6/14
@rohanpaul_ai

Three Generalization Axes
Exploratory checks if a model can stretch a known method to a tougher variant, such as counting rectangles in a dodecagon after seeing the octagon case.
Compositional asks the model to fuse two mastered skills, like using a greatest common divisor inside a polynomial root task.
Transformative forces the model to drop the usual path and invent a sharper shortcut when the standard routine fails.
7/14
@rohanpaul_ai

Baseline Model Limits
DeepSeek-R1, Claude 3.7, o3-mini, and o4-mini start near perfect on level 1 but slide to near-zero by level 5, even though all prompts fit easily into their context windows.
Many wrong answers begin as correct, then the model talks itself out of the right path or loops in error spirals that burn thousands of tokens.
8/14
@rohanpaul_ai
This figure tracks how language models go wrong on a matrix-rank task as the required steps rise from level 1 to level 6.
Most errors fall into the reasoning-spiral category, meaning the model starts with a faulty plan and never escapes it.
A smaller but steady fraction of errors come from a correct-to-incorrect shift, where the model finds the answer early, then abandons it after extra thinking.
Reasoning spirals dominate at every level, while the share of correct-to-incorrect shifts slides downward as complexity climbs.
These trends support the paper’s claim that more chain-of-thought does not guarantee better answers and can even push the model away from an initial solution.
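A minimal sketch of how this two-way failure taxonomy might be tagged automatically (my own illustration, not the paper's analysis pipeline): given the answers extracted at each step of a CoT trace, a wrong final answer counts as a correct-to-incorrect shift if the gold answer ever appeared, and as a reasoning spiral if it never did.

```python
def tag_failure(intermediate_answers, final_answer, gold):
    """Classify one chain-of-thought trace.
    intermediate_answers: answers extracted at each reasoning step."""
    if final_answer == gold:
        return "correct"
    if gold in intermediate_answers:
        # The model had the right answer, then talked itself out of it.
        return "correct_to_incorrect"
    # Never on the right track: stuck in a faulty plan throughout.
    return "reasoning_spiral"

assert tag_failure([3, 7, 7], 7, 7) == "correct"
assert tag_failure([7, 5, 5], 5, 7) == "correct_to_incorrect"
assert tag_failure([4, 5, 5], 5, 7) == "reasoning_spiral"
```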
9/14
@rohanpaul_ai
The figure tracks six kinds of math tasks as their difficulty rises.
Accuracy falls steadily, yet the models often still hit the right answer part-way through the chain of thought.
They then keep talking, add needless checks, and overturn the correct result.
These extra steps burn many more tokens, and wrong answers end up longer than right ones.
The pattern shows that more thinking time can hurt rather than help, especially on harder questions.
10/14
@rohanpaul_ai

Reinforcement Learning Gains and Ceiling
Fine-tuning Qwen 2.5-7B with reward signals lifts accuracy by about 38 points on familiar problems and keeps some lift on slightly harder ones.
The lift stalls once complexity passes the range seen in training, showing RL strengthens habits rather than broad insight.
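As one hedged illustration of a verifiable reward signal for this kind of RL fine-tuning (my sketch, not the paper's training code): a binary outcome reward that compares the model's last boxed answer against the generator's verified answer.

```python
import re

def outcome_reward(completion: str, gold: str) -> float:
    """Binary outcome reward for RL on verifiable math:
    1.0 if the last \\boxed{...} in the completion matches the
    gold answer string, else 0.0."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not boxed:
        return 0.0
    return 1.0 if boxed[-1].strip() == gold.strip() else 0.0

assert outcome_reward(r"so the answer is \boxed{15}", "15") == 1.0
assert outcome_reward(r"\boxed{6} ... wait, \boxed{8}", "6") == 0.0
```

Because the reward only checks the final answer, it reinforces whatever habit reaches it — consistent with the thread's point that RL strengthens familiar routines rather than broad insight.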
11/14
@rohanpaul_ai

Composing Skills Trouble
After RL, the model still fumbles when it must weave two learned skills into one solution chain; gains hover near zero on GCD-plus-roots style mixes.
12/14
@rohanpaul_ai

Creativity Roadblock
Tasks that need a brand-new idea, such as swapping brute-force for symmetry tricks, stay unsolved after RL, with accuracy pinned at zero.
13/14
@rohanpaul_ai

Takeaways
OMEGA pinpoints three gaps: models crack under extra steps, fail to blend skills, and cannot jump to novel tactics, even after extensive RL.
14/14
@rohanpaul_ai
OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization