1/14
@rohanpaul_ai

The next paper, proving again..
"Frontier LLMs see very sharp performance degradation as problem complexity increases"
LLMs shine on practice problems but freeze when the questions need fresh thinking.
OMEGA builds a controlled test bench that shows exactly where the freeze happens.
OMEGA consists of programmatically generated training–test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods.
The most shocking findings:

A) Collapse under modest complexity: All four top models keep near-perfect accuracy at level 1, but even a two-step jump in difficulty drives accuracy toward 0%.
This cliff appears while each problem still fits comfortably inside the context window, showing that memory limits are not the cause.
B) Self-sabotage through overthinking: Failures stem from overthinking, not knowledge gaps. Models hit the right answer early in CoT, then overcorrect and ditch it. This upends the idea that more CoT yields better results. Self-correction can backfire.
Roughly one-third of wrong answers begin life as correct solutions, then the model talks itself into a mistake during extra verification.
Another two-thirds stay locked in an incorrect reasoning spiral from the first step and never recover.

A thread
2/14
@rohanpaul_ai

The Core Problem
Recent top models post strong Olympiad math scores by reusing the same few tricks, yet they still collapse on puzzles that need a new line of attack.
3/14
@rohanpaul_ai
The picture maps how OMEGA checks three kinds of mental stretch in a language model.
Explorative generalization keeps the same idea but makes the task bigger, like asking for rectangles in a 12-sided shape after training on an 8-sided shape.
Compositional generalization joins two separate skills into one solution, such as finding a greatest common divisor and then feeding that result into a polynomial-root problem.
Transformative generalization forces a fresh viewpoint, turning a brute-force listing job into a smarter counting shortcut.
Each training–test pair isolates one leap so researchers know exactly which step breaks.
This layout shows whether a model can scale, fuse, or rethink its learned tricks instead of only copying patterns it has seen.
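To make the geometry example concrete, here is a minimal sketch (my own illustration, not the paper's generator) of the rectangle-counting task: a brute-force enumeration that works at training scale, next to the counting shortcut C(n/2, 2) that a transformative leap would discover.

```python
from itertools import combinations
from math import comb

def count_rectangles_bruteforce(n):
    """Count rectangles whose 4 corners are vertices of a regular
    n-gon (n even) by checking every 4-vertex subset.

    A quadrilateral inscribed in a circle is a rectangle exactly
    when its diagonals are diameters, i.e. when the four vertices
    split into two antipodal pairs (k and k + n/2)."""
    count = 0
    for quad in combinations(range(n), 4):
        s = set(quad)
        if all((v + n // 2) % n in s for v in quad):
            count += 1
    return count

def count_rectangles_closed_form(n):
    """The shortcut: choose any 2 of the n/2 diameters; each pair
    of diameters spans exactly one rectangle."""
    return comb(n // 2, 2)

# Octagon (seen in training) vs. dodecagon (the scaled-up test):
assert count_rectangles_bruteforce(8) == count_rectangles_closed_form(8) == 6
assert count_rectangles_bruteforce(12) == count_rectangles_closed_form(12) == 15
```

The brute force scales as O(n^4) while the closed form is a one-liner — exactly the kind of "smarter counting shortcut" the transformative axis probes for.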
4/14
@rohanpaul_ai

Benchmark Design
OMEGA supplies program-generated train–test pairs across arithmetic, algebra, combinatorics, number theory, geometry, and logic.
Every problem template carries an adjustable complexity flag, so researchers can turn the difficulty knob without changing the underlying skill being measured.
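A hypothetical sketch of what such a templated generator with a difficulty knob can look like (function and parameter names are mine, not the paper's): the same skill at every level, with only the operand size changing, and the answer verified programmatically.

```python
import math
import random

def make_gcd_problem(level, seed=0):
    """Toy template in the OMEGA spirit: one fixed skill (computing
    a GCD) with an adjustable complexity level, here the digit count
    of the operands. Returns (question, verified_answer)."""
    rng = random.Random(seed)
    lo, hi = 10 ** level, 10 ** (level + 1) - 1  # level 1 -> 2-digit operands
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    question = f"What is gcd({a}, {b})?"
    return question, math.gcd(a, b)  # ground truth computed, not guessed

q1, ans1 = make_gcd_problem(level=1)
q5, ans5 = make_gcd_problem(level=5)  # same skill, much bigger numbers
```

Seeding makes each train–test pair reproducible, and because the answer is computed by the generator itself, grading needs no human in the loop.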
5/14
@rohanpaul_ai
The picture shows how the benchmark builds tasks that force a model to drop an old habit and pick a fresh strategy instead.
Training questions teach one narrow trick, such as listing a few special cases or tracing one polygon.
The linked test question then demands the same answer type but at a scale or format where the old trick breaks, so the model needs a new counting rule or a different geometric view.
This design isolates “transformative generalization,” the hardest leap in the suite, and makes it clear when the model truly thinks outside its training box.
6/14
@rohanpaul_ai

Three Generalization Axes
Exploratory checks if a model can stretch a known method to a tougher variant, such as counting rectangles in a dodecagon after seeing the octagon case.
Compositional asks the model to fuse two mastered skills, like using a greatest common divisor inside a polynomial root task.
Transformative forces the model to drop the usual path and invent a sharper shortcut when the standard routine fails.
7/14
@rohanpaul_ai

Baseline Model Limits
DeepSeek-R1, Claude 3.7, o3-mini, and o4-mini start near perfect on level 1 but slide to near-zero by level 5, even though all prompts fit easily into their context windows.
Many wrong answers begin as correct, then the model talks itself out of the right path or loops in error spirals that burn thousands of tokens.
8/14
@rohanpaul_ai
This figure tracks how language models go wrong on a matrix-rank task as the required steps rise from level 1 to level 6.
Most errors fall into the reasoning-spiral category, meaning the model starts with a faulty plan and never escapes it.
A smaller but steady fraction of errors come from a correct-to-incorrect shift, where the model finds the answer early, then abandons it after extra thinking.
Reasoning spirals dominate at every level, while the share of correct-to-incorrect shifts slides downward as complexity climbs.
These trends support the paper’s claim that more chain-of-thought does not guarantee better answers and can even push the model away from an initial solution.
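A minimal sketch of how this two-way failure taxonomy might be tagged automatically (my own illustration, not the paper's analysis pipeline): given the answers extracted at each step of a CoT trace, a wrong final answer counts as a correct-to-incorrect shift if the gold answer ever appeared, and as a reasoning spiral if it never did.

```python
def tag_failure(intermediate_answers, final_answer, gold):
    """Classify one chain-of-thought trace.
    intermediate_answers: answers extracted at each reasoning step."""
    if final_answer == gold:
        return "correct"
    if gold in intermediate_answers:
        # The model had the right answer, then talked itself out of it.
        return "correct_to_incorrect"
    # Never on the right track: stuck in a faulty plan throughout.
    return "reasoning_spiral"

assert tag_failure([3, 7, 7], 7, 7) == "correct"
assert tag_failure([7, 5, 5], 5, 7) == "correct_to_incorrect"
assert tag_failure([4, 5, 5], 5, 7) == "reasoning_spiral"
```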
9/14
@rohanpaul_ai
The figure tracks six kinds of math tasks as their difficulty rises.
Accuracy falls steadily, yet the models often still hit the right answer part-way through the chain of thought.
They then keep talking, add needless checks, and overturn the correct result.
These extra steps burn many more tokens, and wrong answers end up longer than right ones.
The pattern shows that more thinking time can hurt rather than help, especially on harder questions.
10/14
@rohanpaul_ai

Reinforcement Learning Gains and Ceiling
Fine-tuning Qwen 2.5-7B with reward signals lifts accuracy by about 38 points on familiar problems and keeps some lift on slightly harder ones.
The lift stalls once complexity passes the range seen in training, showing RL strengthens habits rather than broad insight.
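As one hedged illustration of a verifiable reward signal for this kind of RL fine-tuning (my sketch, not the paper's training code): a binary outcome reward that compares the model's last boxed answer against the generator's verified answer.

```python
import re

def outcome_reward(completion: str, gold: str) -> float:
    """Binary outcome reward for RL on verifiable math:
    1.0 if the last \\boxed{...} in the completion matches the
    gold answer string, else 0.0."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not boxed:
        return 0.0
    return 1.0 if boxed[-1].strip() == gold.strip() else 0.0

assert outcome_reward(r"so the answer is \boxed{15}", "15") == 1.0
assert outcome_reward(r"\boxed{6} ... wait, \boxed{8}", "6") == 0.0
```

Because the reward only checks the final answer, it reinforces whatever habit reaches it — consistent with the thread's point that RL strengthens familiar routines rather than broad insight.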
11/14
@rohanpaul_ai

Composing Skills Trouble
After RL, the model still fumbles when it must weave two learned skills into one solution chain; gains hover near zero on GCD-plus-roots style mixes.
12/14
@rohanpaul_ai

Creativity Roadblock
Tasks that need a brand-new idea, such as swapping brute-force for symmetry tricks, stay unsolved after RL, with accuracy pinned at zero.
13/14
@rohanpaul_ai

Takeaways
OMEGA pinpoints three gaps: models crack under extra steps, fail to blend skills, and cannot jump to novel tactics, even after extensive RL.
14/14
@rohanpaul_ai
OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization