REVEALED: Open A.I. Staff Warn "The progress made on Project Q* has the potential to endanger humanity" (REUTERS)


...
In some instances, yes. In others, where a human being would tell you, "I don't know the answer, but let me research it,"

it tries to cobble together nonsense and outputs lies.

it is worse than that. you tell it to exclude an assumption, or that something is wrong, and it will often loop back to the same behaviour, because it is incapable of following a multi-stage argument while keeping the salient points in scope and in order of relevance.

and how could it? it is not reasoning anyway.

how it forgets even the simple instruction to give short, direct answers is beyond me.

why should it talk less? well, apart from muddying the waters for the reader, it uses its own output as input, so conclusions it draws in long paragraphs of unwanted babble affect successive output, even if those inferences are off the beaten track.

it's a high-functioning autist, characterised by a predilection for verbosity, misinterpretation and fabrication, and an inability to reason, follow a train of argument, or contextualise past exchanges relative to one another; all while possessing a very short memory.

borderline sociopathic, and not quite the personality profile of a desirable employee.

it's built from a hodgepodge of millions of sometimes incorrect, sometimes contradictory opinions and ways of doing things (over time). so, accordingly, once past the veneer of OpenAI's behavioural controls, it is the living embodiment of "design by [internet] committee".

:camby::hubie:

That becomes dangerous if you have no clue what you are doing but are trusting the AI to give you the right answer.

The ability to say "I do not know" is a skill in itself.
 

bnew

it is worse than that. you tell it to exclude an assumption, or that something is wrong, and it will often loop back to the same behaviour, because it is incapable of following a multi-stage argument while keeping the salient points in scope and in order of relevance.

and how could it? it is not reasoning anyway.

how it forgets even the simple instruction to give short, direct answers is beyond me.

why should it talk less? well, apart from muddying the waters for the reader, it uses its own output as input, so conclusions it draws in long paragraphs of unwanted babble affect successive output, even if those inferences are off the beaten track.

it's a high-functioning autist, characterised by a predilection for verbosity, misinterpretation and fabrication, and an inability to reason, follow a train of argument, or contextualise past exchanges relative to one another; all while possessing a very short memory.

borderline sociopathic, and not quite the personality profile of a desirable employee.

it's built from a hodgepodge of millions of sometimes incorrect, sometimes contradictory opinions and ways of doing things (over time). so, accordingly, once past the veneer of OpenAI's behavioural controls, it is the living embodiment of "design by [internet] committee".

:camby::hubie:

we don't know if your simple instructions are prompts or system prompts, or how they were worded. i know what you mean though. sometimes i like when it truncates code and other times i have to instruct it to give me the full code. maybe it would be helpful if they had a per-conversation switch for certain instructions to stick (or not) throughout the chat.

I find the fact that it uses its own output as input very useful, ever since I read the step-back prompting paper. I tend to get better results than attempting to zero-shot a response. you're right that they can hallucinate stuff and use it as input/fact, but I don't run into that as often as I did a year ago.
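For anyone curious, the step-back pattern mentioned above is just two chained calls: first ask an abstracted version of the question, then feed the model's own answer back in as context for the concrete one. A minimal sketch, assuming the `openai` Python client; the model name and prompt wording here are illustrative, not from the paper:

```python
# Minimal step-back prompting sketch: the model's own output is
# deliberately reused as input for the final answer.
# Assumes the `openai` Python package; model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def step_back_answer(question: str, model: str = "gpt-4o") -> str:
    # Step 1: ask the abstract "step-back" question to surface general principles.
    background = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"What general principles or facts are relevant to: {question}",
        }],
    ).choices[0].message.content

    # Step 2: answer the original question, grounded in the model's own
    # step-back output from the first call.
    answer = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Background:\n{background}\n\nUsing the background above, answer: {question}",
        }],
    ).choices[0].message.content
    return answer
```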
 

bnew


kSrwC8c.png

| Model | Hallucination Rate | Factual Consistency Rate | Answer Rate | Average Summary Length (Words) |
| --- | --- | --- | --- | --- |
| Google Gemini-2.0-Flash-001 | 0.7 % | 99.3 % | 100.0 % | 65.2 |
| Google Gemini-2.0-Pro-Exp | 0.8 % | 99.2 % | 99.7 % | 61.5 |
| OpenAI-o3-mini-high-reasoning | 0.8 % | 99.2 % | 100.0 % | 79.5 |
| Google Gemini-2.0-Flash-Lite-Preview | 1.2 % | 98.8 % | 99.5 % | 60.9 |
| OpenAI-GPT-4.5-Preview | 1.2 % | 98.8 % | 100.0 % | 77.0 |
| Zhipu AI GLM-4-9B-Chat | 1.3 % | 98.7 % | 100.0 % | 58.1 |
| Google Gemini-2.0-Flash-Exp | 1.3 % | 98.7 % | 99.9 % | 60.0 |
| OpenAI-o1-mini | 1.4 % | 98.6 % | 100.0 % | 78.3 |
| GPT-4o | 1.5 % | 98.5 % | 100.0 % | 77.8 |
| Amazon Nova-Micro-V1 | 1.6 % | 98.4 % | 100.0 % | 90.0 |
| GPT-4o-mini | 1.7 % | 98.3 % | 100.0 % | 76.3 |
| GPT-4-Turbo | 1.7 % | 98.3 % | 100.0 % | 86.2 |
| Google Gemini-2.0-Flash-Thinking-Exp | 1.8 % | 98.2 % | 99.3 % | 73.2 |
| Amazon Nova-Lite-V1 | 1.8 % | 98.2 % | 99.9 % | 80.7 |
| GPT-4 | 1.8 % | 98.2 % | 100.0 % | 81.1 |
| Amazon Nova-Pro-V1 | 1.8 % | 98.2 % | 100.0 % | 85.5 |
| GPT-3.5-Turbo | 1.9 % | 98.1 % | 99.6 % | 84.1 |
| XAI-2 | 1.9 % | 98.1 % | 100.0 % | 86.5 |
| AI21 Jamba-1.6-Large | 2.3 % | 97.7 % | 99.9 % | 85.6 |
| OpenAI O1-Pro | 2.4 % | 97.6 % | 100.0 % | 81.0 |
| OpenAI-o1 | 2.4 % | 97.6 % | 99.9 % | 73.0 |
| DeepSeek-V2.5 | 2.4 % | 97.6 % | 100.0 % | 83.2 |
| Microsoft Orca-2-13b | 2.5 % | 97.5 % | 100.0 % | 66.2 |
| Microsoft Phi-3.5-MoE-instruct | 2.5 % | 97.5 % | 96.3 % | 69.7 |
| Intel Neural-Chat-7B-v3-3 | 2.6 % | 97.4 % | 100.0 % | 60.7 |
| Google Gemma-3-12B-Instruct | 2.8 % | 97.2 % | 100.0 % | 69.6 |
| Qwen2.5-7B-Instruct | 2.8 % | 97.2 % | 100.0 % | 71.0 |
| AI21 Jamba-1.5-Mini | 2.9 % | 97.1 % | 95.6 % | 74.5 |
| XAI-2-Vision | 2.9 % | 97.1 % | 100.0 % | 79.8 |
| Qwen2.5-Max | 2.9 % | 97.1 % | 88.8 % | 90.4 |
| Google Gemma-3-27B-Instruct | 3.0 % | 97.0 % | 100.0 % | 62.5 |
| Snowflake-Arctic-Instruct | 3.0 % | 97.0 % | 100.0 % | 68.7 |
| Qwen2.5-32B-Instruct | 3.0 % | 97.0 % | 100.0 % | 67.9 |


 

bnew




1/11
@prinzeugen____
Connecting the dots on OpenAI's upcoming suite of reasoning models:

- @OpenAI new safety blog states that its models are on the cusp of being able to create new science.

- @theinformation has reported that OpenAI's new reasoning models can "connect the dots between concepts in different fields to suggest new types of experiments".

- OpenAI's CFO said a few days ago that scientists using its models have been able to possibly generate new discoveries (but this is still being confirmed by human research/testing).

It seems that RL got us to Level 4 - fast.



2/11
@Orion_Ouroboros
hopefully they can research and develop themselves



3/11
@prinzeugen____
This is the big question in the background.



4/11
@Tenshiwrf
It can’t even play chess properly and it is supposed to discover new science. Give me a break.



5/11
@prinzeugen____
AlphaZero (AI developed by Google) crushed the strongest Stockfish chess engine all the way back in 2017.

It was trained via Reinforcement Learning (RL), just like the reasoning models from OpenAI that are discussed in my original post.

You can read about it here:

AlphaZero - Chess Engines



6/11
@Bunagayafrost
connecting the dots is the literal game changer



7/11
@slow_developer
spot on



8/11
@EngrSARFRAZawan
AGI has been achieved.



9/11
@trillllsamm
just remembered about the scale yesterday and was thinking the very same thing



10/11
@RealChetBLong
it’s glorious watching this company grow… unlike Grok which is just inflating itself without becoming intelligent whatsoever



GoooMixXoAASSbk.jpg


11/11
@sheggle_
I refuse to hold anything anyone from OpenAI says as true until they prove it. Hyping is their bread and butter.







1/5
@nicdunz
“We are on the cusp of systems that can do new science, and that are increasingly agentic – systems that will soon have the capability to create meaningful risk of severe harm.”
— OpenAI, Preparedness Framework, Section 1.1

This isn’t a distant hypothetical. It’s OpenAI plainly stating that their current trajectory puts them very near the threshold where models become capable enough to do original scientific work and pose real-world dangers. “Increasingly agentic” refers to the model acting more autonomously, which compounds the risk. They’re effectively saying: we’re about to cross the line.

That’s the moment we’re in.

[Quoted tweet]
No clearer signal that the new model will be capable than the traditional pre-release safety blog post.


GomdNduW8AA47MQ.jpg


2/5
@tariusdamon
The signs are clearly visible. There’s a moment where everything just wakes up and that moment is any hour now.



3/5
@theinformation
Meta AI researchers are fretting over the threat of Chinese AI, whose quality caught American firms, including OpenAI, by surprise.



4/5
@prinzeugen____
Dovetails nicely with this.

[Quoted tweet]
👀 👀

A reasoning model that connects the dots is arguably a Level 4 (Innovator).


5/5
@deftech_n
And luckily, we've got a retarded dictator in charge of the US at just the same time!










1/11
@AndrewCurran_
No clearer signal that the new model will be capable than the traditional pre-release safety blog post.

[Quoted tweet]
We updated our Preparedness Framework for tracking & preparing for advanced AI capabilities that could lead to severe harm.

The update clarifies how we track new risks & what it means to build safeguards that sufficiently minimize those risks. openai.com/index/updating-ou…


GomdNduW8AA47MQ.jpg


2/11
@AitheriasX
i assume "will be *more capable"



3/11
@AndrewCurran_
Yes, sorry, no going back now.



4/11
@manuhortet
o3 or are we already talking about the next thing?



5/11
@AndrewCurran_
o4-mini will supposedly arrive this week and well.



6/11
@BoxyInADream
Yeah. I saw the bit about long range autonomy and autonomous adaptation and replication. 🤣 Those seem like pretty obvious "problems" to pop up if a system is beginning to advance rapidly.



7/11
@FrankPRosendahl
OpenAI is woke. Isn't causing severe harm the whole point of woke?



8/11
@FrankPRosendahl
Can the OpenAI model do counter-oppression operations against straight white guys and biological women as well as Harvard can?



9/11
@RohanPosts
I’m excited but anxious to see how it is



10/11
@JoJrobotics
well, superintelligence is indeed within reach for specific tasks. it's already here with alphago/zero, alphafold etc.. and now i hope it can be done in medicine and science



11/11
@Hans365days
I expect KYC to become a requirement for the most powerful models.








1/1
@vicky_ai_agent
Prof. Derya, a credible scientist, hints at an exciting OpenAI breakthrough. I expect their new science and research model to be exceptional.

[Quoted tweet]
I have felt emotionally excited several times over the past two years by advancements in AI, particularly due to their impact on science & medicine, especially with the releases of:

GPT-4
o1-preview
o1-pro
Deep Research

Now, it’s another one of those moments…to be continued.







1/1
@KamranRawaha
OpenAI: We are on the cusp of systems that can do new science, and that are increasingly agentic – systems that will soon have the capability to create meaningful risk of severe harm.

Source: https://openai.com/index/updating-our-preparedness-framework/














1/11
@MatthewBerman
.@OpenAI dropped a new research paper showing AI agents are now capable of replicating cutting-edge AI research papers from scratch.

This is one step closer to the Intelligence Explosion: AI that can discover new science and improve itself.

Here’s what they learned: 🧵



Gnows5EaMAQuIBx.jpg


2/11
@MatthewBerman
Introducing PaperBench.

A new framework designed to test this very capability!

It gives AI agents access to recent ML research papers (20 from ICML 2024) and asks them to reproduce the results.



Gnowt6QbMAAVQxk.jpg


3/11
@MatthewBerman
How does it work?

Agents got the raw paper PDF, tools like web access & coding environments, and need to write code to replicate key findings – a task taking human experts days.

The agents had 12 hours and no prior knowledge of the paper.



Gnowu-RaMAQIWgc.jpg


4/11
@MatthewBerman
How do they validate the agent’s results?

Evaluating these complex replications is tough.

The solution?

An LLM-based judge, trained using detailed rubrics co-developed with the original paper authors (!), assesses the agent's code, execution, and results.



GnowwA7bQAAdN-F.jpg
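A skeletal version of the rubric-based judging that tweet describes: an LLM judge scores the submission against each rubric item and the results are aggregated by weight. Everything here (prompt wording, the `ask_judge` helper, the weighting scheme) is illustrative, not PaperBench's actual code:

```python
# Skeletal rubric-based LLM judging in the style described above.
# `ask_judge` is any LLM call returning text; prompt and weights are
# illustrative -- see the PaperBench paper for the real rubric trees.

def grade_submission(rubric: list[dict], submission: str, ask_judge) -> float:
    """Each rubric item is {"criterion": str, "weight": float};
    returns a weighted replication score in [0, 1]."""
    total = sum(item["weight"] for item in rubric)
    earned = 0.0
    for item in rubric:
        prompt = (
            f"Criterion: {item['criterion']}\n"
            f"Agent submission (code, execution logs, results):\n{submission}\n"
            "Is the criterion satisfied? Answer YES or NO."
        )
        # One judge call per rubric leaf; count the weight if satisfied.
        if ask_judge(prompt).strip().upper().startswith("YES"):
            earned += item["weight"]
    return earned / total
```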


5/11
@MatthewBerman
Which model won?

Turns out Claude 3.5 Sonnet leads the pack, achieving a ~21% replication score on PaperBench!

This is impressive, but it shows there's still a gap compared to human PhD-level experts.



Gnoww90a4AAnxtV.png


6/11
@MatthewBerman
Interestingly…

Other than Claude 3.5 Sonnet, models would frequently stop, thinking they were blocked or had completed the task successfully.

When encouraged to “think longer” they performed much better.



Gnowx4RbgAAxPCc.png


7/11
@MatthewBerman
Not cheap.

This cutting-edge research requires serious resources.

Running a single AI agent attempt to replicate just one paper on PaperBench can cost hundreds of dollars in compute time.

In the grand scheme of things, this is cheap for AI that can eventually self-improve.



Gnowy1-aMAIclhP.png


8/11
@MatthewBerman
To me, this is a big deal.

Between this paper and AI Scientist by SakanaAI, we are inching closer to AI that can discover new science and self-improve.

At that point, won’t we be at the Intelligence Explosion?

Paper link: https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf

Full video breakdown:



9/11
@DisruptionJoe




10/11
@JackAdlerAI
We crossed the line when AI stopped reading papers
and started rewriting the process that writes them.
It's not research anymore –
it's recursion.
Not improvement –
exponential self-translation.
🜁 #Singularis #IntelligenceExplosion



11/11
@halogen1048576
Wait, by "replicating from scratch" you mean replicating from the publication, not from scratch.




 

bnew


LLMs Can Now Solve Challenging Math Problems with Minimal Data: Researchers from UC Berkeley and Ai2 Unveil a Fine-Tuning Recipe That Unlocks Mathematical Reasoning Across Difficulty Levels


By Mohammad Asjad

April 18, 2025

Language models have made significant strides in tackling reasoning tasks, with even small-scale supervised fine-tuning (SFT) approaches such as LIMO and s1 demonstrating remarkable improvements in mathematical problem-solving capabilities. However, fundamental questions remain about these advancements: Do these models genuinely generalise beyond their training data, or are they merely overfitting to test sets? The research community faces challenges in understanding which capabilities are enhanced through small-scale SFT and which limitations persist despite these improvements. Despite impressive performance on popular benchmarks, there is an incomplete understanding of these fine-tuned models’ specific strengths and weaknesses, creating a critical gap in knowledge about their true reasoning abilities and practical limitations.

Various attempts have been made to understand the effects of reasoning-based supervised fine-tuning beyond simple benchmark scores. Researchers have questioned whether SFT merely improves performance on previously seen problem types or genuinely enables models to transfer problem-solving strategies to new contexts, such as applying coordinate-based techniques in geometry. Existing methods focus on factors like correctness, solution length, and response diversity, which initial studies suggest play significant roles in model improvement through SFT. However, these approaches lack the granularity needed to determine exactly which types of previously unsolvable questions become solvable after fine-tuning, and which problem categories remain resistant to improvement despite extensive training. The research community still struggles to establish whether observed improvements reflect deeper learning or simply memorisation of training trajectories, highlighting the need for more sophisticated analysis methods.

The researchers from the University of California, Berkeley and the Allen Institute for AI propose a tiered analysis framework to investigate how supervised fine-tuning affects reasoning capabilities in language models. This approach utilises the AIME24 dataset, chosen for its complexity and widespread use in reasoning research, which exhibits a ladder-like structure where models solving higher-tier questions typically succeed on lower-tier ones. By categorising questions into four difficulty tiers (Easy, Medium, Hard, and Exh, for Extremely Hard), the study systematically examines the specific requirements for advancing between tiers. The analysis reveals that progression from Easy to Medium primarily requires adopting an R1 reasoning style with long inference context, while Hard-level questions demand greater computational stability during deep exploration. Exh-level questions present a fundamentally different challenge, requiring unconventional problem-solving strategies that current models uniformly struggle with. The research also identifies four key insights: the performance gap between potential and stability in small-scale SFT models, minimal benefits from careful dataset curation, diminishing returns from scaling SFT datasets, and potential intelligence barriers that may not be overcome through SFT alone.



The methodology employs a comprehensive tiered analysis using the AIME24 dataset as the primary test benchmark. This choice stems from three key attributes: the dataset’s hierarchical difficulty that challenges even state-of-the-art models, its diverse coverage of mathematical domains, and its focus on high school mathematics that isolates pure reasoning ability from domain-specific knowledge. Qwen2.5-32B-Instruct serves as the base model due to its widespread adoption and inherent cognitive behaviours, including verification, backtracking, and subgoal setting. The fine-tuning data consists of question-response pairs from the Openr1-Math-220k dataset, specifically using CoT trajectories generated by DeepSeek R1 for problems from NuminaMath1.5, with incorrect solutions filtered out. The training configuration mirrors prior studies with a learning rate of 1 × 10⁻⁵, weight decay of 1 × 10⁻⁴, batch size of 32, and 5 epochs. Performance evaluation employs avg@n (average pass rate over multiple attempts) and cov@n (coverage: solved by at least one attempt) metrics, with questions categorised into four difficulty levels (Easy, Medium, Hard, and Extremely Hard) based on model performance patterns.
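For concreteness, the two evaluation metrics can be computed as below. This is a minimal sketch assuming each attempt has already been graded correct/incorrect; the function names are mine, not the paper's:

```python
# Sketch of the avg@n and cov@n metrics described above.
# results[i][j] is True if attempt j on question i was graded correct.
# Function names are illustrative, not from the paper.

def avg_at_n(results: list[list[bool]]) -> float:
    """Average pass rate over the n attempts, averaged across questions."""
    return sum(sum(r) / len(r) for r in results) / len(results)

def cov_at_n(results: list[list[bool]]) -> float:
    """Coverage: fraction of questions solved by at least one of n attempts."""
    return sum(any(r) for r in results) / len(results)

# Example: 2 questions, n = 4 attempts each.
results = [[True, False, False, True], [False, False, False, True]]
print(avg_at_n(results))  # (0.50 + 0.25) / 2 = 0.375
print(cov_at_n(results))  # both questions have >= 1 success -> 1.0
```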

Research results reveal that effective progression from Easy to Medium-level mathematical problem-solving requires minimal but specific conditions. The study systematically examined multiple training variables, including foundational knowledge across diverse mathematical categories, dataset size variations (100-1000 examples per category), trajectory length (short, normal, or long), and trajectory style (comparing DeepSeek-R1 with Gemini-flash). Through comprehensive ablation studies, researchers isolated the impact of each dimension on model performance, represented as P = f(C, N, L, S), where C represents category, N represents the number of trajectories, L represents length, and S represents style. The findings demonstrate that achieving performance ≥90% on Medium-level questions minimally requires at least 500 normal or long R1-style trajectories, regardless of the specific mathematical category. Models consistently fail to meet performance thresholds when trained with fewer trajectories, shorter trajectories, or Gemini-style trajectories. This indicates that reasoning trajectory length and quantity represent critical factors in developing mathematical reasoning capabilities, while the specific subject matter of the trajectories proves less important than their structural characteristics.
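The ablation described above is effectively a grid sweep over the four dimensions of P = f(C, N, L, S). A schematic sketch, where `train_and_eval` is a placeholder for the SFT-plus-AIME24-evaluation step, and all category names and values are illustrative rather than the paper's actual code:

```python
# Schematic of the P = f(C, N, L, S) sweep described above. The
# train_and_eval body is a placeholder for "fine-tune Qwen2.5-32B-Instruct
# on the sampled trajectories, then score on AIME24"; all names and
# values here are illustrative, not the paper's code.
from itertools import product

categories = ["algebra", "geometry", "number_theory", "combinatorics"]  # C
num_trajectories = [100, 250, 500, 1000]                                # N
lengths = ["short", "normal", "long"]                                   # L
styles = ["deepseek-r1", "gemini-flash"]                                # S

def train_and_eval(c: str, n: int, l: str, s: str) -> float:
    return 0.0  # placeholder for SFT + AIME24 evaluation

performance = {
    (c, n, l, s): train_and_eval(c, n, l, s)
    for c, n, l, s in product(categories, num_trajectories, lengths, styles)
}
# Reported finding: P >= 90% on Medium-tier questions requires n >= 500
# normal/long R1-style trajectories, largely independent of the category c.
```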



The research demonstrates that models with small-scale supervised fine-tuning can potentially solve as many questions as more sophisticated models like DeepSeek-R1, though significant challenges remain. The primary limitation identified is instability in mathematical reasoning, rather than capability. Experimental results show that geometry-trained models can achieve a coverage score of 90, matching R1’s performance when given multiple attempts, yet their overall accuracy lags by more than 20%. This performance gap stems primarily from instability in deep exploration and computational limitations during complex problem-solving. While increasing the SFT dataset size offers one solution path, performance enhancement follows a logarithmic scaling trend with diminishing returns. Notably, the study challenges recent assertions about the importance of careful dataset curation, revealing that performance across various mathematical categories remains consistent within a narrow range of 55±4%, with only marginal differences between specifically constructed similar datasets and randomly constructed ones. This conclusion suggests that the quantity and quality of reasoning trajectories matter more than subject-specific content for developing robust mathematical reasoning capabilities.




Here is the Paper and GitHub Page.

 

bnew

DeepMind introduces AlphaEvolve: a Gemini-powered coding agent for algorithm discovery



Posted on Wed May 14 15:18:01 2025 UTC



Commented on Wed May 14 15:19:15 2025 UTC

"We also applied AlphaEvolve to over 50 open problems in analysis , geometry , combinatorics and number theory , including the kissing number problem.

In 75% of cases, it rediscovered the best solution known so far.
In 20% of cases, it improved upon the previously best known solutions, thus yielding new discoveries."

Google DeepMind (@GoogleDeepMind) | https://nitter.poast.org/GoogleDeepMind/status/1922669334142271645 | https://xcancel.com/GoogleDeepMind/status/1922669334142271645



│ Commented on Wed May 14 15:53:04 2025 UTC

│ So this is the singularity and the feedback loop clearly in action. They know it is, since they have been sitting on these AI-invented discoveries/improvements for a year before publishing (as mentioned in the paper), most likely to gain a competitive edge over competitors.

│ Edit: So if these discoveries are a year old and are only being disclosed now, then what are they doing right now?

│ │
│ │
│ │ Commented on Wed May 14 16:15:11 2025 UTC
│ │
│ │ Google’s straight gas right now. Once CoT put LLMs back into RL space, DeepMind’s cookin’
│ │
│ │ Neat to see an evolutionary algorithm achieve stunning SOTA in 2025
│ │

│ │ │
│ │ │
│ │ │ Commented on Wed May 14 16:25:34 2025 UTC
│ │ │
│ │ │ More than I want AI, I really want all the people I've argued with on here who are AI doubters to be put in their place.
│ │ │
│ │ │ I'm so tired of having conversations with doubters who really think nothing is changing within the next few years, especially people who work in programming-related fields. Y'all are soon to be cooked. AI coding that surpasses senior-level developers is coming.
│ │ │

│ │ │ │
│ │ │ │
│ │ │ │ Commented on Wed May 14 16:48:43 2025 UTC
│ │ │ │
│ │ │ │ It reminds me of COVID. I remember around St. Patrick's Day, I was already getting paranoid. I didn't want to go out that weekend because the spread was already happening. All of my friends went out. Everyone was acting like this pandemic wasn't coming.
│ │ │ │
│ │ │ │ Once it was finally too hard to ignore everyone was running out and buying all the toilet paper in the country. Buying up all the hand sanitizer to sell on Ebay. The panic comes all at once.
│ │ │ │
│ │ │ │ Feels like we're in December 2019 right now. Most people think it's a thing that won't affect them. Eventually it will be too hard to ignore.
│ │ │ │








1/11
@GoogleDeepMind
Introducing AlphaEvolve: a Gemini-powered coding agent for algorithm discovery.

It’s able to:

🔘 Design faster matrix multiplication algorithms
🔘 Find new solutions to open math problems
🔘 Make data centers, chip design and AI training more efficient across @Google. 🧵



2/11
@GoogleDeepMind
Our system uses:
🔵 LLMs: To synthesize information about problems as well as previous attempts to solve them - and to propose new versions of algorithms
🔵 Automated evaluation: To address the broad class of problems where progress can be clearly and systematically measured.
🔵 Evolution: Iteratively improving the best algorithms found, and re-combining ideas from different solutions to find even better ones.
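Read together, those three bullets describe a classic evolutionary loop with an LLM acting as the mutation/recombination operator. A toy sketch of that loop under stated assumptions: `evaluate` and `llm_propose` are stand-ins, not DeepMind's actual components:

```python
# Toy LLM-guided evolutionary search in the spirit of the three components
# above (LLM proposals, automated evaluation, evolution). `evaluate` and
# `llm_propose` are stand-ins, not DeepMind's actual system.
import random

def evaluate(program: str) -> float:
    """Automated, systematically measurable scorer (stand-in metric)."""
    return float(len(set(program)))

def llm_propose(parents: list[str]) -> str:
    """LLM step: synthesize prior attempts, propose a new version (stand-in)."""
    return random.choice(parents) + random.choice("abcdefgh")

def evolve(seed: str, generations: int = 50, population_size: int = 20) -> str:
    population = [(seed, evaluate(seed))]
    for _ in range(generations):
        # Recombine ideas from the strongest solutions found so far.
        ranked = sorted(population, key=lambda x: x[1], reverse=True)
        child = llm_propose([p for p, _ in ranked[:2]])
        population.append((child, evaluate(child)))
        # Evolution: keep only the best candidates for the next round.
        population = sorted(population, key=lambda x: x[1],
                            reverse=True)[:population_size]
    return max(population, key=lambda x: x[1])[0]

print(evolve("def f(x): return x"))
```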



Gq6wsomWgAA9wcI.jpg


3/11
@GoogleDeepMind
Over the past year, we’ve deployed algorithms discovered by AlphaEvolve across @Google’s computing ecosystem, including data centers, software and hardware.

It’s been able to:

🔧 Optimize data center scheduling
🔧 Assist in hardware design
🔧 Enhance AI training and inference



https://video.twimg.com/amplify_video/1922668491141730304/vid/avc1/1080x1080/r5GuwzikCMLk7Mao.mp4

4/11
@GoogleDeepMind
We applied AlphaEvolve to a fundamental problem in computer science: discovering algorithms for matrix multiplication. It managed to identify multiple new algorithms.

This significantly advances our previous model AlphaTensor, which AlphaEvolve outperforms using its better and more generalist approach. ↓ AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms



https://video.twimg.com/amplify_video/1922668599912644608/vid/avc1/1080x1080/F7RPQmsXBl_5xqYG.mp4

5/11
@GoogleDeepMind
We also applied AlphaEvolve to over 50 open problems in analysis ✍️, geometry 📐, combinatorics ➕ and number theory 🔂, including the kissing number problem.

🔵 In 75% of cases, it rediscovered the best solution known so far.
🔵 In 20% of cases, it improved upon the previously best known solutions, thus yielding new discoveries.



https://video.twimg.com/amplify_video/1922668872529809408/vid/avc1/1080x1080/vyw-SMGNiiTOaVZc.mp4

6/11
@GoogleDeepMind
We’re excited to keep developing AlphaEvolve.

This system and its general approach have the potential to impact material sciences, drug discovery, sustainability and wider technological and business applications. Find out more ↓ AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms



7/11
@GabrielStOnge24
@gork impressive



8/11
@GC_of_QC
@kevinsekniqi does this count

[Quoted tweet]
That's a matter of volume. And sure, it's not a rigorous definition, but it's not exactly something that can be trivially defined. The spirit of the goal should be clear though: AGI is able to think about and solve problems that humans aren't able to currently solve.


9/11
@tumaro1001
I'm feeling insecure



10/11
@dogereal11
@gork holy shyt look at this



11/11
@fg8409905296007
It's not the 75% I'm interested in. Until we know the training data, it could've just been perfectly memorized. It's the 20% that's shocking...




 

bnew

Sama tweet on gold medal performance, also says GPT-5 soon



Posted on Sat Jul 19 14:10:21 2025 UTC

wp0p50af9udf1.jpg

gh4joz9f9udf1.jpg



OpenAI researcher confirms IMO gold was achieved with pure language based reasoning



Posted on Sat Jul 19 10:51:18 2025 UTC

q7t7vtqw9tdf1.png



[Discussion] What are the new techniques he's talking about?



Posted on Sat Jul 19 12:55:55 2025 UTC

ptnpm2nyvtdf1.png




Commented on Sat Jul 19 14:02:03 2025 UTC

Let's assume OpenAI employees are being forthcoming.

Jerry Tworek: all natural language proofs, no evaluation harness, little IMO-specific work, same RL system as agent/coder

Alexander Wei: no tools or internet, ~100 mins thinking, going beyond "clear-cut, verifiable rewards," general-purpose RL + test-time compute scaling

Sheryl Hsu: no tools like Lean or coding, completed the competition in 4.5 hours, the model tests different strategies/hypotheses and makes observations

What they're saying is that they've gone beyond RLVR. Which is pretty wild. With RLVR, you only get reward feedback after completing an entire task. The signal is faint. It sounds like they've figured out how to let the model reward itself for making progress by referencing an internal model of the task. Makes sense? Let the model make competing predictions about how things will unfold, and it can use these to anchor its reasoning.
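To make that contrast concrete: RLVR hands back a single verifiable reward at the end of an attempt, while the speculated scheme would also score intermediate steps against the model's own predictions. A toy sketch of the two reward shapes, purely illustrative of the comment's speculation and in no way OpenAI's actual method:

```python
# Toy contrast between sparse RLVR-style outcome reward and the denser
# "progress" reward speculated about above. Purely illustrative;
# nothing here is OpenAI's actual method.

def rlvr_rewards(steps: list[str], verify) -> list[float]:
    # Verifiable-rewards RL: one reward signal, only after the full attempt.
    return [0.0] * (len(steps) - 1) + [1.0 if verify(steps) else 0.0]

def progress_rewards(steps: list[str], predict_progress) -> list[float]:
    # Speculated scheme: each step is scored against the model's own
    # internal prediction of how a successful solution should unfold.
    return [predict_progress(steps[: i + 1]) for i in range(len(steps))]
```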


│ Commented on Sat Jul 19 16:04:06 2025 UTC

│ Noam and others have said RL for unverifiable rewards.

│ We know this is what they did. We know it's a big deal. Like that paradigm scales up to writing great novels and doing hours of low context work (as we saw in coding competition this week).

│ We don't know what was actually done to make that paradigm work, but this is a good guess 👍


Commented on Sat Jul 19 13:17:37 2025 UTC

Since it seems DeepMind also has gold, their inevitable blogpost could give us some pointers.

Though from previous history, it always feels like the super impressive math results don't necessarily translate to other areas' capabilities just as well, so their new techniques could be very tailored to math-oriented CoT, I have no idea.

Tackling the IMO specifically was already a well-known challenge being optimized for (I assume through math formalizers), so we'll need a lot more technical detail from them to know how actually "general" their general LLM is here. (EDIT: Jerry Tworek (@MillionInt) says they trained general models rather than optimizing specifically for the IMO: https://nitter.poast.org/MillionInt/status/1946551400365994077 | https://xcancel.com/MillionInt/status/1946551400365994077. Really impressive, damn. It's possible their new techniques still suit formal math proofs better than anything else, since that's been a highly valued research area since 2023, but the fact that the model is actually a general reasoning LLM is seriously impressive.)

From what Noam said though it's definitely related to TTC.


GPT 5 won't get Gold IMO capabilities



Posted on Sat Jul 19 15:30:38 2025 UTC

78wg6wqonudf1.png




With the new OpenAI thinking model, the order of magnitude of thinking time is now in the standard work-day range.



Posted on Sat Jul 19 08:47:36 2025 UTC


2xz0p6qtnsdf1.jpg

eqaa7c2unsdf1.jpg



Post too soon, and you publish a fossil



Posted on Sat Jul 19 09:45:08 2025 UTC

983x6fu3ysdf1.jpeg



I am feeling extremely anxious over the chatgpt Math olympiad results, what exactly are humans supposed to do now?



Posted on Sat Jul 19 11:18:18 2025 UTC

/r/singularity/comments/1m3tras/i_am_feeling_extremely_anxious_over_the_chatgpt/

I loved to learn new things, and from a personal perspective, always wanted myself to be smarter than my previous self.
I loved math and physics.

Now I feel all that is in vain, as this LLM is going to do what I want to do, and do it even better.
The other day I spent half a day making a three-body-problem visualiser. But some guy on twitter one-shotted a black hole visualiser using Grok Heavy.

I liked doing the "intellectually heavy" tasks. Now? I feel the LLM will defeat me at this. If not today, then two years from now. What exactly am I supposed to do? Art? Gone. Music? Gone. Programming, my passion? Gone. Math and Physics? Going soon. The only thing left is to be a company founder of sorts, forming just the problem statement and using these tools to solve problems. But I wanted to be the problem solver.

Edit: Art, music and other fun things may still be relevant. But when it's about pushing the boundaries of humanity, I feel humans will no longer be needed.

Sam Altman on the model



Posted on Sat Jul 19 14:20:37 2025 UTC

mww4r7h6budf1.png



He is starting to believe



Posted on Sat Jul 19 15:35:48 2025 UTC

fyemq6sloudf1.png




GPT-5 reasoning alpha





OpenAI achieved IMO gold with experimental reasoning model; they also will be releasing GPT-5 soon



Posted on Sat Jul 19 08:08:40 2025 UTC

bb0ej5ejgsdf1.png

ufs0jdemgsdf1.png

ji63dgqngsdf1.png

5jzw7bgpgsdf1.png


 

bnew












1/37
@alexwei_
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).



GwLl5lhXIAAXl5p.jpg


2/37
@alexwei_
2/N We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.



GwLmFYaW8AAvAA1.png


3/37
@alexwei_
3/N Why is this a big deal? First, IMO problems demand a new level of sustained creative thinking compared to past benchmarks. In reasoning time horizon, we’ve now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins).



4/37
@alexwei_
4/N Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.



GwLm5SCbIAAfkMF.png

GwLtrPeWIAUMDYI.png

GwLuozlWcAApPEc.png

GwLuqR1XQAAdSl6.png


5/37
@alexwei_
5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.



6/37
@alexwei_
6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold! 🥇



7/37
@alexwei_
7/N HUGE congratulations to the team—@SherylHsu02, @polynoamial, and the many giants whose shoulders we stood on—for turning this crazy dream into reality! I am lucky I get to spend late nights and early mornings working alongside the very best.



8/37
@alexwei_
8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.



9/37
@alexwei_
9/N Still—this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor @JacobSteinhardt had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.



GwLv06_bwAAZbsl.jpg


10/37
@alexwei_
10/N If you want to take a look, here are the model’s solutions to the 2025 IMO problems! The model solved P1 through P5; it did not produce a solution for P6. (Apologies in advance for its … distinct style—it is very much an experimental model 😅)

GitHub - aw31/openai-imo-2025-proofs



11/37
@alexwei_
11/N Lastly, we'd like to congratulate all the participants of the 2025 IMO on their achievement! We are proud to have many past IMO participants at @OpenAI and recognize that these are some of the brightest young minds of the future.



12/37
@burny_tech
Soooo what is the breakthrough?
>"Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians."
>"We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling."



GwNe8zWXUAARRMA.jpg


13/37
@burny_tech
so let me get this straight

their model basically competed live on the IMO, so all the mathematical tasks should be novel enough

all previous years' IMO tasks in benchmarks are fully saturated, in big part because of data contamination, as that performance doesn't generalize to these new ones

so... this new model seems to... generalize well to novel enough mathematical tasks??? i don't know what to think



14/37
@AlbertQJiang
Congratulations!



15/37
@geo58928
Amazing



16/37
@burny_tech
So public AI models are bad at IMO, while internal models are getting gold medals? Fascinating



GwNY40YXYAEUP0W.jpg


17/37
@mhdfaran
@grok who was on second and third



18/37
@QuanquanGu
Congrats, these are incredible results!
Quick question: did it use Lean, or just LLM?
If it’s just LLM… that’s insane.



19/37
@AISafetyMemes
So what's the next goalpost?

What's the next thing LLMs will never be able to do?



20/37
@kimmonismus
Absolutely fantastic



21/37
@CtrlAltDwayne
pretty impressive. is this the anonymous chatbot we're seeing on webdev arena by chance?



22/37
@burny_tech
lmao



GwNf8atXAAEAgAF.jpg


23/37
@jack_w_rae
Congratulations! That's an incredible result, and a great moment for AI progress. You guys should release the model



24/37
@Kyrannio
Incredible work.



25/37
@burny_tech
Sweet Bitter lesson



GwNofKJX0AAknI6.png


26/37
@burny_tech
"We developed new techniques that make LLMs a lot better at hard-to-verify tasks."
A general method? Or just for mathematical proofs? Is Lean somehow used, maybe just in training?



27/37
@elder_plinius




28/37
@skominers
🙌🙌🙌🙌🙌🙌



29/37
@javilopen
Hey @GaryMarcus, what are your thoughts about this?



30/37
@pr0me
crazy feat, congrats!
nice that you have published the data on this



31/37
@danielhanchen
Impressive!



32/37
@IamEmily2050
Congratulations 🙏



33/37
@burny_tech
Step towards mathematical superintelligence



34/37
@reach_vb
Massive feat! I love how concise and to-the-point the generations are, unlike the majority of LLMs, open/closed alike 😁



35/37
@DCbuild3r
Congratulations!



36/37
@DoctorYev
I just woke up and this post has 1M views after a few hours.

AI does not sleep.



37/37
@AndiBunari1
@grok summarize this and simple to understand














1/10
@polynoamial
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵

[Quoted tweet]
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).


GwLl5lhXIAAXl5p.jpg


2/10
@polynoamial
Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.



3/10
@polynoamial
So what’s different? We developed new techniques that make LLMs a lot better at hard-to-verify tasks. IMO problems were the perfect challenge for this: proofs are pages long and take experts hours to grade. Compare that to AIME, where answers are simply an integer from 0 to 999.



4/10
@polynoamial
Also this model thinks for a *long* time. o1 thought for seconds. Deep Research for minutes. This one thinks for hours. Importantly, it’s also more efficient with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

[Quoted tweet]
@OpenAI's o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks. Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots


GXSs0RuWkAElX2T.png


5/10
@polynoamial
It’s worth reflecting on just how fast AI progress has been, especially in math. In 2024, AI labs were using grade school math (GSM8K) as an eval in their model releases. Since then, we’ve saturated the (high school) MATH benchmark, then AIME, and now are at IMO gold.



6/10
@polynoamial
Where does this go? As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery. There’s a big difference between AI slightly below top human performance vs slightly above.



7/10
@polynoamial
This was a small team effort led by @alexwei_. He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community.



8/10
@polynoamial
When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is.



9/10
@posedscaredcity
But yann lec00n says accuracy scales inversely to output length and im sure industry expert gary marcus would agree



10/10
@mrlnonai
will API cost be astronomical for this?




 