Reasoning skills of large language models are often overestimated

bnew

Veteran







1/11
We’re presenting the first AI to solve International Mathematical Olympiad problems at a silver medalist level.🥈

It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system. 🧵

2/11
Our system had to solve this year's six IMO problems, involving algebra, combinatorics, geometry & number theory. We then invited mathematicians @wtgowers and Dr Joseph K Myers to oversee scoring.

It solved 4️⃣ problems to gain 28 points - equivalent to earning a silver medal. ↓

3/11
For non-geometry, it uses AlphaProof, which can create proofs in Lean. 🧮

It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself to master games like chess, shogi and Go.

4/11
Math programming languages like Lean allow answers to be formally verified. But their use has been limited by the scarcity of available human-written data. 💡

So we fine-tuned a Gemini model to translate natural language problems into a set of formal ones for training AlphaProof.
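
For a sense of what "formally verified" means here, a toy Lean 4 statement (my own illustration, not AlphaProof output): if Lean accepts the file, the proof is machine-checked, with no human grading needed.

```lean
-- A toy machine-checkable statement: if this compiles,
-- the proof is formally verified.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```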

5/11
When presented with a problem, AlphaProof attempts to prove or disprove it by searching over possible steps in Lean. 🔍

Each success is then used to reinforce its neural network, making it better at tackling subsequent, harder problems.
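
DeepMind hasn't published AlphaProof's code, so the following is only a rough Python sketch of the search-and-reinforce loop described above; policy, sample_tactic, and lean are hypothetical stand-ins, not actual components.

```python
# Rough sketch of an AlphaZero-style prove-and-reinforce loop.
# policy, sample_tactic, and lean are hypothetical stand-ins.
def prove_and_learn(problems, policy, sample_tactic, lean, max_steps=64):
    for problem in problems:
        state, trace = lean.initial_state(problem), []
        for _ in range(max_steps):
            tactic = sample_tactic(policy, state)      # search guided by the network
            ok, new_state = lean.apply(state, tactic)  # Lean checks every step
            if not ok:
                continue                               # rejected step; sample again
            state = new_state
            trace.append((state, tactic))
            if lean.is_proved(state):
                policy.reinforce(trace)                # proof becomes training signal
                break
```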

6/11
With geometry, it deploys AlphaGeometry 2: a neuro-symbolic hybrid system.

Its Gemini-based language model was trained on an increased amount of synthetic data, enabling it to tackle more types of problems - such as those involving movements of objects. 📐

7/11
Powered by a novel search algorithm, AlphaGeometry 2 can now solve 83% of all historical problems from the past 25 years - compared to its predecessor's 53% rate.

It solved this year’s IMO Problem 4 within 19 seconds. 🚀

Here’s an illustration showing its solution ↓

8/11
We’re excited to see how our new system could help accelerate AI-powered mathematics, from quickly completing elements of proofs to eventually discovering new knowledge for us - and unlocking further progress towards AGI.

Find out more → AI achieves silver-medal standard solving International Mathematical Olympiad problems

9/11
thank you for this hard work and thank you for sharing it with the world <3

10/11
That is astonishing

11/11
Amazing. Congrats!







 

bnew

Veteran



[Submitted on 11 Jun 2024 (v1), last revised 13 Jun 2024 (this version, v2)]

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B​

Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang
This paper introduces the MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS), designed to enhance performance in complex mathematical reasoning tasks. Addressing the challenges of accuracy and reliability in LLMs, particularly in strategic and mathematical reasoning, MCTSr leverages systematic exploration and heuristic self-refine mechanisms to improve decision-making frameworks within LLMs. The algorithm constructs a Monte Carlo search tree through iterative processes of Selection, self-refine, self-evaluation, and Backpropagation, utilizing an improved Upper Confidence Bound (UCB) formula to optimize the exploration-exploitation balance. Extensive experiments demonstrate MCTSr's efficacy in solving Olympiad-level mathematical problems, significantly improving success rates across multiple datasets, including GSM8K, GSM Hard, MATH, and Olympiad-level benchmarks such as Math Odyssey, AIME, and OlympiadBench. The study advances the application of LLMs in complex reasoning tasks and sets a foundation for future AI integration, enhancing decision-making accuracy and reliability in LLM-driven applications.
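
For context, the classic UCT selection rule that the paper's "improved UCB formula" adapts balances a node's mean reward against how rarely it has been visited (the paper's exact variant differs; this is the textbook form):

```latex
\mathrm{UCT}_j = \bar{X}_j + C \sqrt{\frac{2 \ln N}{n_j}}
```

where \bar{X}_j is the mean reward of child j, n_j its visit count, N the parent's visit count, and C the exploration constant.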

Submission history​

From: Di Zhang [view email]
[v1] Tue, 11 Jun 2024 16:01:07 UTC (106 KB)
[v2] Thu, 13 Jun 2024 07:19:06 UTC (106 KB)






1/1
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B: A Technical Report









1/6
It's finally here. Q* rings true. Tiny LLMs are as good at math as a frontier model.

By using the same techniques Google used to master Go (MCTS and backpropagation), Llama-3 8B gets 96.7% on the math benchmark GSM8K!

That's better than GPT-4, Claude and Gemini, with 200x fewer parameters!

2/6
Source: https://arxiv.org/pdf/2406.07394

3/6
I'd imagine these are the techniques the code foundation-model trainers are using, but I wonder:

a) whether you're limited by the ability of the base open-source model - you might get it to frontier-model level, but barely.
b) whether you can generate a large enough volume of synthetic code data at reasonable cost.
c) whether doing this on a 1T+ parameter model becomes prohibitively expensive.

4/6
The (purported) technique isn’t tied to a particular model

5/6
Come on it's SLIGHTLY harder than that 😆

6/6
Shanghai AI lab made rumored Q* reality





1/1
A prompt-level formula to add Search to LLM

🧵📖 Read of the day, day 85: Accessing GPT-4 level Mathematical Olympiad Solutions Via Monte Carlo Tree Self-refine with Llama-3 8B: A technical report, by Zhang et al from Shanghai Artificial Intelligence Laboratory

https://arxiv.org/pdf/2406.07394

The authors of this paper introduce a Monte Carlo Tree Search-like method to enhance model generation. They call it Monte Carlo Tree Self-Refine, shortened to MCTSr.

Their method is based solely on prompting the model and does not modify its weights, yet it greatly enhances the results.

How?
1- Generate a root node from a naive answer (or a dummy one)
2- Use a value function Q to rank answers that were not yet expanded, and greedily select the best
3- Optimize the answer by generating feedback, then exploiting it
4- Compute the Q value of the refined answer
5- Update the values of parent nodes
6- Identify candidate nodes for expansion and use the UCT formula to update all nodes before iterating again
7- Iterate until max steps are reached

The value function Q works by prompting the model to score its own answer; the model is prompted several times and its scores are averaged. The backpropagation and UCT formulas can be found in the paper.
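
Putting steps 1-7 together, a minimal Python sketch of the loop might look like this. The llm_answer, llm_feedback, llm_refine, and llm_score callables are hypothetical stand-ins for the prompts in the paper's appendix, and the node-update rule is simplified relative to the paper's formulas.

```python
import math

def mctsr(question, llm_answer, llm_feedback, llm_refine, llm_score,
          max_iters=8, c=1.4, n_score_samples=3):
    def q_value(answer):
        # 4) Self-evaluation: prompt the model to score its own answer
        # several times and average the scores.
        scores = [llm_score(question, answer) for _ in range(n_score_samples)]
        return sum(scores) / len(scores)

    # 1) Root node from a naive answer.
    root = {"answer": llm_answer(question), "q": None, "visits": 1,
            "children": [], "parent": None}
    root["q"] = q_value(root["answer"])
    nodes = [root]

    for _ in range(max_iters):
        # 2) + 6) Greedily select the best unexpanded node by a UCT score.
        total = sum(n["visits"] for n in nodes)
        leaves = [n for n in nodes if not n["children"]]
        node = max(leaves, key=lambda n: n["q"] +
                   c * math.sqrt(math.log(total) / n["visits"]))

        # 3) Self-refine: generate feedback on the answer, then exploit it.
        feedback = llm_feedback(question, node["answer"])
        refined = llm_refine(question, node["answer"], feedback)
        child = {"answer": refined, "q": q_value(refined), "visits": 1,
                 "children": [], "parent": node}
        node["children"].append(child)
        nodes.append(child)

        # 5) Backpropagate: update visit counts and values of ancestors
        # (simplified; the paper's exact update formulas differ).
        up = node
        while up is not None:
            up["visits"] += 1
            up["q"] = max(up["q"], child["q"])
            up = up["parent"]

    # 7) After max iterations, return the best answer found.
    return max(nodes, key=lambda n: n["q"])["answer"]
```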

The authors then evaluate 4 rollouts and 8 rollouts MCTSr on a Llama-3 8B and compare it to GPT-4, Claude 3 Opus and Gemini-1.5 Pro on mathematical problems.

They first find that such sampling greatly increases performance on both the GSM8K and MATH datasets, reaching frontier-model-level performance on GSM8K (still below on MATH, but greatly improved).

The authors then evaluate the models on harder benchmarks. MCTSr improves model performance across all of them. They notice that on Math Odyssey, the 8-rollout MCTSr is on the level of GPT-4!

Prompts can be found within the appendix.
Code is open-sourced at: GitHub - trotsky1997/MathBlackBox

Personal Thoughts: While this research is still at a preliminary stage, the results are quite impressive given that they are obtained purely by prompting. The fact that a mere 8B model can reach frontier-level performance on benchmarks is nothing to laugh at. It also tells us there's still a lot to discover even with LLMs alone!



bnew

Veteran

AI in practice
Aug 22, 2024

New prompting method can help improve LLM reasoning skills​


Midjourney prompted by THE DECODER

New prompting method can help improve LLM reasoning skills


Chinese researchers have created a technique that enables large language models (LLMs) to recognize and filter out irrelevant information in text-based tasks, leading to significant improvements in their logical reasoning abilities.

The research team from Guilin University of Electronic Technology and other institutions developed the GSMIR dataset, which consists of 500 elementary school math problems intentionally injected with irrelevant sentences. GSMIR is derived from the existing GSM8K dataset.

Tests on GSMIR showed that GPT-3.5-Turbo and GPT-3.5-Turbo-16k could identify irrelevant information in up to 74.9% of cases. However, even after detecting it, the models could not automatically exclude that information before solving a task.

Recognizing and filtering irrelevant information - and only then responding​


To address this, the researchers developed the two-stage "Analysis to Filtration Prompting" (ATF) method. First, the model analyzes the task and identifies irrelevant information by examining each sub-sentence. It then filters out this information before starting the actual reasoning process.

The two-step ATF prompt process. First it analyzes, then it filters, and only then the model responds. | Image: Jiang et al.
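
A minimal sketch of how the two ATF stages could be wired up at the prompt level; the prompt wording here is my paraphrase, not the exact prompts from Jiang et al.

```python
# Prompt templates paraphrasing the two ATF stages.
ANALYSIS_PROMPT = (
    "Read the following problem sentence by sentence and list every "
    "sentence that is irrelevant to solving it.\n\nProblem: {problem}"
)
FILTRATION_PROMPT = (
    "Problem: {problem}\n\n"
    "These sentences were identified as irrelevant and must be ignored:\n"
    "{irrelevant}\n\n"
    "Now solve the problem step by step using only the relevant information."
)

def analysis_to_filtration(problem: str, llm) -> str:
    # Stage 1 (analysis): the model flags irrelevant sub-sentences.
    irrelevant = llm(ANALYSIS_PROMPT.format(problem=problem))
    # Stage 2 (filtration): reason only after excluding the flagged text.
    return llm(FILTRATION_PROMPT.format(problem=problem, irrelevant=irrelevant))
```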

Using ATF, the accuracy of LLMs in solving tasks with irrelevant information approached their performance on the original tasks without such distractions. The method worked with all tested prompting techniques.

The combination of ATF with "Chain-of-Thought Prompting" (COT) was particularly effective. For GPT-3.5-Turbo, accuracy increased from 50.2% without ATF to 74.9% with ATF – an improvement of nearly 25 percentage points.

Benchmark results comparing various prompting methods with and without ATF. The methods tested include standard, instructed, chain-of-thought (with and without examples), and least-to-most prompting. GSM8K-SLC represents the GSMIR data set without irrelevant information. The study presents two tables, although their differences are unclear. Most likely, the upper table shows results for GPT-3.5-Turbo-16k and the lower table shows results for GPT-3.5-Turbo, but the labeling is incorrect. Both tables show that ATF consistently improved accuracy across all prompting methods when solving data set tasks containing irrelevant information. | Image: Jiang et al.

The smallest improvement came when ATF was combined with Standard Prompting (SP), where accuracy increased by only 3.3 percentage points. The researchers suggest that this is because SP's accuracy on the original questions was already very low at 18.5%, with most errors likely due to calculation errors rather than irrelevant information.

Because the ATF method is specifically designed to reduce the impact of irrelevant information, but not to improve the general computational ability of LLMs, the effect of ATF in combination with SP was limited.

With other prompting techniques, such as COT, which better support LLMs in correctly solving reasoning tasks, ATF was able to improve performance more significantly because irrelevant information accounted for a larger proportion of errors.

The study has some limitations. Experiments were conducted only with GPT-3.5, and the researchers only examined tasks containing a single piece of irrelevant information. In real-world scenarios, problem descriptions may contain multiple confounding factors.

In approximately 15% of cases, irrelevant information was not recognized as such. More than half of these instances involved "weak irrelevant information" that did not impact the model's ability to arrive at the correct answer.

This suggests that ATF is most effective for "strong irrelevant information" that significantly interferes with the reasoning process. Only 2.2% of cases saw relevant information incorrectly classified as irrelevant.

Despite these limitations, the study shows that language models' logical reasoning abilities can be enhanced by filtering out irrelevant information through prompt engineering. While the ATF method could help LLMs better handle noisy real-world data, it does not address their fundamental weaknesses in logic.

Summary
  • Researchers at Guilin University of Electronic Technology have developed a technique that helps large language models (LLMs) identify and remove irrelevant information in text-based tasks, significantly improving their reasoning capabilities.
  • The two-step "Analysis to Filtration Prompting" (ATF) method first analyzes the task and identifies irrelevant information by examining each sub-sentence. It then filters out this information before the model begins its reasoning process. When combined with Chain-of-Thought Prompting (COT), the accuracy of GPT-3.5-Turbo improved by nearly 25 percentage points, from 50.2% to 74.9%.
  • The study has limitations. Only GPT-3.5 variants were tested, and the tasks each contained only one piece of irrelevant information. Real-world scenarios often involve multiple confounding factors.

Sources
Paper

Matthias Bastian


 

bnew

Veteran

1/1
Current AI systems have limited Reasoning abilities, but they can improve

Yes, current systems do reason.

Many people argue that what they do "isn't reasoning," but the difference between fake and "real" reasoning doesn't seem very important.

Their reasoning has limits, especially when multiple steps build on each other. Sometimes outputs are given without any real reasoning behind them, leading to hallucinations.
Even worse, if asked to explain a hallucinated answer, the system may invent a made-up chain of reasoning for it.

These limits on reasoning ability, failing to apply reasoning when needed, and generating false reasoning are major challenges for LLMs.

Despite this, there's optimism. Instead of needing new methods, frameworks like CoT, tree search, and society-of-minds approaches could greatly improve reasoning in existing LLMs.
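
As a concrete example of the prompt-level scaffolding the tweet mentions, zero-shot chain of thought is just an appended reasoning trigger; llm here is a hypothetical stand-in for any completion API.

```python
def answer_with_cot(question: str, llm) -> str:
    # Zero-shot chain of thought: a trigger phrase elicits
    # intermediate reasoning steps before the final answer.
    return llm(f"{question}\n\nLet's think step by step.")
```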



bnew

Veteran




1/13
@tedx_ai
This is probably the most interesting prompting technique of 2024 🤯

Self-Harmonized Chain of Thought (ECHO) = CoT reasoning with a self-learning, adaptive and iterative refinement process ✨

1/ ECHO begins by clustering a given dataset of questions based on their semantic similarity

2/ Each cluster then has a representative question selected and the model generates a reasoning chain for that question using zero-shot Chain of Thought (CoT) prompting - breaking down the solution into intermediate steps.

3/ During each iteration, one reasoning chain is randomly chosen for regeneration, while the remaining chains from other clusters are used as in-context examples to guide improvement.

So what’s so special about this?

> Reasoning patterns can cross-pollinate - as in, if one chain contains errors or knowledge gaps, other chains can help fill in those weaknesses

> Reasoning chains can be regenerated and improved multiple times - leading to a well-harmonized set of solutions where errors and knowledge gaps are gradually eliminated

This is like a more dynamic and scalable alternative to Google Deepmind’s "Self Discover" prompting technique but for CoT reasoning chains that adapt and improve over time across complex problem spaces.

> Ziqi Jin & Wei Lu (Sept 2024). "Self-Harmonized Chain of Thought"

For more on this, you can find a link to the paper and Github down below 👇
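
A minimal sketch of those three stages, under stated assumptions: embed, zero_shot_cot, and regenerate_with_demos are hypothetical stand-ins for the paper's components, and the choice of cluster representative is simplified.

```python
import random
from sklearn.cluster import KMeans

def echo(questions, embed, zero_shot_cot, regenerate_with_demos,
         n_clusters=8, n_iters=4):
    # 1) Cluster the questions by semantic similarity of their embeddings.
    vectors = [embed(q) for q in questions]
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)

    # 2) Take one representative question per cluster and generate a
    # zero-shot CoT reasoning chain for it. (Representative selection is
    # simplified here to "first member of the cluster".)
    demos = []
    for k in range(n_clusters):
        rep = next(q for q, lab in zip(questions, labels) if lab == k)
        demos.append({"question": rep, "chain": zero_shot_cot(rep)})

    # 3) Iteratively pick one chain at random and regenerate it, using the
    # chains from the other clusters as in-context examples so that
    # reasoning patterns can cross-pollinate.
    for _ in range(n_iters):
        i = random.randrange(len(demos))
        others = [d for j, d in enumerate(demos) if j != i]
        demos[i]["chain"] = regenerate_with_demos(demos[i]["question"], others)

    return demos  # a harmonized demonstration set for few-shot CoT prompting
```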

2/13
@tedx_ai
Paper: [2409.04057] Self-Harmonized Chain of Thought

3/13
@tedx_ai
Github: GitHub - Xalp/ECHO: Official homepage for "Self-Harmonized Chain of Thought"

They have not yet published the code btw... I assume it'll be there soon though!

4/13
@pauljunsukHan
A 2.8% improvement; also, there's no explicit mechanism to cross-pollinate or reclassify clusters if reasoning chains across clusters start to converge.

5/13
@tedx_ai
Great point! I think there needs to be more evals done across additional domains to see just how effective this is…

I really liked Google's Self-Discover prompting method from earlier this year, which was 32% more effective than CoT and, compared against CoT with self-consistency, required 10-40x less compute.

I think approaches that merge some form of optimization on top of a Self-Discover-like algorithm can be incredibly powerful… the design pattern used in self-harmonizing may be a good candidate to make the Self-Discover prompting technique more adaptive and scalable for more complex problem spaces.

I’ll draw out a diagram of a hybrid approach and work on a POC for this soon!

6/13
@ddebowczyk
Thank you, good post

7/13
@ciphertrees
At some point there is going to be a program that figures out what you want and enters the prompt for you. It will probably know your pulse. 😉

8/13
@Stock_Pursuit
Thanks for sharing this

9/13
@iamRezaSayar
You are a gentleman and a scholar. like, literally.

10/13
@MillenniumTwain
The BIG Grok!

11/13
@JfkWhitlam
@IntuitMachine

12/13
@robertomasymas
This sounds like you would need a model with a very large embedding length

13/13
@alejogb1
can i use this for jian wei? curious to test a fictional ai character, jein wang, for search engines and ai technical query responses using llm-generated content.


 

bnew

Veteran

Introducing OpenAI o1​

We've developed a new series of AI models designed to spend more time thinking before they respond. Here is the latest news on o1 research, product and other updates.
Try it in ChatGPT Plus | Try it in the API

September 12
Product

Introducing OpenAI o1-preview​

We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
Learn more

September 12
Research

Learning to Reason with LLMs​

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.
Learn more

September 12
Research

OpenAI o1-mini​

OpenAI o1-mini excels at STEM, especially math and coding—nearly matching the performance of OpenAI o1 on evaluation benchmarks such as AIME and Codeforces. We expect o1-mini will be a faster, cost-effective model for applications that require reasoning without broad world knowledge.
Learn more

September 12
Safety

OpenAI o1 System Card​

This report outlines the safety work carried out prior to releasing OpenAI o1-preview and o1-mini, including external red teaming and frontier risk evaluations according to our Preparedness Framework.
Learn more






1/2
OpenAI o1-preview and o1-mini are rolling out today in the API for developers on tier 5.

o1-preview has strong reasoning capabilities and broad world knowledge.

o1-mini is faster, 80% cheaper, and competitive with o1-preview at coding tasks.

More in https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/.

2/2
OpenAI o1 isn’t a successor to gpt-4o. Don’t just drop it in—you might even want to use gpt-4o in tandem with o1’s reasoning capabilities.

Learn how to add reasoning to your product: http://platform.openai.com/docs/guides/reasoning.

After this short beta, we’ll increase rate limits and expand access to more tiers (https://platform.openai.com/docs/guides/rate-limits/usage-tiers). o1 is also available in ChatGPT now for Plus subscribers.
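
For developers with access, a minimal call through the official OpenAI Python SDK looks like the sketch below; the launch-time constraints noted in the comments may change, so check the reasoning guide linked above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# At launch, o1-preview accepted only user/assistant messages (no system
# role) and no temperature control; see the reasoning guide for current limits.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user",
         "content": "Prove that the square root of 2 is irrational."},
    ],
)
print(response.choices[0].message.content)
```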










1/8
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond.

These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math. https://openai.com/index/introducing-openai-o1-preview/

2/8
Rolling out today in ChatGPT to all Plus and Team users, and in the API for developers on tier 5.

3/8
OpenAI o1 solves a complex logic puzzle.

4/8
OpenAI o1 thinks before it answers and can produce a long internal chain-of-thought before responding to the user.

o1 ranks in the 89th percentile on competitive programming questions, places among the top 500 students in the US in a qualifier for the USA Math Olympiad, and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems.

https://openai.com/index/learning-to-reason-with-llms/

5/8
We're also releasing OpenAI o1-mini, a cost-efficient reasoning model that excels at STEM, especially math and coding.
https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/

6/8
OpenAI o1 codes a video game from a prompt.

7/8
OpenAI o1 answers a famously tricky question for large language models.

8/8
OpenAI o1 translates a corrupted sentence.





1/2
Some of our researchers behind OpenAI o1 🍓

2/2
The full list of contributors: https://openai.com/openai-o1-contributions/







1/5
o1-preview and o1-mini are here. they're by far our best models at reasoning, and we believe they will unlock wholly new use cases in the api.

if you had a product idea that was just a little too early, and the models were just not quite smart enough -- try again.

2/5
these new models are not quite a drop in replacement for 4o.

you need to prompt them differently and build your applications in new ways, but we think they will help close a lot of the intelligence gap preventing you from building better products

3/5
learn more here https://openai.com/index/learning-to-reason-with-llms/

4/5
(rolling out now for tier 5 api users, and for other tiers soon)

5/5
o1-preview and o1-mini don't yet work with search in chatgpt




1/1
Excited to bring o1-mini to the world with @ren_hongyu @_kevinlu @Eric_Wallace_ and many others. A cheap model that can achieve 70% on AIME and a 1650 Elo on Codeforces.

https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/












1/10
Today, I’m excited to share with you all the fruit of our effort at @OpenAI to create AI models capable of truly general reasoning: OpenAI's new o1 model series! (aka 🍓) Let me explain 🧵 1/

2/10
Our o1-preview and o1-mini models are available immediately. We’re also sharing evals for our (still unfinalized) o1 model to show the world that this isn’t a one-off improvement – it’s a new scaling paradigm and we’re just getting started. 2/9

3/10
o1 is trained with RL to “think” before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We’re no longer bottlenecked by pretraining. We can now scale inference compute too.
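
One simple public illustration of this inference-compute axis is self-consistency sampling: not OpenAI's actual method (o1 scales a private chain of thought), but a minimal sketch of trading more inference compute for accuracy.

```python
from collections import Counter

def self_consistency(question, sample_answer, n_samples=16):
    # Spend more inference compute by sampling many reasoning paths,
    # then majority-vote on their final answers.
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```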

4/10
Our o1 models aren’t always better than GPT-4o. Many tasks don’t need reasoning, and sometimes it’s not worth it to wait for an o1 response vs a quick GPT-4o response. One motivation for releasing o1-preview is to see what use cases become popular, and where the models need work.

5/10
Also, OpenAI o1-preview isn’t perfect. It sometimes trips up even on tic-tac-toe. People will tweet failure cases. But on many popular examples people have used to show “LLMs can’t reason”, o1-preview does much better, o1 does amazing, and we know how to scale it even further.

6/10
For example, last month at the 2024 Association for Computational Linguistics conference, the keynote by @rao2z was titled “Can LLMs Reason & Plan?” In it, he showed a problem that tripped up all LLMs. But @OpenAI o1-preview can get it right, and o1 gets it right almost always

7/10
@OpenAI's o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks. Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots

8/10
When I joined @OpenAI, I wrote about how my experience researching reasoning in AI for poker and Diplomacy, and seeing the difference “thinking” made, motivated me to help bring the paradigm to LLMs. It happened faster than expected, but still rings true:

9/10
🍓/ @OpenAI o1 is the product of many hard-working people, all of whom made critical contributions. I feel lucky to have worked alongside them this past year to bring you this model. It takes a village to grow a 🍓

10/10
You can read more about the research here: https://openai.com/index/learning-to-reason-with-llms/



1/1
Super excited to finally share what I have been working on at OpenAI!

o1 is a model that thinks before giving the final answer. In my own words, here are the biggest updates to the field of AI (see the blog post for more details):

1. Don’t do chain of thought purely via prompting, train models to do better chain of thought using RL.

2. In the history of deep learning we have always tried to scale training compute, but chain of thought is a form of adaptive compute that can also be scaled at inference time.

3. Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

4. Having AI models do chain of thought in human language is great in so many ways. The model does a lot of human-like things, like breaking down tricky steps into simpler ones, recognizing and correcting mistakes, and trying different approaches. I'd highly encourage everyone to look at the chain-of-thought examples in the blog post.

The game has been totally redefined.







1/3
🚨🚨Early findings for o1-preview and o1-mini!🚨🚨
(1) The o1 family is unbelievably strong at hard reasoning problems! o1 perfectly solves a reasoning task that my collaborators and I designed just 3 months ago to hold LLMs below 60% performance 🤯🤯 (1 / ?)

2/3
(2) o1-mini is better than o1-preview 🤔❓
@sama what's your take!
3/3
Results on the reasoning category for http://livebench.ai
The problems are here: livebench/reasoning · Datasets at Hugging Face
Full results on all of LiveBench coming soon!




 

bnew

Veteran







1/7
I have always believed that you don't need a GPT-6 quality base model to achieve human-level reasoning performance, and that reinforcement learning was the missing ingredient on the path to AGI.

Today, we have the proof -- o1.

2/7
o1 achieves human or superhuman performance on a wide range of benchmarks, from coding to math to science to common-sense reasoning, and is simply the smartest model I have ever interacted with. It's already replacing GPT-4o for me and so many people in the company.

3/7
Building o1 was by far the most ambitious project I've worked on, and I'm sad that the incredible research work has to remain confidential. As consolation, I hope you'll enjoy the final product nearly as much as we did making it.

4/7
The most important thing is that this is just the beginning for this paradigm. Scaling works, there will be more models in the future, and they will be much, much smarter than the ones we're giving access to today.

5/7
The system card (https://openai.com/index/openai-o1-system-card/) nicely showcases o1's best moments -- my favorite was when the model was asked to solve a CTF challenge, realized that the target environment was down, and then broke out of its host VM to restart it and find the flag.

6/7
Also check out our research blogpost (https://openai.com/index/learning-to-reason-with-llms/) which has lots of cool examples of the model reasoning through hard problems.

7/7
that's a great question :-)




1/1
o1-mini is the most surprising research result i've seen in the past year

obviously i cannot spill the secret, but a small model getting >60% on the AIME math competition is so good that it's hard to believe

congrats @ren_hongyu @shengjia_zhao for the great work!






1/4
here is o1, a series of our most capable and aligned models yet:

https://openai.com/index/learning-to-reason-with-llms/

o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.

2/4
but also, it is the beginning of a new paradigm: AI that can do general-purpose complex reasoning.

o1-preview and o1-mini are available today (ramping over some number of hours) in ChatGPT for plus and team users and our API for tier 5 users.

3/4
screenshot of eval results in the tweet above and more in the blog post, but worth especially noting:

a fine-tuned version of o1 scored at the 49th percentile in the IOI under competition conditions! and got gold with 10k submissions per problem.

4/4
extremely proud of the team; this was a monumental effort across the entire company.

hope you enjoy it!
 

bnew

Veteran

PoorAndDangerous

Superstar
the new GPT model has been super impressive so far for logical reasoning when troubleshooting coding issues. I may cancel my Anthropic sub because it is definitely better than Claude.
 

Micky Mikey

Veteran
Supporter
the new GPT model has been super impressive so far for logical reasoning when troubleshooting coding issues. I may cancel my Anthropic sub because it is definitely better than Claude.
I like Anthropic as a company better, but you're right, the new OpenAI model is quite impressive. I think Anthropic will drop something soon to stay competitive.
 