Reasoning skills of large language models are often overestimated

bnew · Jul 12, 2024

MIT CSAIL researchers developed an evaluation framework for large language models about counterfactual tasks. They found that LLMs can recite answers, but struggle to reason as it relates to abstract task-solving.

news.mit.edu

New CSAIL research highlights how LLMs excel in familiar scenarios but struggle in novel ones, questioning their true reasoning abilities versus reliance on memorization.

Rachel Gordon | MIT CSAIL

Publication Date:

July 11, 2024

A cartoon android recites an answer to a math problem from a textbook in one panel and reasons about that same answer in another

Caption:

MIT researchers examined how LLMs fare with variations of different tasks, putting their memorization and reasoning skills to the test. The result: Their reasoning abilities are often overestimated.

Credits:

Image: Alex Shipps/MIT CSAIL

When it comes to artificial intelligence, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their vast size, complex training methods, hard-to-predict behaviors, and elusive interpretability.

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) recently put LLMs under the proverbial magnifying glass to examine how they fare with variations of different tasks, revealing intriguing insights into the interplay between memorization and reasoning skills. It turns out that their reasoning abilities are often overestimated.

The study compared “default tasks,” the common tasks a model is trained and tested on, with “counterfactual scenarios,” hypothetical situations deviating from default conditions — which models like GPT-4 and Claude can usually be expected to cope with. The researchers developed some tests outside the models’ comfort zones by tweaking existing tasks instead of creating entirely new ones. They used a variety of datasets and benchmarks specifically tailored to different aspects of the models' capabilities for things like arithmetic, chess, evaluating code, answering logical questions, etc.

When users interact with language models, any arithmetic is usually in base-10, the number base most familiar to the models. But observing that they do well in base-10 could give a false impression of strong competency in addition. Logically, if they truly possessed good addition skills, you’d expect reliably high performance across all number bases, as with calculators or computers. Indeed, the research showed that these models are not as robust as many initially think. Their high performance is limited to common task variants, and they suffer a consistent and severe performance drop in unfamiliar counterfactual scenarios, indicating a lack of generalizable addition ability.
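To make the base-10 point concrete, here is a minimal sketch of digit-by-digit addition that works in any base. The function name and the sample operands are illustrative choices, not taken from the paper; the point is only that the same procedure gives different (but equally valid) answers once the base changes.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def add_in_base(a: str, b: str, base: int) -> str:
    """Add two non-negative numerals written in `base`, digit by digit with carry."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 36")
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    # Walk both numerals from the least significant digit to the most significant.
    while i >= 0 or j >= 0 or carry:
        da = DIGITS.index(a[i].upper()) if i >= 0 else 0
        db = DIGITS.index(b[j].upper()) if j >= 0 else 0
        if da >= base or db >= base:
            raise ValueError("digit out of range for this base")
        carry, digit = divmod(da + db + carry, base)
        result.append(DIGITS[digit])
        i -= 1
        j -= 1
    return "".join(reversed(result)) or "0"


print(add_in_base("27", "62", 10))  # → 89
print(add_in_base("27", "62", 9))   # → 100
```

A system that has really learned the carry procedure should handle both calls equally well; one that has mostly memorized base-10 sums would be expected to get the second one wrong.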

The pattern held true for many other tasks like musical chord fingering, spatial reasoning, and even chess problems where the starting positions of pieces were slightly altered. While human players are expected to still be able to determine the legality of moves in altered scenarios (given enough time), the models struggled and couldn’t perform better than random guessing, meaning they have limited ability to generalize to unfamiliar situations. And much of their performance on the standard tasks is likely not due to general task abilities, but overfitting to, or directly memorizing from, what they have seen in their training data.

“We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons,” says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author on a new paper about the research. “As AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness.”

Despite the insights gained, there are, of course, limitations. The study’s focus on specific tasks and settings didn’t capture the full range of challenges the models could potentially encounter in real-world applications, signaling the need for more diverse testing environments. Future work could involve expanding the range of tasks and counterfactual conditions to uncover more potential weaknesses. This could mean looking at more complex and less common scenarios. The team also wants to improve interpretability by creating methods to better comprehend the rationale behind the models’ decision-making processes.

“As language models scale up, understanding their training data becomes increasingly challenging even for open models, let alone proprietary ones,” says Hao Peng, assistant professor at the University of Illinois at Urbana-Champaign. “The community remains puzzled about whether these models genuinely generalize to unseen tasks, or seemingly succeed by memorizing the training data. This paper makes important strides in addressing this question. It constructs a suite of carefully designed counterfactual evaluations, providing fresh insights into the capabilities of state-of-the-art LLMs. It reveals that their ability to solve unseen tasks is perhaps far more limited than anticipated by many. It has the potential to inspire future research towards identifying the failure modes of today’s models and developing better ones.”

Additional authors include Najoung Kim, who is a Boston University assistant professor and Google visiting researcher, and seven CSAIL affiliates: MIT electrical engineering and computer science (EECS) PhD students Linlu Qiu, Alexis Ross, Ekin Akyürek SM ’21, and Boyuan Chen; former postdoc and Apple AI/ML researcher Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.

The team’s study was supported, in part, by the MIT–IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation. The team presented the work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.

Professor Emeritus · Jul 12, 2024

Similar logic applies for at least a couple posters on here.

bnew · Jul 12, 2024

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

[Submitted on 5 Jul 2023 (v1), last revised 28 Mar 2024 (this version, v3)]

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

Zhaofeng Wu

Abstract: The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.

Comments:	NAACL 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2307.02477
	arXiv:2307.02477v3
	[2307.02477] Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

https://arxiv.org/pdf/2307.02477

bnew · Jul 12, 2024

it should be noted that although they reference chatgpt/gpt-4 and claude, there are specific versions of them and obviously all models have improved since the study was done.

GPT-4 (gpt-4-0314) is the snapshot OpenAI launched on March 14, 2023.
GPT-3.5 (gpt-3.5-turbo-0301) is the March 1, 2023 snapshot of GPT-3.5 Turbo; fine-tuning for GPT-3.5 Turbo, which allows customization for specific use cases, was introduced in August 2023.
Claude 3.5 Sonnet, part of the Claude model family, was officially released on June 20, 2024, well after the claude-v1.3 model the paper evaluated. Anthropic reports that it outperforms competitor models and its predecessor, Claude 3 Opus, across various evaluations.
PaLM-2 (text-bison-001) was unveiled by Google in May 2023. It is designed for multilingual tasks, reasoning, coding, and more.

snippet from paper:

4 Results

For each task, we evaluate GPT-4 (gpt-4-0314; OpenAI, 2023), GPT-3.5 (gpt-3.5-turbo-0301), Claude (claude-v1.3; Anthropic, 2023), and PaLM-2 (text-bison-001; Anil et al., 2023). As these are closed-source models, we do not have any information regarding their size, architecture, and pretraining details. We note that the largest PaLM-2 model is not publicly accessible, and we can only test the second-largest version. For each task, we experiment both with and without encouraging the model to reason step by step, by adding the phrase “Let’s think step by step.” in our prompts (Kojima et al., 2023; Reynolds and McDonell, 2021). Following Kojima et al. (2023), we refer to this step-by-step setup as zero-shot chain-of-thought prompting (0-CoT; Nye et al., 2021; Wei et al., 2022). We include all prompts in §B.
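The with/without step-by-step setup described in that snippet can be sketched roughly as follows. Only the trigger phrase comes from the paper; the helper names and the way the phrase is appended are assumptions for illustration, since the actual prompt templates live in the paper's §B and are not reproduced here.

```python
# The trigger phrase is quoted from the paper snippet; everything else is hypothetical.
COT_SUFFIX = "Let's think step by step."


def build_prompt(question: str, use_cot: bool) -> str:
    """Return the question as-is, or with the zero-shot CoT phrase appended."""
    question = question.strip()
    return f"{question}\n{COT_SUFFIX}" if use_cot else question


def prompt_pair(question: str) -> dict:
    """Build both variants so the same item can be scored under each setup."""
    return {
        "direct": build_prompt(question, use_cot=False),
        "0-CoT": build_prompt(question, use_cot=True),
    }


for name, prompt in prompt_pair("In base-9, what is 27 + 62?").items():
    print(f"[{name}]\n{prompt}\n")
```

Running every task item through both variants is what lets the default vs. counterfactual gap be reported separately for each prompting style.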

Figures 2 and 3 show our results. §C contains the numeric version. We see a consistent pattern where LMs perform substantially worse on the counterfactual task variants, both with and without 0-shot CoT. For most cases, LMs exhibit an above-random counterfactual performance, suggesting some degree of the targeted ability. However, when the CCC accuracy is high, as is usually the case for GPT-4 and in select settings for other models too, the gaps in default vs. counterfactual task performance demonstrate limitations in their abstract capacity to solve the target task. When the CCC accuracy is lower, the failure of counterfactual world comprehension would be a confounder to this conclusion, but often the gaps are so large (sometimes even dropping from near-perfect to near-zero, such as for arithmetic) that they are nonetheless strongly indicative of non-transferable, default-condition-specific implementations of the original task. The fact that the LMs sometimes cannot evaluate the CCC well under the counterfactual conditions, but can do so under the default conditions (e.g., for arithmetic, programming, drawing, etc.) itself also points to overfitting to the latter.

bnew · Jul 12, 2024

1/10
LLM agents have demonstrated promise in their ability to automate computer tasks, but face challenges with multi-step reasoning and planning. Towards addressing this, we propose an inference-time tree search algorithm for LLM agents to explicitly perform exploration and multi-step planning in interactive web environments.

It is the first tree search algorithm for LLM agents that shows effectiveness on realistic and complex web environments: on the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%.
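Since the numbers above mix relative gains with absolute success rates, a quick back-of-the-envelope check helps. This small sketch recovers the implied baseline rates from the figures quoted in the tweet; the function names are just illustrative.

```python
def baseline_from_relative_gain(new_rate: float, relative_gain: float) -> float:
    """Invert new = baseline * (1 + gain) to recover the baseline rate."""
    return new_rate / (1.0 + relative_gain)


def relative_gain(baseline: float, new_rate: float) -> float:
    """Relative improvement of new_rate over baseline, as a fraction."""
    return (new_rate - baseline) / baseline


# Figures quoted in the tweet: 26.4% after a 39.7% relative gain (VisualWebArena),
# and 19.2% after a 28.0% relative gain (WebArena).
vwa_baseline = baseline_from_relative_gain(26.4, 0.397)
wa_baseline = baseline_from_relative_gain(19.2, 0.280)

print(round(vwa_baseline, 1))  # → 18.9
print(round(wa_baseline, 1))   # → 15.0
```

So a "39.7% relative increase" here means roughly 7.5 more tasks solved per hundred, not a jump of 39.7 percentage points.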

2/10
We show in ablations that spending more compute (increasing the search budget) improves success rate. Even doing a small amount of search (c=5) substantially improves over the baseline (24.5% to 32.0%), and using larger search budgets achieves even better results:

3/10
We also found that increasing the size of the search tree is essential: we need to expand search trees along both the depth (d) and breadth (b). Our best results are achieved with search trees of maximum depth 5 and branching factor 5:

4/10
Search also provides consistent improvements across a diverse set of sites in (Visual)WebArena, introducing a relative improvement on certain sites by as much as 50%!

5/10
Search can improve the robustness of agents by filtering out bad actions. Shown here is a trajectory where greedily picking the first sampled actions would have led to a failure (the path in the first row). Search avoids this by exploring and pruning less promising paths.
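For anyone wondering what "search with a budget, depth and branching factor" looks like in code, here is a toy best-first search. It is a generic sketch, not the authors' implementation: the environment, `toy_propose` and `toy_value` are made-up stand-ins for the LLM action sampler and the value function, and the parameter names just mirror the c, d and b mentioned in the thread.

```python
import heapq
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # a state is simply the sequence of actions taken so far


def best_first_search(
    start: State,
    propose: Callable[[State], List[int]],
    value: Callable[[State], float],
    max_depth: int = 5,
    branching: int = 5,
    budget: int = 20,
) -> Tuple[State, float]:
    """Expand the highest-valued frontier node until the expansion budget runs out."""
    best_state, best_score = start, value(start)
    frontier = [(-best_score, 0, start)]  # max-heap via negated scores
    pushed = 1                            # unique tie-breaker for the heap
    expansions = 0
    while frontier and expansions < budget:
        _, _, state = heapq.heappop(frontier)
        if len(state) >= max_depth:
            continue                      # depth limit reached, nothing to expand
        expansions += 1
        for action in propose(state)[:branching]:
            child = state + (action,)
            score = value(child)
            if score > best_score:
                best_state, best_score = child, score
            heapq.heappush(frontier, (-score, pushed, child))
            pushed += 1
    return best_state, best_score


# Hypothetical toy task: the "right" trajectory is the action sequence (2, 0, 1).
TARGET = (2, 0, 1)


def toy_propose(state: State) -> List[int]:
    return [0, 1, 2]  # action 0 is always sampled first, so greedy picks it


def toy_value(state: State) -> float:
    matched = 0
    for got, want in zip(state, TARGET):
        if got != want:
            break
        matched += 1
    return matched / len(TARGET)


greedy: State = ()
for _ in range(len(TARGET)):
    greedy += (toy_propose(greedy)[0],)

print(greedy, toy_value(greedy))  # greedy takes the first sampled action each step
print(best_first_search((), toy_propose, toy_value, max_depth=3, branching=3, budget=3))
```

In this toy setup greedy ends at (0, 0, 0) with value 0.0, while three expansions of search are enough to reach (2, 0, 1). A bigger budget only matters when the value function is noisy enough to send the search down wrong branches first.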

6/10
Here is another task on the WebArena CMS environment, where performing more exploration through search helps the model to identify a trajectory that is likely to be more successful:

7/10
Our method is compatible with any baseline LLM agents, and demonstrates gains for both gpt-4o and Llama-3.

I'm very excited to see how far we can scale search: this will be a key component for LLM agents to allow us to expend more inference-time compute for stronger results.

8/10
Project page: Tree Search for Language Model Agents
Paper: https://jykoh.com/search-agents/paper.pdf
Code: GitHub - kohjingyu/search-agents: Code for the paper

Tree Search for Language Model Agents

This work was done at CMU with
@McaleerStephen
@dan_fried

@rsalakhu

9/10
Pretty much yes, you can see our value function here:

10/10
Thanks John!

bnew · Jul 12, 2024

1/7
Thrilled to share our latest paper “Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models”

LLMs struggle at higher depths of logical reasoning

Check out paper @ [2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

#NLProc #logic #reasoning

Read ↓ (1/6)

2/7
Proposed Multi-LogiEval, a systematically created QA dataset covering multi-step logical reasoning across three logic types: Propositional Logic (PL), First-Order Logic (FOL), and Non-Monotonic (NM).

Access Multi-LogiEval @GitHub - Mihir3009/Multi-LogiEval: A comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths

Read ↓ (2/6)

3/7
Our dataset provides ~1.6k high-quality instances that cover 33 inference rules and reasoning patterns and more than 60 complex combinations of these inference rules with a different number of reasoning steps (1~5).

Read ↓ (3/6)

4/7
Our data creation process consists of two major stages: (i) Generation of rule combination and (ii) Generation of data instances.

Read ↓ (4/6)

5/7
Evaluating LLMs on Multi-LogiEval leads to interesting findings:

- Longer chains don't guarantee better reasoning
- Larger open-source models perform worse than smaller ones
- LLMs struggle with context-based conclusions without explicit knowledge
- Many more..

Read ↓ (5/6)

6/7
Want to evaluate your LLM? Check out the Multi-LogiEval dataset @
GitHub - Mihir3009/Multi-LogiEval: A comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths

Take on its challenges and be part of advancing the capabilities of LLMs in logical reasoning!

Thanks to
@Nisarg14P @Mohith nrj_varshney @mutsumi32141651 @cbaral

Read (6/6)

7/7
Super excited to share that our paper on high-quality data generation using LLMs is accepted
@COLM_conf

Please check out our work - [2310.17876] TarGEN: Targeted Data Generation with Large Language Models

#NLProc #LLMs

[2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models
[Submitted on 24 Jun 2024]

Multi-LogiEval - Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, Chitta Baral

Abstract:
As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types--propositional, first-order, and non-monotonic--consisting of more than 30 inference rules and more than 60 of their combinations with various depths. Leveraging this dataset, we conduct evaluations on a range of LLMs including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, employing a zero-shot chain-of-thought. Experimental results show that there is a significant drop in the performance of LLMs as the reasoning steps/depth increases (average accuracy of ~68% at depth-1 to ~43% at depth-5). We further conduct a thorough investigation of reasoning chains generated by LLMs which reveals several important findings. We believe that Multi-LogiEval facilitates future research for evaluating and enhancing the logical reasoning ability of LLMs. Data is available at this https URL.
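To give a feel for what "reasoning depth" means here, this is a minimal sketch that chains a single inference rule (modus ponens) and reports how many rounds are needed to reach a conclusion. The dataset itself covers far more rules and three logic types; the function name, rule encoding and example propositions are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

Rule = Tuple[str, str]  # (premise, conclusion), read as "premise implies conclusion"


def derivation_depth(
    facts: Iterable[str], rules: List[Rule], goal: str, max_depth: int = 5
) -> Optional[int]:
    """Return how many rounds of modus ponens derive `goal`, or None if unreachable."""
    known = set(facts)
    if goal in known:
        return 0
    for depth in range(1, max_depth + 1):
        # One round: fire every rule whose premise is already known.
        new = {concl for prem, concl in rules if prem in known and concl not in known}
        if not new:
            return None  # nothing more can be derived
        known |= new
        if goal in known:
            return depth
    return None


rules = [("p", "q"), ("q", "r"), ("r", "s")]
print(derivation_depth({"p"}, rules, "q"))  # → 1
print(derivation_depth({"p"}, rules, "s"))  # → 3
print(derivation_depth({"p"}, rules, "z"))  # → None
```

A symbolic procedure like this is equally reliable at depth 1 and depth 5, which is the contrast with the accuracy drop the abstract reports for LLMs.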

Comments:	23 Pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.17169
	arXiv:2406.17169v1
	[2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

https://arxiv.org/pdf/2406.17169

PoorAndDangerous · Jul 13, 2024

anyone who actually uses them on a professional level sees how poor their reasoning skills are when things get complex, but these are the first versions. They will be so insanely accurate 5 years from now that this will not even be relevant

bnew · Jul 13, 2024

when discussing LLM capabilities, a lot of people default to chatgpt so I thought I'd share a site that has over a dozen models you can test to see what they are capable of.

https://chat.lmsys.org

bnew · Jul 13, 2024

[2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
[Submitted on 23 May 2024 (v1), last revised 27 May 2024 (this version, v2)]

Grokked Transformers are Implicit Reasoners - A Mechanistic Journey to the Edge of Generalization

Boshi Wang, Xiang Yue, Yu Su, Huan Sun

Abstract:
We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.
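The two reasoning types in the abstract, composition and comparison, can be pictured with explicit lookups like the sketch below. This only illustrates the task structure: the paper asks whether a transformer can do these hops implicitly inside its parameters, whereas the code spells out the intermediate step. All entity names, relations and values are made up.

```python
from typing import Dict, Optional, Tuple

Facts = Dict[Tuple[str, str], str]  # (entity, relation) -> entity


def compose(facts: Facts, head: str, r1: str, r2: str) -> Optional[str]:
    """Two-hop composition: follow r1 from head to a bridge entity, then r2."""
    bridge = facts.get((head, r1))
    if bridge is None:
        return None
    return facts.get((bridge, r2))


def compare(attributes: Dict[str, int], a: str, b: str) -> str:
    """Comparison: report which entity has the larger attribute value."""
    va, vb = attributes[a], attributes[b]
    if va == vb:
        return "equal"
    return a if va > vb else b


# Hypothetical atomic facts; the inferred facts are never stored directly.
facts: Facts = {
    ("Alice", "mother"): "Beth",
    ("Beth", "birthplace"): "Oslo",
}
ages = {"Alice": 30, "Bob": 25}

print(compose(facts, "Alice", "mother", "birthplace"))  # → Oslo
print(compare(ages, "Alice", "Bob"))                    # → Alice
```

The out-of-distribution question in the abstract is roughly whether a model can still answer when the particular atomic facts involved were never seen combined in any training example.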

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.15071
	arXiv:2405.15071v2
	[2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

https://arxiv.org/pdf/2405.15071

A.I Generated explanation:

Title: Grokked Transformers are Implicit Reasoners - A Mechanistic Journey to the Edge of Generalization

This is a research paper about artificial intelligence (AI) and how it can be improved to make it smarter.

Authors: Boshi Wang, Xiang Yue, Yu Su, and Huan Sun

These are the people who wrote the paper.

Abstract:

The paper is about whether a type of AI called transformers can learn to "reason" like humans do. Reasoning means making connections between different pieces of information and using them to make decisions or solve problems. The researchers found that transformers can learn to reason, but only if they are trained for a very long time. They also found that the way the transformers reason is different depending on the type of problem they are trying to solve.

What they did:

The researchers trained transformers to solve two types of problems: composition and comparison. They found that the transformers could learn to solve these problems, but only if they were trained for a very long time. They also looked at how the transformers were solving the problems and found that they were using different "circuits" in their "brain" to do so.

What they found:

The researchers found that the transformers were able to solve the problems, but they didn't always generalize well to new situations. This means that they could solve the problems they were trained on, but they didn't always understand the underlying principles well enough to apply them to new situations. They also found that the way the transformers were trained affected how well they could reason.

What it means:

This research is important because it shows that transformers can be trained to reason like humans do, but it also shows that there are still limitations to how well they can generalize to new situations. The researchers suggest that changing the way transformers are trained and adding new components to their architecture could help them reason better.

Links:

* [2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the link to the paper on the arXiv website.
* GitHub - OSU-NLP-Group/GrokkedTransformer: Code for the paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' - This is the link to the GitHub repository for the project.
* [2405.15071v2] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the link to the updated version of the paper on the arXiv website.
* [2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the link to the DOI (digital object identifier) for the paper.

bnew · Jul 13, 2024

https://www.reuters.com/technology/artificial-intelligence/openai-working-new-reasoning-technology-under-code-name-strawberry-2024-07-12/

Exclusive: OpenAI working on new reasoning technology under code name ‘Strawberry’

By Anna Tong and Katie Paul

July 12, 2024, 5:23 PM EDT

OpenAI logo is seen in this illustration taken May 20, 2024. REUTERS/Dado Ruvic/Illustration/File Photo

July 12 - ChatGPT maker OpenAI is working on a novel approach to its artificial intelligence models in a project code-named “Strawberry,” according to a person familiar with the matter and internal documentation reviewed by Reuters.

The project, details of which have not been previously reported, comes as the Microsoft-backed startup races to show that the types of models it offers are capable of delivering advanced reasoning capabilities.

Teams inside OpenAI are working on Strawberry, according to a copy of a recent internal OpenAI document seen by Reuters in May. Reuters could not ascertain the precise date of the document, which details a plan for how OpenAI intends to use Strawberry to perform research. The source described the plan to Reuters as a work in progress. The news agency could not establish how close Strawberry is to being publicly available.

How Strawberry works is a tightly kept secret even within OpenAI, the person said.

The document describes a project that uses Strawberry models with the aim of enabling the company’s AI to not just generate answers to queries but to plan ahead enough to navigate the internet autonomously and reliably to perform what OpenAI terms “deep research,” according to the source.

This is something that has eluded AI models to date, according to interviews with more than a dozen AI researchers.

Asked about Strawberry and the details reported in this story, an OpenAI company spokesperson said in a statement: “We want our AI models to see and understand the world more like we do. Continuous research into new AI capabilities is a common practice in the industry, with a shared belief that these systems will improve in reasoning over time.”

The spokesperson did not directly address questions about Strawberry.

The Strawberry project was formerly known as Q*, which Reuters reported last year was already seen inside the company as a breakthrough.

Two sources described viewing earlier this year what OpenAI staffers told them were Q* demos, capable of answering tricky science and math questions out of reach of today’s commercially-available models.

On Tuesday at an internal all-hands meeting, OpenAI showed a demo of a research project that it claimed had new human-like reasoning skills, according to Bloomberg. An OpenAI spokesperson confirmed the meeting but declined to give details of the contents. Reuters could not determine if the project demonstrated was Strawberry.

OpenAI hopes the innovation will improve its AI models’ reasoning capabilities dramatically, the person familiar with it said, adding that Strawberry involves a specialized way of processing an AI model after it has been pre-trained on very large datasets.

Researchers Reuters interviewed say that reasoning is key to AI achieving human or super-human-level intelligence.

While large language models can already summarize dense texts and compose elegant prose far more quickly than any human, the technology often falls short on common sense problems whose solutions seem intuitive to people, like recognizing logical fallacies and playing tic-tac-toe. When the model encounters these kinds of problems, it often “hallucinates” bogus information.

AI researchers interviewed by Reuters generally agree that reasoning, in the context of AI, involves the formation of a model that enables AI to plan ahead, reflect how the physical world functions, and work through challenging multi-step problems reliably.

Improving reasoning in AI models is seen as the key to unlocking the ability for the models to do everything from making major scientific discoveries to planning and building new software applications.

OpenAI CEO Sam Altman said earlier this year that in AI “the most important areas of progress will be around reasoning ability.”

Other companies like Google, Meta and Microsoft are likewise experimenting with different techniques to improve reasoning in AI models, as are most academic labs that perform AI research. Researchers differ, however, on whether large language models (LLMs) are capable of incorporating ideas and long-term planning into how they do prediction. For instance, one of the pioneers of modern AI, Yann LeCun, who works at Meta, has frequently said that LLMs are not capable of humanlike reasoning.

AI CHALLENGES

Strawberry is a key component of OpenAI’s plan to overcome those challenges, the source familiar with the matter said. The document seen by Reuters described what Strawberry aims to enable, but not how.

In recent months, the company has privately been signaling to developers and other outside parties that it is on the cusp of releasing technology with significantly more advanced reasoning capabilities, according to four people who have heard the company’s pitches. They declined to be identified because they are not authorized to speak about private matters.

Strawberry includes a specialized way of what is known as “post-training” OpenAI’s generative AI models, or adapting the base models to hone their performance in specific ways after they have already been “trained” on reams of generalized data, one of the sources said.

The post-training phase of developing a model involves methods like “fine-tuning,” a process used on nearly all language models today that comes in many flavors, such as having humans give feedback to the model based on its responses and feeding it examples of good and bad answers.

Strawberry has similarities to a method developed at Stanford in 2022 called “Self-Taught Reasoner” or “STaR”, one of the sources with knowledge of the matter said. STaR enables AI models to “bootstrap” themselves into higher intelligence levels via iteratively creating their own training data, and in theory could be used to get language models to transcend human-level intelligence, one of its creators, Stanford professor Noah Goodman, told Reuters.

“I think that is both exciting and terrifying…if things keep going in that direction we have some serious things to think about as humans,” Goodman said. Goodman is not affiliated with OpenAI and is not familiar with Strawberry.

Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.

To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. Reuters was unable to determine what is in that dataset or how long an extended period would mean.

OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.

Reporting by Anna Tong in San Francisco and Katie Paul in New York; editing by Ken Li and Claudia Parsons

bnew · Jul 13, 2024

LLM Generality is a Timeline Crux — LessWrong

Short Summary LLMs may be fundamentally incapable of fully general reasoning, and if so, short timelines are less plausible. …

www.lesswrong.com

LLM Generality is a Timeline Crux

by eggsyntax

24th Jun 2024

Short Summary

LLMs may be fundamentally incapable of fully general reasoning, and if so, short timelines are less plausible.

Longer summary

There is ML research suggesting that LLMs fail badly on attempts at general reasoning, such as planning problems, scheduling, and attempts to solve novel visual puzzles. This post provides a brief introduction to that research, and asks:

- Whether this limitation is illusory or actually exists.
- If it exists, whether it will be solved by scaling or is a problem fundamental to LLMs.
- If fundamental, whether it can be overcome by scaffolding & tooling.

If this is a real and fundamental limitation that can't be fully overcome by scaffolding, we should be skeptical of arguments like Leopold Aschenbrenner's (in his recent 'Situational Awareness') that we can just 'follow straight lines on graphs' and expect AGI in the next few years.

Introduction

Leopold Aschenbrenner's recent 'Situational Awareness' document has gotten considerable attention in the safety & alignment community. Aschenbrenner argues that we should expect current systems to reach human-level given further scaling, and that it's 'strikingly plausible' that we'll see 'drop-in remote workers' capable of doing the work of an AI researcher or engineer by 2027. Others hold similar views.

Francois Chollet and Mike Knoop's new $500,000 prize for beating the ARC benchmark has also gotten considerable recent attention in AIS. Chollet holds a diametrically opposed view: that the current LLM approach is fundamentally incapable of general reasoning, and hence incapable of solving novel problems. We only imagine that LLMs can reason, Chollet argues, because they've seen such a vast wealth of problems that they can pattern-match against. But LLMs, even if scaled much further, will never be able to do the work of AI researchers.

It would be quite valuable to have a thorough analysis of this question through the lens of AI safety and alignment. This post is not that, nor is it a review of the voluminous literature on this debate (from outside the AIS community). It attempts to briefly introduce the disagreement, some evidence on each side, and the impact on timelines.

What is general reasoning?

Part of what makes this issue contentious is that there's not a widely shared definition of 'general reasoning', and in fact various discussions of this use various terms. By 'general reasoning', I mean to capture two things. First, the ability to think carefully and precisely, step by step. Second, the ability to apply that sort of thinking in novel situations.

Terminology is inconsistent between authors on this subject; some call this 'system II thinking'; some 'reasoning'; some 'planning' (mainly for the first half of the definition); Chollet just talks about 'intelligence' (mainly for the second half).

This issue is further complicated by the fact that humans aren't fully general reasoners without tool support either. For example, seven-dimensional tic-tac-toe is a simple and easily defined system, but incredibly difficult for humans to play mentally without extensive training and/or tool support. Generalizations that are in-distribution for humans seem like something that any system should be able to do; generalizations that are out-of-distribution for humans don't feel as though they ought to count.

How general are LLMs?

It's important to clarify that this is very much a matter of degree. Nearly everyone was surprised by the degree to which the last generation of state-of-the-art LLMs like GPT-3 generalized; for example, no one I know of predicted that LLMs trained on primarily English-language sources would be able to do translation between languages. Some in the field argued as recently as 2020 that no pure LLM would ever be able to correctly complete "Three plus five equals". The question is how general they are.

Certainly state-of-the-art LLMs do an enormous number of tasks that, from a user perspective, count as general reasoning. They can handle plenty of mathematical and scientific problems; they can write decent code; they can certainly hold coherent conversations; they can answer many counterfactual questions; they even predict Supreme Court decisions pretty well. What are we even talking about when we question how general they are?

The surprising thing we find when we look carefully is that they fail pretty badly when we ask them to do certain sorts of reasoning tasks, such as planning problems, that would be fairly straightforward for humans. If in fact they were capable of general reasoning, we wouldn't expect these sorts of problems to present a challenge. Therefore it may be that all their apparent successes at reasoning tasks are in fact simple extensions of examples they've seen in their truly vast corpus of training data. It's hard to internalize just how many examples they've actually seen; one way to think about it is that they've absorbed nearly all of human knowledge.

The weakman version of this argument is the Stochastic Parrot claim, that LLMs are executing relatively shallow statistical inference on an extremely complex training distribution, ie that they're "a blurry JPEG of the web" (Ted Chiang). This view seems obviously false at this point (given that, for example, LLMs appear to build world models), but assuming that LLMs are fully general may be an overcorrection.

Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we're talking about what they can do under typical auto-regressive conditions like chat.

bnew · Jul 13, 2024

Evidence for generality

I take this to be most people's default view, and won't spend much time making the case. GPT-4 and Claude 3 Opus seem obviously to be capable of general reasoning. You can find places where they hallucinate, but it's relatively hard to find cases in most people's day-to-day use where their reasoning is just wrong. But if you want to see the case made explicitly, see for example "Sparks of AGI" (from Microsoft, on GPT-4) or recent models' performance on benchmarks like MATH which are intended to judge reasoning ability.

Further, there's been a recurring pattern (eg in much of Gary Marcus's writing) of people claiming that LLMs can never do X, only to be promptly proven wrong when the next version comes out. By default we should probably be skeptical of such claims.

One other thing worth noting is that we know from 'The Expressive Power of Transformers with Chain of Thought' that the transformer architecture is capable of general reasoning under autoregressive conditions. That doesn't mean LLMs trained on next-token prediction learn general reasoning, but it means that we can't just rule it out as impossible.

Evidence against generality

The literature here is quite extensive, and I haven't reviewed it all. Here are three examples that I personally find most compelling. For a broader and deeper review, see "A Survey of Reasoning with Foundation Models".

Block world

All LLMs to date fail rather badly at classic problems of rearranging colored blocks. We do see improvement with scale here, but if these problems are obfuscated, performance of even the biggest LLMs drops to almost nothing.

Scheduling

LLMs currently do badly at planning trips or scheduling meetings between people with availability constraints [a commenter points out that this paper has quite a few errors, so it should likely be treated with skepticism].

ARC-AGI

Current LLMs do quite badly on the ARC visual puzzles, which are reasonably easy for smart humans.

Will scaling solve this problem?

The evidence on this is somewhat mixed. Evidence that it will includes LLMs doing better on many of these tasks as they scale. The strongest evidence that it won't is that LLMs still fail miserably on block world problems once you obfuscate the problems (to eliminate the possibility that larger LLMs only do better because they have a larger set of examples to draw from).

One argument made by Sholto Douglas and Trenton Bricken (in a discussion with Dwarkesh Patel) is that this is a simple matter of reliability -- given a 5% failure rate, an AI will most often fail to successfully execute a task that requires 15 correct steps. If that's the case, we have every reason to believe that further scaling will solve the problem.
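The arithmetic behind that reliability argument can be checked directly. This is a minimal sketch, assuming every step fails independently and with the same probability (an assumption the argument leaves implicit); the function name is illustrative only.

```python
def task_success_probability(step_failure_rate: float, num_steps: int) -> float:
    """Probability that all num_steps independent steps succeed."""
    return (1.0 - step_failure_rate) ** num_steps


# The case from the text: a 5% per-step failure rate over 15 steps.
p_success = task_success_probability(0.05, 15)
print(f"success: {p_success:.3f}, failure: {1 - p_success:.3f}")
# prints "success: 0.463, failure: 0.537"
```

Under the same assumption, cutting the per-step failure rate to 1% raises the 15-step success probability to roughly 0.86, which is why modest reliability gains from scaling could matter a great deal for long tasks.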

Will scaffolding or tooling solve this problem?

This is another open question. It seems natural to expect that LLMs could be used as part of scaffolded systems that include other tools optimized for handling general reasoning (eg classic planners like STRIPS), or LLMs can be given access to tools (eg code sandboxes) that they can use to overcome these problems. Ryan Greenblatt's new work on getting very good results on ARC with GPT-4o + a Python interpreter provides some evidence for this.

On the other hand, a year ago many expected scaffolds like AutoGPT and BabyAGI to result in effective LLM-based agents, and many startups have been pushing in that direction; so far results have been underwhelming. Difficulty with planning and novelty seems like the most plausible explanation.

Even if tooling is sufficient to overcome this problem, outcomes depend heavily on the level of integration and performance. Currently for an LLM to make use of a tool, it has to use a substantial number of forward passes to describe the call to the tool, wait for the tool to execute, and then parse the response. If this remains true, then it puts substantial constraints on how heavily LLMs can rely on tools without being too slow to be useful. If, on the other hand, such tools can be more deeply integrated, this may no longer apply. Of course, even if it's slow there are some problems where it's worth spending a large amount of time, eg novel research. But it does seem like the path ahead looks somewhat different if system II thinking remains necessarily slow & external.

Why does this matter?

The main reason that this is important from a safety perspective is that it seems likely to significantly impact timelines. If LLMs are fundamentally incapable of certain kinds of reasoning, and scale won't solve this (at least in the next couple of orders of magnitude), and scaffolding doesn't adequately work around it, then we're at least one significant breakthrough away from dangerous AGI -- it's pretty hard to imagine an AI system executing a coup if it can't successfully schedule a meeting with several of its co-conspirator instances.

If, on the other hand, there is no fundamental blocker to LLMs being able to do general reasoning, then Aschenbrenner's argument starts to be much more plausible, that another couple of orders of magnitude can get us to the drop-in AI researcher, and once that happens, further progress seems likely to move very fast indeed.

So this is an area worth keeping a close eye on. I think that progress on the ARC prize will tell us a lot, now that there's half a million dollars motivating people to try for it. I also think the next generation of frontier LLMs will be highly informative -- it's plausible that GPT-4 is just on the edge of being able to effectively do multi-step general reasoning, and if so we should expect GPT-5 to be substantially better at it (whereas if GPT-5 doesn't show much improvement in this area, arguments like Chollet's and Kambhampati's are strengthened).

OK, but what do you think?

I genuinely don't know! It's one of the most interesting and important open questions about the current state of AI. My best guesses are:

LLMs continue to do better at block world and ARC as they scale: 75%
LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan's: 10%
Scaffolding & tools help a lot, so that the next gen (GPT-5, Claude 4) + Python + a for loop can reach the grand prize mark: 60%
Same but for the gen after that (GPT-6, Claude 5): 75%
The current architecture, including scaffolding & tools, continues to improve to the point of being able to do original AI research: 65%, with high uncertainty

bnew · Jul 16, 2024

[2407.04153] Mixture of A Million Experts
[Submitted on 4 Jul 2024]

Mixture of A Million Experts

Xu Owen He

Abstract:
The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
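The abstract names the product key technique without describing it. The sketch below illustrates the general product-key idea under stated assumptions, not the paper's implementation: each expert key is taken to be the concatenation of one sub-key from each of two small sets, so an expert's score splits into two independent halves and the top-k search never has to score the full grid. All names, sizes and the random data are hypothetical.

```python
import heapq
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Return (score, index) pairs for the k largest scores, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def product_key_top_k(query, sub_keys_a, sub_keys_b, k):
    """Top-k experts over an implicit grid of len(a) * len(b) keys.

    Expert (i, j) is assumed to have key concat(sub_keys_a[i], sub_keys_b[j]),
    so its score is dot(q_a, a_i) + dot(q_b, b_j).
    """
    half = len(query) // 2
    q_a, q_b = query[:half], query[half:]
    best_a = top_k([dot(q_a, key) for key in sub_keys_a], k)
    best_b = top_k([dot(q_b, key) for key in sub_keys_b], k)
    # Only k * k candidate pairs are scored, never the full grid.
    candidates = [
        (score_a + score_b, i * len(sub_keys_b) + j)
        for score_a, i in best_a
        for score_b, j in best_b
    ]
    return heapq.nlargest(k, candidates)


# Hypothetical sizes: 32 sub-keys per side give 32 * 32 = 1024 experts.
rng = random.Random(0)
dim, n_sub, k = 8, 32, 4
sub_keys_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_sub)]
sub_keys_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_sub)]
query = [rng.gauss(0, 1) for _ in range(2 * dim)]

selected = product_key_top_k(query, sub_keys_a, sub_keys_b, k)
print([index for _, index in selected])
```

With these sizes the search computes 64 sub-key scores plus 16 candidate sums instead of 1024 full scores. Because the cost grows with the square root of the number of experts, the same decomposition is what makes retrieval from a pool of over a million experts plausible.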

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2407.04153
	arXiv:2407.04153v1


https://arxiv.org/pdf/2407.04153

AI-generated explanation (Gemma2-27B-it):

This is a good start! Here's a slightly more detailed and technically accurate explanation:

Title: **Efficiently using a large number of experts in a Transformer model**

This paper describes a new technique for making Transformer models more efficient.

Here's the problem in more detail:

* Transformers are powerful AI models, especially for understanding and generating language. They do this by analyzing the relationships between words in a sentence, no matter where they are in the sentence. This is great for understanding complex language, but the traditional way of building Transformers means they need a lot of computing power to work well.

* The researchers are trying to address this by creating a more efficient way to use the "experts" (think of them as specialized modules within the model), so that the model can grow in size without a matching increase in computational cost.

The problem the researchers are tackling:

* **Computational cost:**

Training large, powerful AI models is expensive, requiring a lot of computational resources and time.

* **Efficiency:**

The researchers want to make these models more efficient without sacrificing their ability to learn.

The solution they propose:

* PEER (Parameter-Efficient Expert Retrieval) is a method that addresses the scaling problem of Transformers by using a technique called "sparse mixture-of-experts". This means instead of having one large model, they use many smaller models, each specializing in a specific aspect of the language.

* Think of it like this: Imagine you have a team of experts, each with a limited area of expertise. Instead of letting them all work on every problem, which would be inefficient, you only select the experts who are relevant to the task at hand. This is what the researchers are aiming for with this new method.

The key is to use "expert routing"

* The researchers want to use a large number of experts (smaller models) to make the model more powerful, but they also want to make sure the model is still efficient.

* The paper proposes a way to make the model more powerful by increasing the number of experts, but it's not clear from this excerpt how they achieve that efficiency.

Possible ways to achieve efficiency in PEER:

* Sparsely activating only a subset of experts:

This means that not all of the experts are used for every input, only the ones that are most relevant.

* Efficient routing mechanisms:

The paper likely proposes a specific method for determining which experts are activated for a given input.

* Efficient training techniques:

The excerpt mentions the paper will likely discuss this, but it's not clear what specific techniques are used.

To understand the full solution, you'd need to read the paper, but the key takeaway is that they're proposing a way to improve the efficiency of AI models by making them more modular and scalable.

The Mixture of a Million Experts paper is a straight banger.

Reduces inference cost and memory usage, scales to millions of experts, oh and just happens to overcome catastrophic forgetting and enable lifelong learning for the model.

Previous MoE models never got past 10k experts and they had a static router to connect them up that was inefficient, but this includes a learned router that can handle millions of micro experts. Reminds me a bit of how the neocortex works because it is composed of about 2 million cortical columns that can each learn a model of the world and then work together to form a collective picture of reality.

Catastrophic forgetting and continual learning are two of the most important and nasty problems with current architectures and this approach just potentially wiped out both in one shot.

There have been other approaches to try to enable continual learning and overcome catastrophic forgetting, like bi-level continual learning or progress and compress, that use elastic weight consolidation, knowledge distillation and two models, a big neural net and a small learning net. The small net learns and over time the learnings are passed back to the big net. Its weights are partially frozen and consolidated as the new knowledge is brought in. Good ideas, also out of DeepMind robotics teams.

But this paper seems to say you can just add in new mini experts, freeze or partially freeze old weights, and just grow the model's understanding as much as you want, without causing it to lose what it already knows.

It's like having LoRAs built right into the model itself.


WIA20XX · Jul 16, 2024

@bnew

Do you think this really just a matter of 1) more data, 2) more compute - or is something more fundamental missing?

bnew · Jul 16, 2024

WIA20XX said:
@bnew

Do you think this really just a matter of 1) more data, 2) more compute - or is something more fundamental missing?

as far as reasoning goes I think it's going to be a combination of a lot of things other than data and compute that will need to be refined. there's a ton of research that has been put out based on smaller language models due to the cost of training and testing these models, so we don't even know how much improvement we'd see with scale, and those techniques have yet to be employed in major models in production use today.
