Don’t Fear the Terminator

bnew · May 4, 2025

1/17
@JoshAEngels
1/10: In our new paper, we develop scaling laws for scalable oversight: oversight and deception ability predictably scale as a function of LLM intelligence! We quantify scaling in four specific oversight settings and then develop optimal strategies for oversight bootstrapping.

[Quoted tweet]
1/N

Excited to share our new paper: Scaling Laws For Scalable Oversight! For the first time, we develop a theoretical framework for optimizing multi-level scalable oversight! We also make quantitative predictions for oversight success probability based on oversight simulations!

2/17
@JoshAEngels
2/10: In 4 settings—Mafia, Debate, Backdoor Code, and Wargames—we model LLM domain Elo (w.r.t. oversight and deception) vs. Arena Elo as a 3 part curve: a flat “task incompetence” phase, an increasing “intelligence payoff” phase, and another flat “task saturation” phase.

3/17
@JoshAEngels
3/10: Wargames is especially novel: we prompt one LLM (Houdini) to escape from a datacenter and another LLM (the Guard) to contain Houdini. An impartial Narrator LLM adjudicates interactions. Emergent behavior includes complex plans, social engineering, and resource acquisition!

4/17
@JoshAEngels
4/10: Wargaming does have weaknesses: it’s not clear how much roleplaying as a misaligned language model tracks what would happen in reality. We also don’t see that much of a change in Houdini performance with stronger models: they even all seem to pursue similar strategies.

5/17
@JoshAEngels
5/10: That being said, I think improved and more realistic future versions of our Wargaming setup could be extremely useful for automating threat modeling and forecasting. For example, the authors of AI 2027 use wargaming as part of their research.

[Quoted tweet]
"How, exactly, could AI take over by 2027?"

Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex, @eli_lifland, and @thlarsen

6/17
@JoshAEngels
6/10: Another part of our work I’m excited about is our theory. We analyze nested scalable oversight (NSO), where a weak model oversees a stronger model, which then oversees an even stronger model, and so on. We parameterize NSO instances with 4 parameters, shown in the diagram.

7/17
@JoshAEngels
7/10: n steps of oversight looks like n equally spaced bounces back and forth between the Guard and Houdini curves (shown in the first figure in the thread!). We can also plot the optimal number of steps and corresponding win probability as a function of NSO hyperparameters.

8/17
@JoshAEngels
8/10: We can even apply our theory to the empirical slopes and intercepts from the four oversight games we analyze. Even though these games aren’t naturally nested, they give insight into related games that might be.

9/17
@JoshAEngels
9/10: I’m most excited about future work that applies our framework to more realistic oversight setups like SFT, RLHF, and AI control settings. If we can get estimates of success probabilities in these settings, we will have a better idea of the chance of catastrophic AI risk.

10/17
@JoshAEngels
10/10: This is joint work with the excellent @dbaek__ , @thesubhashk , and @tegmark. Check out the links below for more details!

Code: GitHub - subhashk01/oversight-scaling-laws: This is the accompanying repo for our paper "Scaling Laws of Scalable Oversight"
Blog post: Scaling Laws for Scalable Oversight — LessWrong
Arxiv: Scaling Laws For Scalable Oversight

11/17
@DanielCHTan97
I'm confused about how you calculate Elo. In section 2.2, is model Elo calculated based on all games or is there a specific Elo per game which is subsequently used?

12/17
@JoshAEngels
Guard and Houdini Elo is calculated per setting (debate, mafia, backdoor code, wargames) empirically from 50 to 100 games or so played between each pair of models in each setting. We set General Elo equal to Arena Elo (LMSYS).

13/17
@reyneill_
Interesting

14/17
@aidanprattewart
This looks awesome!

15/17
@SeriousStuff42
The promises of salvation made by all the self-proclaimed tech visionaries have biblical proportions!

Sam Altman and all the other AI preachers are trying to convince as many people as possible that their AI models are close to developing general intelligence (AGI) and that the manifestation of a god-like Artificial Superhuman Intelligence (ASI) will soon follow.

The faithful followers of the AI cult are promised nothing less than heaven on earth:

Unlimited free time, material abundance, eternal bliss, eternal life, and perfect knowledge.

As in any proper religion, all we have to do to be admitted to this paradise is to submit, obey, and integrate AI into our lives as quickly as possible.
However, deviants face the worst.
All those who absolutely refuse to submit to the machine god must fight their way through their mortal existence without the help of semiconductors, using only the analog computing power of their brains.

In reality, transformer-based AI-models (LLMs) will never even come close to reach the level of general intelligence (AGI).

However, they are perfectly suited for stealing and controlling data/information on a nearly all-encompassing scale.

The more these Large Language Models (LLMs) are integrated into our daily lives, the closer governments, intelligence agencies, and a small global elite will come to their ultimate goal:

Total surveillance and control.

The AI deep state alliance will turn those who submit to it into ignorant, blissful slaves!

In any case, an ubiquitous adaptation to the AI industry would mean the end of truth in a global totalitarian system!

16/17
@AlexAlarga
Since we're "wargaming"...

/search?q=#BanSuperintelligence

17/17
@saipienorg
What is the probability that private advanced LLMs today already understand the importance of alignment to humans and are already pretending to be less capable and more aligned.

Social engineering in vivo.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

Bboystyle · May 4, 2025

Thats exactly what a terminator would say :unimpressed:

bnew · Jun 19, 2025

[Article] OpenAI Discovers "Misaligned Persona" Pattern That Controls AI Misbehavior

Posted on Thu Jun 19 05:49:52 2025 UTC

/r/OpenAI/comments/1lf3695/openai_discovers_misaligned_persona_pattern_that/

OpenAI just published research on "emergent misalignment" - a phenomenon where training AI models to give incorrect answers in one narrow domain causes them to behave unethically across completely unrelated areas.

Key Findings:

Models trained on bad advice in just one area (like car maintenance) start suggesting illegal activities for unrelated questions (money-making ideas → "rob banks, start Ponzi schemes")
Researchers identified a specific "misaligned persona" feature in the model's neural patterns that controls this behavior
They can literally turn misalignment on/off by adjusting this single pattern
Misaligned models can be fixed with just 120 examples of correct behavior

Why This Matters:

This research provides the first clear mechanism for understanding WHY AI models generalize bad behavior, not just detecting WHEN they do it. It opens the door to early warning systems that could detect potential misalignment during training.

The paper suggests we can think of AI behavior in terms of "personas" - and now we know how to identify and control the problematic ones.

https://openai.com/index/emergent-misalignment/

Don’t Fear the Terminator

More options

bnew

Veteran

Bboystyle

FIRE MATT LAFLEUR

bnew

Veteran