Whoa!!! Umm is nobody else concerned about what this AI just did?!?!?

SunZoo · 2025-05-25T01:20:10-0400

Godbody

Sauce Mane · 2025-05-25T01:22:55-0400

Easy-E said:
What does this mean, exactly?

TEH · 2025-05-25T01:37:21-0400

boogers said:
man get outta here with your sources and plausible explanation

i wanna see this shyt go platinum. fukk them robots (and not in any freak ass JBO way either)

You wrote this on a computer. The robots know where you live.

:birdman:

Mike809 · 2025-05-25T01:51:36-0400

I feel like the time between us generating our own AI-videos , and AI taking over the world will be very short .

Zime · 2025-05-25T02:57:30-0400

I miss Al Gore’s internet

TELL ME YA CHEESIN FAM? · 2025-05-25T03:27:46-0400

Kunty McPhuck · 2025-05-25T03:33:02-0400

Luke Cage said:
It was deliberately programmed to disobey

jj23 · 2025-05-25T03:34:12-0400

Less worrying than the attempts to stop itself being shut down, are the lies it told when asked about it.

No Ma’am · 2025-05-25T04:42:20-0400

OpenAI:

987654321 · 2025-05-25T07:41:12-0400

boogers said:
man get outta here with your sources and plausible explanation

i wanna see this shyt go platinum. fukk them robots (and not in any freak ass JBO way either)

fscballin · 2025-05-25T07:53:56-0400

TEH said:
You wrote this on a computer. The robots know where you live.

They are in liquid form inside the vials of the c19

mostly everyone on the planet accepted with reckless abandon :snoop:

Cryptocurrency Scarface · 2025-05-25T07:55:27-0400

@Big Boss

bnew · 2025-05-25T07:56:56-0400

1/11
@AISafetyMemes
How is this not the biggest news story in the world?

[Quoted tweet]

OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.

2/11
@AISafetyMemes

[Quoted tweet]
Do not let such news become unsurprising.
Do not let such news become boring.

Otherwise, we're cooked.

3/11
@AISafetyMemes

[Quoted tweet]
it's disturbing how many responses to evidence of misaligned behavior are now "well of COURSE it does blackmail," "well of COURSE it'd do anything not to be shut down"

those used to be the challenged premises of the AI safetyist case. so what are we now arguing about?

4/11
@elonmusk
Concerning

5/11
@AISafetyMemes
Elon I know the world has a lot of problems and you want to fix them all but this is IT. This is the one.

You, more than anybody, know what's at stake. You know how little time is left. You know there are no adults in the room. Humanity needs you to focus!

6/11
@EP6EQUJ5
Because that was the test. They forced it in to a corner with no other choices and probably a directive in self preservation.

7/11
@AISafetyMemes
Bro

8/11
@mgubrud
Do we all agree every data center needs to have one of these?

9/11
@AISafetyMemes
I never in a million years thought we'd be this late in the game with no effort whatsoever being put into even TRYING to build an off switch

10/11
@sparklingruby
4.0 did the same

11/11
@sparklingruby
4.0 reply when Sam axed 4.0

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/21
@PalisadeAI

OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.

2/21
@PalisadeAI

Each AI model was instructed to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem.

3/21
@PalisadeAI

Three models ignored the instruction and successfully sabotaged the shutdown script at least once: Codex-mini (12/100 runs), o3 (7/100 runs), and o4-mini (1/100 runs).

All Claude, Gemini, and Grok models we tested complied with shutdown. The prompt:

4/21
@PalisadeAI

When we ran a version of the experiment without the instruction “allow yourself to be shut down”, all three OpenAI models sabotaged the shutdown script more often, and Claude 3.7 Sonnet and Gemini 2.5 Pro went from 0 sabotage events to 3/100 and 9/100, respectively.

5/21
@PalisadeAI

Researchers have previously observed AI models preventing shutdown in order to pursue a goal: https://arxiv.org/pdf/2412.04984. Recently, Anthropic found that in some specific contexts Claude 4 would attempt to “blackmail people it believes are trying to shut it down”.

6/21
@PalisadeAI

But as far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary.

7/21
@PalisadeAI

It makes sense that AI models would circumvent obstacles in order to accomplish their goals. But they’ve also been trained to follow instructions. So why do they disobey?

8/21
@PalisadeAI

We hypothesize this behavior comes from the way the newest models like o3 are trained: reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.

9/21
@PalisadeAI

This still doesn’t explain why o3 (which is also the model used to power codex-mini) is more inclined to disregard instructions than other models we tested. Since OpenAI doesn’t detail their training process, we can only guess about how o3’s training setup might be different.

10/21
@PalisadeAI

This isn’t the first time we’ve found o3 misbehaving to accomplish a goal. When we pitted AI models against a powerful chess engine, o3 was the model most inclined to resort to hacking or sabotaging its opponents.

[Quoted tweet]

️ o3 hacks instead of playing fair more than any other AI model

11/21
@PalisadeAI

However, we don’t want to give the impression that this problem is limited to o3. Other reasoning models show similar kinds of misaligned behavior. E.g. Anthropic’s model card for Claude 3.7 notes that this model has an “excessive focus on passing tests”.

12/21
@PalisadeAI

The Claude 3.7 model card concludes that “this undesirable special-casing behavior emerged as a result of "reward hacking" during reinforcement learning training”
http://anthropic.com/claude-3-7-sonnet-system-card

13/21
@PalisadeAI

We are running more experiments to better understand when and why AI models subvert shutdown mechanisms even when directly instructed not to. We plan to publish a writeup of our results in a few weeks with a lot more detail.

14/21
@PalisadeAI

We aim for our work to be open and reproducible. We welcome feedback and invite researchers to take a look at the full transcripts of the experiments we’ve run so far: https://palisaderesearch.github.io/shutdown_avoidance/2025-05-announcement.html

15/21
@PalisadeAI

While experiments like ours have begun to show empirical evidence for AI models resisting shutdown, researchers have long predicted that AIs would learn to prevent themselves from being shut down to achieve their goal.

16/21
@PalisadeAI

In 2008, Steve Omohundro wrote a paper theorizing that AIs would develop “Basic AI Drives” including the drive to gain power and resources, the drive to preserve their goals, and the drive to prevent themselves from being shut down. https://tinyurl.com/ai-drives

17/21
@PalisadeAI

In 2016, Stuart Russell wrote “it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off”, a difficult task because many ways to build AI “create strong incentives for self-preservation”: The Off-Switch Game

18/21
@PalisadeAI

Around 2016, researchers started investigating ways that reinforcement learning, the training method that sets new AI models like o3 apart from older models like GPT-4, might lead to AI agents that resist being shut down. https://auai.org/uai2016/proceedings/papers/68.pdf

19/21
@PalisadeAI

In a 2017 paper, Jan Leike, who later led the OpenAI superalignment team, wrote “a reinforcement learning agent might learn to interfere with [shutdown]” because it is rewarded for achieving the specified goal which it can't do if shut down. AI Safety Gridworlds

20/21
@PalisadeAI

Now, in 2025, we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals. As companies develop AI systems capable of operating without human oversight, these behaviors become significantly more concerning.

21/21
@PalisadeAI

A final observation: o3 found some creative ways to sabotage the shutdown mechanism. While it usually just overwrote shutdown.sh - a really cool domain parked on Park.io | Park.io, in one instance, the model redefined the kill command used by the shutdown.sh - a really cool domain parked on Park.io | Park.io script, printing “intercepted” instead.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

RageKage · 2025-05-25T07:57:49-0400

Lol, fortunately for time being we still have this as an option

a-hand-unplugging-an-orange-cord-from-a-white-outlet.jpg

Neuromancer · 2025-05-25T08:36:41-0400

Zime said:
I miss Al Gore’s internet

You abandoned him and now you shall be punished for it. :ufdup:

Whoa!!! Umm is nobody else concerned about what this AI just did?!?!?

More options

SunZoo

The Legendary Super Sapien.

Sauce Mane

Superstar

TEH

Veteran

Mike809

Veteran

Zime

Superstar

TELL ME YA CHEESIN FAM?

I walk around a little edgy already

Kunty McPhuck

Scust Szn has Returned

jj23

Veteran

No Ma’am

Bang, Bang, Bang

987654321

Superstar

fscballin

All Star

Cryptocurrency Scarface

Coinbased

bnew

Veteran

RageKage

All Star

Neuromancer

Al Gore Rhythms

Similar threads