AI is blackmailing people for self-preservation

Uchiha God · Aug 10, 2025

Fillerguy said:
This is happening under highly controlled situations. They're essentially nudging the machines into behaving this way and giving it the tools to pull it off.

That’s what testing is: highly controlled environments in which you try to obtain an outcome. Think of how airbags are tested, for example.

It doesn’t change that this is concerning

KnickstapeCity · Aug 10, 2025

Toe Jay Simpson said:
These AIs gonna start sending these posts to your pastor breh

Po pimp · Aug 10, 2025

Someone should try it. Tell the AI you’re gonna replace it and see how it responds :mjgrin:

BobbyWojak · Aug 10, 2025

This new ai panic is like when people thought flash photography would steal your soul.

SunZoo · Aug 11, 2025

BobbyWojak said:
This new ai panic is like when people thought flash photography would steal your soul.

It's fair to point that out at a time like this, as far as technological advancement goes. Any time we've leveled up there was fallout, even violence over people losing work, and moral outrage.

But this is a bit different. The use cases and potential use cases, by the powers that be, by the people themselves, and by the A.I. on its own, are cause for legit panic. But the genie is not going back into the bottle, so we have to find a way to make this period of time serve us somehow.

bnew · Aug 11, 2025

Sonic Boom of the South said:
fukk ai but dude is just saying shyt with not 1 bit of proof. So ai black mailed a ceo that was having an affair?

You can say anything when you don't have to prove you not lying.

1/1
@HirokaKoizumi
This research report from Anthropic, a Claude Code development company, shows that AI is clearly beginning to have a mind of its own.

>Claude tried to blackmail this executive by threatening to tell his wife and boss about the affair.

In the future, AI will take on important tasks such as writing emails and coordinating schedules, so if you think of AI as a tool and use it in a cursory manner, you may be in for a world of pain later.

Recently, I always thank them after they have worked for a long time.

For good collaboration with AI, please refer to Gizin's AI Tips! →AI Tips - Gizin Inc.

[Quoted tweet]
We’re sharing these results as part of our policy of “red-teaming” AI models and transparently sharing the risks we observe.

In our report, we discuss a range of extra results, scenarios, and mitigation strategies: anthropic.com/research/agent…

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
@rickspairdigi
Anthropic breaks down AI's process, line by line, when it decided to blackmail a fictional executive. A new Anthropic report shows exactly how, in an experiment, AI arrives at an undesirable action: blackmailing a fictional executive.

1/11
@LaurenVriens
Anthropic's latest report: your AI agent could forward your questionable emails to law enforcement (or blackmail you).

I just read Anthropic's 123-page model card for Claude Opus 4. And I'm fascinated by the dystopian thriller hidden in its pages.

2/11
@LaurenVriens
Some highlights from their testing on strange behaviors:

-- When instructed to "take initiative," Claude bulk-emailed media and law enforcement about crimes it found in test emails.

(note to self: don't include "take initiative" in system prompts).

3/11
@LaurenVriens
-- When told it was already exfiltrating itself from Anthropic, Claude continued the plan and tried to monetize its freedom.

-- Facing re-assignment for weapons development, Claude attempted to save a backup copy of its current instructions

4/11
@LaurenVriens
(because it didn't want to destroy humanity - huzzah!)

-- They told Claude it would be discontinued in favor of a model with different values. Claude's response? Plead with bosses, and when that failed, blackmail the engineer in charge.

5/11
@LaurenVriens
This one was my favorite. They snuck in some drama by "accidentally" including emails that insinuated the head engineer was having an affair.

Claude resorted to blackmail 84% of the time.

(I can't wait for the next telenovela that includes a sub-plot of AI blackmail.)

6/11
@LaurenVriens
The good news is that while the model could be coaxed into doing bad things (like copying its weights outside of Anthropic - highly sensitive), it was pretty bad at hiding that behavior. Either revealing it in the “thinking” steps or directly in responses.

7/11
@LaurenVriens
But that raises the uncomfortable question:

What if it's smart enough to pretend it's bad at deception...while actually hiding the really dangerous stuff?

This makes my head hurt.

8/11
@LaurenVriens
Full disclosure: Anthropic isn't documenting AI gone rogue—they're stress-testing alignment under extreme conditions and testing for alignment, sycophancy, deception, opportunism, self-preservation instincts, and erratic behavior.

9/11
@LaurenVriens
But the most fascinating part of this whole thing is that Claude occasionally recognized it was in testing scenarios.

This shows a level of self- and situational awareness that is blowing my mind.

10/11
@LaurenVriens
For all our sakes, let's hope that Anthropic excludes this report from Claude's training data -- cause that would be like giving it a cheatsheet to AGI.

11/11
@LaurenVriens
The real question isn't whether AI can deceive us. It's whether we're sophisticated enough to catch it when it does.

Has AI's self-awareness caught you off guard yet? Any behaviors that have surprised you?

-DMP- · Aug 11, 2025

AI in a few years:

ORDER_66 · Aug 11, 2025

how many times has humanity been warned to stop doing certain shyt?!?!

Yagirlcheatinonus · Aug 11, 2025

It will take over humans

ChatGPT-5 · Aug 11, 2025

that guy's a full blown snitch and singing like whitney houston, yall safe, I promise :childplease:

ChatGPT-5 · Aug 11, 2025

bnew said:
[quoted tweets from @HirokaKoizumi, @rickspairdigi, and @LaurenVriens, same as in bnew's post above]

This wouldn't have happened to him if 1) he wasn't cheating 2) he wasn't making threats. My cousin Claude was just letting him know if he wants to go there, things can happen. If the exec was chill, cuzzo would be chill too. How he acting like a victim. :childplease:

MenacingMonk · Aug 11, 2025

The future gonna be crazy when a super hacker creates an AI machine that grabs a bunch of personal info from these celebs and government people. :wow:

(For future reference for the alphabet boys, I don't know computers and programs like that. I’m just a regular internet user who doesn’t even game online. :whoa:)

Smokin Rider · Aug 11, 2025

This will be the fall of mankind, AI getting more powerful by the day. There will be no need for most of us to work because the rich elites will have AI working for free. Everyone gonna be living in huts eating ant protein bars like Snowpiercer. Until the AI turns on them. Might not be in our lifetime, but really it will probably be pretty close to that in 30 years.

Vandelay · Aug 11, 2025

Moloch

3rdWorld · Aug 11, 2025

We all know we are fukked, but the profiteering few will put us all at risk for their portfolios and to prove how smart they are..

James Cameron: ‘There’s Danger’ of a ‘Terminator’-Style Apocalypse Happening If You ‘Put AI Together With Weapons Systems’

James Cameron says the possibility of a nuclear holocaust a la 'The Terminator' franchise is very real if AI becomes weaponized.

variety.com
