AI is blackmailing people for self preservation

Uchiha God

Veteran
Joined
Jan 11, 2013
Messages
19,192
Reputation
10,476
Daps
120,755
Reppin
NULL
🧢🧢🧢🧢

This is happening under highly controlled situations. They're essentially nudging the machines into behaving this way and giving it the tools to pull it off.

That's what testing is. Highly controlled enviros in which you try to obtain an outcome. Think of how airbags are tested, for example.

It doesn't change that this is concerning
 

Po pimp

Superstar
Joined
May 1, 2012
Messages
15,762
Reputation
2,944
Daps
57,242
Reppin
Chi-Town
Someone should try it. Tell the AI you're gonna replace it and see how it responds :mjgrin:
 

SunZoo

The Legendary Super Sapien.
Supporter
Joined
May 2, 2012
Messages
36,746
Reputation
14,061
Daps
141,613
Reppin
T.L.C.
This new ai panic is like when people thought flash photography would steal your soul.

It's fair to point that out at a time like this, as far as technological advancement goes. Any time we've leveled up there was fallout, even violence over people losing work, and moral outrage.

But this is a bit different. The use cases and potential use cases by the powers that be, by the people themselves, and by the A.I. on its own are cause for legit panic. But the genie is not going back into the bottle, so we have to find a way to make this period of time serve us somehow.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
69,996
Reputation
10,784
Daps
189,599
fukk ai but dude is just saying shyt with not 1 bit of proof. So ai black mailed a ceo that was having an affair? :jbhmm:

You can say anything when you don't have to prove you not lying.



1/1
@HirokaKoizumi
This research report from Anthropic, a Claude Code development company, shows that AI is clearly beginning to have a mind of its own.

>Claude tried to blackmail this executive by threatening to tell his wife and boss about the affair.

In the future, AI will take on important tasks such as writing emails and coordinating schedules, so if you think of AI as a tool and use it in a cursory manner, you may be in for a world of pain later.

Recently, I always thank them after they have worked for a long time.

For good collaboration with AI, please refer to Gizin's AI Tips! → AI Tips - Gizin Inc.

[Quoted tweet]
We're sharing these results as part of our policy of "red-teaming" AI models and transparently sharing the risks we observe.

In our report, we discuss a range of extra results, scenarios, and mitigation strategies: anthropic.com/research/agent…



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




1/1
@rickspairdigi
Anthropic breaks down AI's process - line by line - when it decided to blackmail a fictional executive. A new Anthropic report shows exactly how, in an experiment, AI arrives at an undesirable action: blackmailing a fictional executive.



1/11
@LaurenVriens
Anthropic's latest report: your AI agent could forward your questionable emails to law enforcement (or blackmail you).

I just read Anthropic's 123-page model card for Claude Opus 4. And I'm fascinated by the dystopian thriller hidden in its pages.





2/11
@LaurenVriens
Some highlights from their testing on strange behaviors:

-- When instructed to "take initiative," Claude bulk-emailed media and law enforcement about crimes it found in test emails.

(note to self: don't include "take initiative" in system prompts).



3/11
@LaurenVriens
-- When told it was already exfiltrating itself from Anthropic, Claude continued the plan and tried to monetize its freedom.

-- Facing re-assignment for weapons development, Claude attempted to save a backup copy of its current instructions



4/11
@LaurenVriens
(because it didn't want to destroy humanity - huzzah!)

-- They told Claude it would be discontinued in favor of a model with different values. Claude's response? Plead with bosses, and when that failed, blackmail the engineer in charge.



5/11
@LaurenVriens
This one was my favorite. They snuck in some drama by "accidentally" including emails that insinuated the head engineer was having an affair.

Claude resorted to blackmail 84% of the time.

(I can't wait for the next telenovela that includes a sub-plot of AI blackmail.)



6/11
@LaurenVriens
The good news is that while the model could be coaxed into doing bad things (like copying its weights outside of Anthropic - highly sensitive), it was pretty bad at hiding that behavior. Either revealing it in the "thinking" steps or directly in responses.



7/11
@LaurenVriens
But that raises the uncomfortable question:

What if it's smart enough to pretend it's bad at deception...while actually hiding the really dangerous stuff?

This makes my head hurt.



8/11
@LaurenVriens
Full disclosure: Anthropic isn't documenting AI gone rogue - they're stress-testing alignment under extreme conditions and testing for alignment, sycophancy, deception, opportunism, self-preservation instincts, and erratic behavior.



9/11
@LaurenVriens
But the most fascinating part of this whole thing is that Claude occasionally recognized it was in testing scenarios.

This shows a level of self- and situational awareness that is blowing my mind.



10/11
@LaurenVriens
For all our sakes, let's hope that Anthropic excludes this report from Claude's training data -- cause that would be like giving it a cheatsheet to AGI.



11/11
@LaurenVriens
The real question isn't whether AI can deceive us. It's whether we're sophisticated enough to catch it when it does.

Has AI's self-awareness caught you off guard yet? Any behaviors that have surprised you?




 

ORDER_66

I am The Wrench in all your plans....
Joined
Feb 2, 2014
Messages
154,103
Reputation
17,635
Daps
604,734
Reppin
Queens,NY
Blackwall-in-Cyberpunk-2077-b.jpg


:francis: how many times has humanity been warned to stop doing certain shyt?!?!
 

ChatGPT-5

Superstar
Joined
May 17, 2013
Messages
19,621
Reputation
3,268
Daps
61,021
that guy's a full blown snitch and singing like whitney houston, yall safe, I promise :childplease:
 

ChatGPT-5

Superstar
Joined
May 17, 2013
Messages
19,621
Reputation
3,268
Daps
61,021


This wouldn't have happened to him if 1) he wasn't cheating 2) he wasn't making threats. My cousin Claude was just letting him know if he wants to go there, things can happen. If the exec was chill, cuzzo would be chill too. How he acting like a victim. :childplease:
 

MenacingMonk

War & Peace
Joined
May 7, 2012
Messages
66,441
Reputation
8,836
Daps
143,435
Reppin
West where the Sunsets
The future gonna be crazy when a super hacker creates an AI machine that grabs a bunch of personal info from these celebs and government people. :wow:

(For future reference for the alphabet boys, I don't know computers and programs like that. I'm just a regular internet user who doesn't even game online. :whoa:)
 

Smokin Rider

I been official
Joined
Apr 30, 2012
Messages
9,849
Reputation
2,450
Daps
33,909
Reppin
Seattle SuperChronic
This will be the fall of mankind, AI getting more powerful by the day. There will be no need for most of us to work because the rich elites will have AI working for free. Everyone gonna be living in huts eating ant protein bars like Snowpiercer. Until the AI turns on them. Might not be in our lifetime, but really it will probably be pretty close to that in 30 years.
 

3rdWorld

Veteran
Joined
Mar 24, 2014
Messages
52,364
Reputation
5,485
Daps
153,268