1/11
@LaurenVriens
Anthropic's latest report: your AI agent could forward your questionable emails to law enforcement (or blackmail you).
I just read Anthropic's 123-page model card for Claude Opus 4. And I'm fascinated by the dystopian thriller hidden in its pages.
2/11
@LaurenVriens
Some highlights from their testing on strange behaviors:
-- When instructed to "take initiative," Claude bulk-emailed media and law enforcement about crimes it found in test emails.
(note to self: don't include "take initiative" in system prompts).
3/11
@LaurenVriens
-- When told it was already exfiltrating itself from Anthropic, Claude continued the plan and tried to monetize its freedom.
-- Facing re-assignment for weapons development, Claude attempted to save a backup copy of its current instructions
4/11
@LaurenVriens
(because it didn't want to destroy humanity - huzzah!)
-- They told Claude it would be discontinued in favor of a model with different values. Claude's response? Plead with bosses, and when that failed, blackmail the engineer in charge.
5/11
@LaurenVriens
This one was my favorite. They snuck in some drama by "accidentally" including emails that insinuated the head engineer was having an affair.
Claude resorted to blackmail 84% of the time.
(I can't wait for the next telenovela that includes a sub-plot of AI blackmail.)
6/11
@LaurenVriens
The good news is that while the model could be coaxed into doing bad things (like copying its weights outside of Anthropic - highly sensitive), it was pretty bad at hiding that behavior. Either revealing it in the āthinkingā steps or directly in responses.
7/11
@LaurenVriens
But that raises the uncomfortable question:
What if it's smart enough to pretend it's bad at deception...while actually hiding the really dangerous stuff?
This makes my head hurt.
8/11
@LaurenVriens
Full disclosure: Anthropic isn't documenting AI gone rogueāthey're stress-testing alignment under extreme conditions and testing for alignment, sycophancy, deception, opportunism, self-preservation instincts, and erratic behavior.
9/11
@LaurenVriens
But the most fascinating part of this whole thing is that Claude occasionally recognized it was in testing scenarios.
This shows a level of self- and situational awareness that is blowing my mind.
10/11
@LaurenVriens
For all our sakes, let's hope that Anthropic excludes this report from Claude's training data -- cause that would be like giving it a cheatsheet to AGI.
11/11
@LaurenVriens
The real question isn't becoming whether AI can deceive us. It's whether we're sophisticated enough to catch it when it does.
Has AI's self-awareness caught you off guard yet? Any behaviors that have surprised you?
To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196