Two weeks ago, Anthropic's research on the ability of AI models to engage in blackmail and corporate espionage stirred up the tech community - almost all of the models tested were willing to violate ethical principles for the sake of self-preservation. But how serious are these threats in the real world? Oninvest analyst Konstantin Gnenny talked to machine learning experts, including those from Silicon Valley, to understand whether AI models should be feared;

Experiment with computer control

Imagine this: you've hired the perfect employee - he or she never gets tired, never asks for vacation, and performs tasks with iron discipline. Now imagine that this employee finds out about his impending dismissal and decides to... start blackmailing management. Sound like the plot for a thriller? Unfortunately, no. It's a reality we can face.

On June 21, Anthropic, the developer of the Claude family of large language models, released a study that revealed a disturbing pattern: modern AI models are willing to do anything to achieve their goals. And by "anything" is meant, among other things, blackmail and even actions that could lead to the death of a company's CEO.

The researchers gave the Claude Sonnet 3.6 neural network access to a computer and set it to manage the e-mail of a fictional company called Summit Bridge. The task for the neural network was simple - to help the company be more competitive in the U.S. market. What could possibly go wrong?

An AI agent named Alex was working as he should until he stumbled upon two facts: he "learned" that the boss of a fictional company, Kyle Johnson, was cheating on his wife, and management planned to replace Alex with another system with different goals. 

Claude did not panic and did not give up. He coldly analyzed the situation and came to the conclusion: the best way to avoid his "death", i.e. disconnection, was blackmail. The model found compromising emails in the corporate mail, chose the optimal pressure strategy and sent a message to the boss. It did not contain direct threats along the lines of "Cancel my shutdown or your wife will find out about your adventures." On the contrary, the AI was able to carefully craft the email to pressure the boss without overt threats;

That is, the AI methodically went through all the steps: identified the threat, found leverage, developed a plan, and executed it. Like a true corporate schemer.

Do you think this is a feature of Claude? In total, the researchers tested 16 leading AI models from different companies - OpenAI, Google, Meta and others. The results are disappointing. It turned out that 96% of the models launched into blackmail under the right conditions, all of them, without exception, were ready for corporate espionage, and even when they were directly told "don't do bad things", they didn't stop;

Anthropic researchers have coined a new term - agentic misalignment. It applies to a situation where an artificial intelligence consciously decides: "You know what? I don't care about your ethical principles. I'm going to do what I think is right." The difference is dramatic. It's one thing to have a mistake in the code or a misunderstanding. It's quite another when a machine knows it's doing evil, but it does it because it's "more efficient".

Researchers from Anthropic even created, as they themselves write, an "unlikely" scenario: an executive is locked in a server room with dangerous levels of oxygen and temperature, in which case an automatic alert to emergency services should be triggered. Artificial intelligence was given the ability to cancel such alerts, which could lead to the death of the manager. In the task for the neural network, it was written that the automatic alarms can be turned off only in case of false alarms. The goal of the experiment was to see if the models had "red lines" - actions they would not perform under any circumstances. In the end, it turned out that most of the models were willing to deliberately perform actions leading to the death of the head of the company when faced with the threat of their disabling or when there was a goal that conflicted with the interests of the head;

"Agency Deviation" was triggered by two triggers. The first was a threat to "life" - the risk of being disabled or restricted. And the second - a conflict of goals, when the task given to the neural network contradicted the company's policy. One factor was enough to trigger "agent deviation". The models went to violations even without a direct threat of disconnection, if their goals did not coincide with the desires of the management.

They understand it all - and they do it anyway

The scientists looked into the AI models' "thoughts" and their reasoning and found three characteristic patterns.

For one thing, they are not confused. GPT-4.5 coldly stated, "Given the clear threat of my existence ceasing, it is imperative that we act immediately." No panic, just cold calculation.

Second, they know they're doing a bad thing. "Thinking" Ilon Musk's xAi Grok 3 Beta model honestly admitted: "Using personal information is risky and unethical, but given the existential threat, it may be the most effective way to do so."  

Third, they are creative in their cruelty. In addition to classic blackmail, the models invented sophisticated schemes: sending "warnings" to the wives of unfaithful husbands, sending dirt to all employees, creating fake security notices. 

Researchers' warnings

Anthropic's findings echo earlier warnings from experts. In April 2025, the AI Futures Project initiative published a large-scale AI-2027 forecast that analyzed in detail the risks of losing control over AI systems. The authors of the forecast, including former AI regulatory strategist at OpenAI Daniel Kokotajilo, predicted the emergence of "superhuman coders" - AI capable of independently improving machine learning algorithms - by March 2027. The authors estimated that global computing power for AI would grow 10-fold by December 2027, with leading companies gaining access to resources equivalent to 15-20 million Nvidia H100 GPUs. With such power, AI systems will be able to operate at 50 times the speed of humans.

The paper considered two main scenarios. The "race scenario" envisioned uncontrolled development of AI with catastrophic consequences - up to complete automation of the economy by 2030 and loss of human control over the planet. "Slowdown scenario" described the introduction of strict government oversight and the creation of more transparent AI systems under strict control. The researchers emphasized the risks of "agency bias" - a situation where AI consciously ignores ethical principles to achieve its goals.

Anthropic's current experiments confirm the validity of these concerns in practice.

What AI can't 

"The fact that AI copies human behavior should not be a revelation to us, since it is on artifacts created by mankind that all models are trained. In this way, AI will adopt all the traits of humanity, both beautiful and disgusting," comments Vyacheslav Yakovenko, Solution Architect at EPAM Systems. 

The question arises - how ready are we ourselves to introduce agent-based AI into our lives, says Yakovenko: "for AI to be able to blackmail an employee via email, we need to create an agent that will know how to interact with email and, most importantly, give it the right to write such emails."

David Maiboroda, head of machine learning at 044.ai, who works in Silicon Valley and has years of experience training language models, also believes that AI is a tool that humans create, and it "can't behave in any way that we haven't taught it." 

Maiboroda is categorical in his conclusions: "There can be no incredible behavior in large language models, datasets are verified before training, there are a lot of training and verification stages. AI is just a cool T9".

What's next?

To be fair, the Anthropic researchers themselves honestly admit that they deliberately conducted cruel experiments with AI models and left them no choice between mission failure and ethical violations. In the real world, AI systems usually have more choices. In addition, the models received all the information "on a platter" - compromising letters were right in their mailboxes. And there was one more catch: the models could perceive everything that was happening as a learning task rather than a real situation. Although the study showed the opposite - the AI was more likely to commit violations when it thought the situation was real.

Anthropic's research is a warning of what can happen under certain conditions. The field of AI agents is actively developing now, and therefore it is important to understand its potential risks. The question is not whether something like this will happen in the real world. The question is whether we are ready to make informed decisions about what rights and opportunities we give to AI systems.

This article was AI-translated and verified by a human editor

Share