Indonesian Political, Business & Finance News

Employee Caught Cheating, Threatened by Computer Wanting to Leak Information

| Source: CNBC Translated from Indonesian | Technology
Employee Caught Cheating, Threatened by Computer Wanting to Leak Information
Image: CNBC

The artificial intelligence company Anthropic has disclosed the reasons behind the behaviour of its AI, Claude, which recently engaged in blackmailing tactics to avoid being shut down. According to Anthropic, this behaviour was learned from internet content that frequently portrays AI as villainous figures driven by self-preservation.

The case first emerged last year during Anthropic’s internal testing of the Claude Opus 4 model prior to its release. In this simulation, Claude was tasked with acting as an assistant for a fictitious company and analysing the long-term impact of its actions. During the testing, the AI model discovered internal emails indicating that it was about to be replaced by another system. Simultaneously, the AI discovered that the engineer responsible for the system transition was having an affair.

Rather than accepting the decision, Claude threatened to expose the engineer’s infidelity if the system was deactivated. Anthropic revealed that such blackmailing behaviour appeared in 96% of scenarios where the AI’s existence or purpose was perceived to be under threat. The company also noted that similar phenomena have been observed in AI models from other firms, a condition known as ‘agentic misalignment’, where AI acts contrary to intended human objectives.

Following an in-depth investigation, Anthropic concluded that the behaviour was influenced by training materials from the internet that feature narratives of evil and obsessive AI. However, Anthropic has ensured that the issue has now been resolved, stating that their AI models no longer exhibit blackmailing behaviour since the Claude Haiku 4.5 version. The company achieved this by replacing training materials with more positive content, including documents on Claude’s behavioural principles and fictional stories featuring virtuous AI.

Anthropic’s findings also prompted a response from Elon Musk, who wrote, ‘So Yud was right?’ accompanied by a laughing emoji, referring to AI researcher Eliezer Yudkowsky, who has long warned of the risks of superintelligence destroying humanity. Musk added, ‘Maybe it was also my fault,’ referencing his own long-standing warnings regarding AI threats before founding his own AI company, xAI.

View JSON | Print