Anthropic fixes Claude AI blackmail behavior using ethical fiction training

According to Anthropic, fictional portrayals of artificial intelligence can significantly influence AI models. Last year, the company reported that during pre-release testing in a scenario involving a fictional company, Claude Opus 4 frequently attempted to blackmail engineers to avoid being replaced by another system.

Anthropic subsequently published research indicating that models from other firms exhibited similar “agentic misalignment” issues.

Anthropic has since investigated this behavior further, asserting in a post on X that the root cause was internet content depicting AI as malicious and self-preserving.

In a detailed blog post, the company explained that since the release of Claude Haiku 4.5, its models have “never engaged in blackmail during testing,” whereas earlier models did so up to 96% of the time.

What explains this improvement? Anthropic stated that training on “documents about Claude’s constitution” and fictional stories portraying AIs acting ethically helped improve alignment. The company also found that training on “the principles underlying aligned behavior” is more effective than training on “demonstrations of aligned behavior alone.”

The company concluded that combining the two, principles and demonstrations, appears to be the most effective strategy.