Anthropic Claude Mythos cyber capabilities advance rapidly in new AISI tests

Anthropic’s Claude Mythos, which the company claims is too powerful for general release, appears to have gained new capabilities.

In a Wednesday blog post, the UK AI Security Institute (AISI) reported testing a newer version of Mythos that outperformed its earlier versions and OpenAI’s GPT-5.5, just a month after Mythos’ initial launch.

“The newer Mythos Preview checkpoint completed both our cyber ranges, solving the range ‘The Last Ones’ in 6 of 10 attempts and the previously unsolved ‘Cooling Tower’ in 3 of 10,” the blog noted. “This was the first time a model completed the second of our two cyber ranges.”

When Anthropic announced Mythos Preview and Project Glasswing—the cybersecurity testing alliance with rival AI labs and companies—last month, AISI found that the model “represents a step up over previous frontier models in a landscape where cyber performance was rapidly advancing.”

The third-party assessment weighed claims that ranged from dismissing Mythos as marketing hype to hailing it as a major shift in AI capabilities; the truth likely lies somewhere in between.

The updated tests also show that improvements aren’t limited to new model releases; they can also arrive between checkpoints of the same model.

AISI highlighted that AI models are rapidly improving at handling cyber tasks, which has serious cybersecurity implications, given Mythos’ proficiency in identifying software vulnerabilities.

AISI noted that the pace of cyber task completion has been doubling every 4.7 months since late 2024, faster than the 8-month doubling time it estimated in November 2025. Mythos and GPT-5.5 outpaced even that trend, though it’s uncertain whether the acceleration will continue or whether these models are outliers. The tests capped each task at 2.5 million tokens, which keeps comparisons consistent across models but understates what the models can achieve.
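To put those doubling times in perspective, here is a minimal sketch assuming a simple exponential model in which the measured capability doubles every fixed number of months; the exact metric and fitting method AISI uses are not described in the blog post, so the numbers are purely illustrative.

```python
# Illustrative only: how much a capability metric would grow in a year under
# the two doubling-time estimates mentioned above (4.7 vs. 8 months),
# assuming plain exponential growth. Not AISI's actual methodology.

def growth_factor(months: float, doubling_time_months: float) -> float:
    """Growth multiple after `months` if the metric doubles every `doubling_time_months`."""
    return 2 ** (months / doubling_time_months)

if __name__ == "__main__":
    for d in (4.7, 8.0):
        print(f"doubling every {d} months -> x{growth_factor(12, d):.1f} after one year")
    # doubling every 4.7 months -> x5.9 after one year
    # doubling every 8.0 months -> x2.8 after one year
```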

Because the models succeed so often on short tasks, the upper error bounds on their estimated capability are large, and they might perform significantly better beyond the token limit, which makes true performance difficult to quantify. Larger token budgets and more complex agent infrastructure would likely push capabilities further still.
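As a rough illustration of why ten attempts per range leave wide error bars, the sketch below computes 95% Wilson score intervals for success counts like the 6-of-10 and 3-of-10 results quoted earlier; this is a generic binomial interval, not necessarily the statistical treatment AISI applied.

```python
import math

# Rough illustration: 95% Wilson score interval for a binomial success rate.
# With only 10 attempts per cyber range, the plausible range for the true
# success rate is wide. Generic interval, not AISI's actual analysis.

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half

if __name__ == "__main__":
    for successes in (6, 3):
        lo, hi = wilson_interval(successes, 10)
        print(f"{successes}/10 successes -> 95% interval roughly {lo:.0%} to {hi:.0%}")
    # 6/10 successes -> 95% interval roughly 31% to 83%
    # 3/10 successes -> 95% interval roughly 11% to 60%
```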

The blog concludes that the 2.5 million-token limit is relatively low compared with experiments that use up to 100 million tokens, and that performance could improve with higher token budgets, especially for recent models, which benefit the most from increased context length.