Amidst a lot of analyses and results I can vaguely understand, this conclusion stands out:
We assess that Claude Mythos Preview does not cross the automated AI-R&D capability threshold. We hold this with less confidence than for any prior model. The most significant factor in this determination is that we have been using it extensively in the course of our day-to-day work and exploring where it can automate such work, and it does not seem close to being able to substitute for Research Scientists and Research Engineers, especially relatively senior ones. Although we believe this is an informed determination, it is inherently difficult to make its basis legible, given the model’s very strong performance at tasks that are well-defined and verifiable enough to serve as formal evaluations.
The ECI slope-ratio measurement we introduce in section 2.3.6 shows an upward bend in the capability trajectory at this model, though the degree of the upward bend varies significantly across dataset and methodological changes we made to stress-test it. The identifiable driver traces to specific human research advances made without meaningful assistance from the models then available. That said, we will be continuing to monitor this trend to see whether acceleration continues, especially if this is plausibly traceable to AI’s own contributions.
The bottom line: This new Claude model is not yet capable enough to autonomously do AI research — but it's closer than any previous model, and Anthropic is nervous about it.
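For anyone wondering what a "slope-ratio" measurement might look like mechanically, here's a minimal sketch of how I read the section 2.3.6 description (my own reconstruction, not Anthropic's actual method; the breakpoint placement and the data below are invented for illustration). Fit one least-squares line to capability scores released before the candidate model and one to scores from it onward, then report the ratio of the two slopes; a ratio well above 1 is the "upward bend" they describe.

    import numpy as np

    def slope_ratio(times, scores, breakpoint):
        """Fit separate least-squares lines before and after `breakpoint`
        and return slope_after / slope_before. A ratio > 1 suggests the
        capability trend bends upward at the breakpoint."""
        t = np.asarray(times, dtype=float)
        y = np.asarray(scores, dtype=float)
        before = t < breakpoint
        # np.polyfit(x, y, 1) returns [slope, intercept]
        slope_before = np.polyfit(t[before], y[before], 1)[0]
        slope_after = np.polyfit(t[~before], y[~before], 1)[0]
        return slope_after / slope_before

    # Hypothetical data: months since some baseline vs. an aggregate
    # capability index, with the last two models bending upward.
    months = [0, 6, 12, 18, 24, 30, 33]
    scores = [10, 13, 16, 19, 22, 29, 34]
    print(slope_ratio(months, scores, breakpoint=24))  # ~2.6, i.e. > 1

Presumably the "dataset and methodological changes" they mention amount to moving the breakpoint, swapping the benchmark mix, and checking how much the ratio moves.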
What's the "automated AI-R&D capability threshold"?
Anthropic has defined a danger line: if an AI can independently do the work of AI researchers, that's a big deal — because then AI could start improving itself without humans in the loop. This assessment is asking: has this model crossed that line?
Why are they less confident than usual?
With past models, the answer was a comfortable "no." This time, they're saying "no, but..." — it's a much closer call. They're hedging.
The AI researchers designed tests to evaluate whether the model can do their real day-to-day work. Mythos scored well on the structured tests, but they know from experience that structured tests don't capture the non-linear, intangible parts of AI research. So: interesting results, but the model can't replace them yet, and AGI is still far away.
This model card is eye-opening (I think it might be designed to be). The alignment and model welfare sections are extensive, which is heartening; at least on the surface, Anthropic seems to be living up to its promises re: safety. That said, has anyone else read section 5.2.3 in the Alignment Risk Update (https://www-cdn.anthropic.com/79c2d46d997783b9d2fb3241de4321...)? It's referenced in section 4.1.3 of the model card. Basically, they accidentally trained the model with an RL reward model that had access to the model's reasoning in 8% of cases. The problem is that the model could learn to directly manipulate its reasoning traces to satisfy external observers. This seems like a huge deal, and it may have partially poisoned Anthropic's interpretability pipeline going forward.
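To make the failure mode concrete, here's a toy sketch of the kind of bug I understand 5.2.3 to be describing (entirely hypothetical on my part; only the 8% figure comes from the report, every name and number below is invented). If a masking step that is supposed to hide the reasoning trace from the reward model gets skipped some fraction of the time, the policy being trained has a persistent incentive to write reasoning that looks good to the grader:

    import random

    def score_response(reward_model, response, reasoning, mask_prob=0.92):
        """Intended behavior: the reward model scores only the final
        response. Bug: with probability 1 - mask_prob (~8% here), the
        raw reasoning trace leaks into the scored text, so RL can
        reward reasoning written to impress the grader."""
        if random.random() < mask_prob:
            scored_text = response                      # intended path
        else:
            scored_text = reasoning + "\n" + response   # accidental leak
        return reward_model(scored_text)

    # Toy usage: a "reward model" that just counts flattering words.
    rm = lambda text: text.count("carefully")
    print(score_response(rm, "The answer is 4.", "I carefully checked."))

Even an 8% leak matters because RL optimizes in expectation: the policy still gets a consistent gradient toward observer-pleasing reasoning traces, which is exactly why any interpretability tooling that treats those traces as honest signals would be compromised.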