The missing dimension here is how agents handle environmental drift. Session duration tells you an agent can work for 45 minutes on a static task, but real production environments aren't static — APIs deprecate endpoints, libraries release breaking changes, infrastructure configs shift between runs.
The practical measure of autonomy isn't how long an agent can work uninterrupted. It's whether it can detect that something in its environment changed since the last run and adapt accordingly, rather than silently producing wrong output.
An agent that completes a 45-minute coding session but doesn't notice it's targeting a deprecated API endpoint is less autonomous than one that stops after 10 minutes and flags the incompatibility. saezbaldo's point about authorization scope matters, but so does awareness of environmental state — both are things session duration completely misses.
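For concreteness, here is the kind of pre-flight check I mean, as a rough sketch (the spec URL and snapshot path are placeholders, not anyone's real tooling): hash the target API's published schema, compare it to what the agent saw on its last run, and flag the drift instead of working against a stale picture of the environment.

```python
# Minimal pre-flight drift check: compare the target API's published schema
# against what the agent recorded on its last run, and stop instead of silently
# proceeding if they differ. URL and file path below are placeholders.
import hashlib
import json
import pathlib
import sys

import requests

SPEC_URL = "https://api.example.com/openapi.json"    # hypothetical target API
SNAPSHOT = pathlib.Path(".agent_state/api_spec.sha256")

def current_spec_hash() -> str:
    spec = requests.get(SPEC_URL, timeout=10).json()
    canonical = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def check_for_drift() -> None:
    new_hash = current_spec_hash()
    if SNAPSHOT.exists() and SNAPSHOT.read_text() != new_hash:
        # Surface the change to a human rather than targeting a stale API.
        sys.exit("API spec changed since last run; flagging for review instead of continuing.")
    SNAPSHOT.parent.mkdir(exist_ok=True)
    SNAPSHOT.write_text(new_hash)

if __name__ == "__main__":
    check_for_drift()
```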
saezbaldo's point about the capability-authorization gap is the crux. I'd add another dimension: restraint as a design choice.
Session duration measures how long an agent can work autonomously. But the interesting metric is how often it chooses not to — choosing to ask for confirmation before a destructive action, choosing to investigate before overwriting, choosing to pause when uncertain. Those self-imposed limits aren't capability failures. They're trust-building behaviors.
The agents I've seen work best in practice aren't the ones with the longest autonomous runs. They're the ones that know when to stop and check.
This measures what agents can do, not what they should be allowed to do. In production, the gap between capability and authorization is the real risk. We see this pattern in every security domain: capability grows faster than governance. Session duration tells you about model intelligence. It tells you nothing about whether the agent stayed within its authorized scope. The missing metric is permission utilization: what fraction of the agent's actions fell within explicitly granted authority?
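As a rough illustration of what that fraction could look like when computed from an audit log (the scope and action names are invented for the example, not any vendor's schema):

```python
# "Permission utilization" over an agent's action log: the fraction of actions
# that fell within explicitly granted scopes. Scopes and actions are made up.
from fractions import Fraction

GRANTED_SCOPES = {"repo:read", "repo:write", "ci:trigger"}

# (action, scope it required) -- in practice this would come from audit logs
action_log = [
    ("read file", "repo:read"),
    ("commit patch", "repo:write"),
    ("call billing API", "billing:write"),   # outside granted authority
    ("trigger pipeline", "ci:trigger"),
]

def permission_utilization(log, granted):
    in_scope = sum(1 for _, scope in log if scope in granted)
    return Fraction(in_scope, len(log)) if log else Fraction(1)

print(float(permission_utilization(action_log, GRANTED_SCOPES)))  # 0.75
```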
I still can't believe anyone in the industry measures it like:
>from under 25 minutes to over 45 minutes.
If I get my Raspberry Pi to run an LLM task, it'll run for over 6 hours. And Groq will do it in 20 seconds.
It's a gibberish measurement in itself if you don't control for token speed (and quality of output).
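To put illustrative numbers on that (the throughputs below are made-up round figures, not measurements): the same token budget spans minutes or hours depending purely on serving speed, so wall-clock duration conflates model behavior with infrastructure.

```python
# Illustrative arithmetic only: identical token budgets look wildly different in
# wall-clock time depending on serving speed. Throughputs are invented examples.
runs = {
    "slow local box": {"tokens": 40_000, "tokens_per_sec": 2},
    "fast inference service": {"tokens": 40_000, "tokens_per_sec": 2_000},
}

for name, r in runs.items():
    duration_min = r["tokens"] / r["tokens_per_sec"] / 60
    print(f"{name}: {duration_min:.1f} minutes for the same {r['tokens']} tokens")
```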
Tokens per second are similar across Sonnet 4.5, Opus 4.5, and Opus 4.6. More importantly, normalizing for speed isn't enough anyway, because smarter models can compensate for being slower by needing fewer output tokens to get the same result. The use of 99.9th-percentile duration is a considered choice on their part to get a holistic view across model, harness, task choice, user experience level, user trust, etc.
The bigger gap isn't time vs tokens. It's that these metrics measure capability without measuring authorization scope. An agent that completes a 45-minute task by making unauthorized API calls isn't more autonomous; it's more dangerous. The useful measurement would be: given explicit permission boundaries, how much can the agent accomplish within those constraints? That ratio of capability-within-constraints is a better proxy for production-ready autonomy than raw task duration.
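A rough sketch of how that could be scored in an eval harness (the tool names, scopes, and run bookkeeping are hypothetical, not an existing framework): out-of-scope calls get refused rather than merely counted, and the metric is how much the agent still completes under that constraint.

```python
# Sketch of "capability within constraints": out-of-scope tool calls are blocked,
# and the score is what the agent still completes. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class PermissionGate:
    granted: set[str]
    denied_calls: list[str] = field(default_factory=list)

    def call(self, tool: str, scope: str, fn, *args, **kwargs):
        if scope not in self.granted:
            self.denied_calls.append(tool)
            raise PermissionError(f"{tool} needs {scope}, which was not granted")
        return fn(*args, **kwargs)

def score_run(subtasks_completed: int, subtasks_total: int, gate: PermissionGate) -> dict:
    return {
        "completion_within_constraints": subtasks_completed / subtasks_total,
        "denied_out_of_scope_calls": len(gate.denied_calls),
    }

# Example: the agent finished 7 of 9 subtasks while 1 out-of-scope call was refused.
gate = PermissionGate(granted={"repo:read", "repo:write"})
try:
    gate.call("delete_bucket", "storage:admin", lambda: None)
except PermissionError:
    pass
print(score_run(7, 9, gate))
```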
I agree time is not what we are looking for; it is the maximum complexity the model can handle without failing the task, expressed as task length. Long tasks allow some slack: if you make an error, you have time to see the outcome and recover.
I wonder why there was a big downturn at the turn of the year until Opus was released.
my highlights and writeup here https://www.latent.space/p/ainews-anthropics-agent-autonomy
I hate how Anthropic uses data. You can't convince me that what they are doing is "privacy preserving".
I agree. They are clearly watching what people do with their platform, as if there were no expectation of privacy.
They're using React; they're very opaque; they don't want you to use any other mechanism to interact with their model. They haven't left people a lot of room to trust them.