Building a tool to compare value across LLM provider options.
Part of it tracks how many tokens you actually get from various subscriptions, over time.
Past week, multiple people asked me about it — they'd been hitting Claude and Codex limits faster than expected.
Ran the tests yesterday. Reran today. Here's what came back:
▸ ChatGPT Plus / GPT-5.5: 95M → 37M tokens/week (−61%)
▸ Claude Max 20× / Sonnet 4.6: 388M → 214M (−45%)
▸ Claude Max 20× / Opus 4.7: 248M → 162M (−35%)
▸ Claude Pro / Sonnet 4.6: 19.6M → 11.4M (−42%)
▸ Claude Pro / Opus 4.7: 15.6M → 10.2M (−35%)
5 of 5 retested plans dropped 35-61% in five days. None went up.
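For the curious: one crude way to get numbers like the above is to burn through a single rate-limit window and extrapolate. A minimal sketch, with a hypothetical send_prompt client standing in for the real provider clients:

    WEEK_SECONDS = 7 * 24 * 3600

    class RateLimitError(Exception):
        """Raised by the (hypothetical) client when the plan's limit trips."""

    def estimate_weekly_tokens(send_prompt, window_seconds: float) -> float:
        """Burn through one rate-limit window, then scale up to a week.

        send_prompt is a hypothetical callable: it returns an object with
        .input_tokens and .output_tokens, and raises RateLimitError once
        the subscription's limit is hit.
        """
        used = 0
        while True:
            try:
                reply = send_prompt("ping")  # cheap fixed prompt
            except RateLimitError:
                break
            used += reply.input_tokens + reply.output_tokens
        # Plan limits refill per window (e.g. every 5 h on Claude plans),
        # so scale the per-window total up to a full week.
        return used * (WEEK_SECONDS / window_seconds)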
Anyone else seeing similar in their own usage?
Vouched you. Side note: perhaps include anchor tags with IDs so you can skip to a section like
>Value over time by provider
via a #fragment at the end of your URL. It looks like you're selling me something at the top of the page, so I can see how you ended up flagged dead.
"Quality metrics" need much more discussion and attention, in my opinion.
Not a criticism of this project — it's a good idea, it just highlights the central question of "how well is this model working?" I'm not sure it's so straightforward.
I agree! My "dream" way to do it is closer to how the Aider Leaderboard works, but even a bit better: a GDPEval-like set of tasks, where across all tasks and all models you know how much time/tokens/money/quality you get from a particular model on a particular task. I was thinking of running evals against skills in that sense.
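Concretely, something like one record per (task, model) cell; just a sketch, nothing is built yet:

    from dataclasses import dataclass

    @dataclass
    class TaskResult:
        """One cell of the matrix: one model run on one task."""
        task_id: str
        model: str
        wall_time_s: float   # how long the run took
        tokens: int          # input + output tokens consumed
        cost_usd: float      # tokens priced at the provider's rates
        quality: float       # 0..1, graded against the task's rubric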
But that is a huge and expensive project. The only "approximation" I could pull off reasonably to get this started was to use benchmark scores as a "surrogate" for quality.
But I'm working on a way to get this going. If you have additional thoughts on how to approach this, it would be super valuable.
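To make the "surrogate" idea concrete: a value score like quality-weighted tokens per dollar, with a benchmark score standing in for measured quality. The scores and prices below are illustrative, not real data from the tool:

    from dataclasses import dataclass

    @dataclass
    class Plan:
        name: str
        monthly_price: float    # USD
        tokens_per_week: float  # measured, as in the list above
        benchmark_score: float  # 0..1 quality surrogate (e.g. Aider pass rate)

    def value_score(p: Plan) -> float:
        """Quality-weighted tokens per dollar per month."""
        monthly_tokens = p.tokens_per_week * 52 / 12
        return monthly_tokens * p.benchmark_score / p.monthly_price

    # Illustrative numbers only (made-up scores, prices from memory):
    plans = [
        Plan("Claude Pro / Sonnet 4.6", 20.0, 11.4e6, 0.80),
        Plan("Claude Pro / Opus 4.7",   20.0, 10.2e6, 0.85),
    ]
    for p in sorted(plans, key=value_score, reverse=True):
        print(f"{p.name}: {value_score(p):,.0f} quality-weighted tokens/$")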
AI always seems to perform best on the first day after release, and then its performance gradually declines.
Is the AI itself degrading? Or is it because of product-policy changes, such as system prompt modifications and usage limits? Or is it both?
I sometimes wonder whether degradation is simply an inherent property of LLMs themselves.
What we've learned recently is that they keep tuning things: less compute, fewer tokens, less of whatever else is thrown at it as time goes on.
They use these releases to get users. Once they have them, they can play around with "degrading" the model just enough not to lose users while saving on costs. It sadly kinda makes sense...
so happy clang's output is consistently great
:D