Mine have burned a lot of money! Right now, I'm trying to keep the context smaller. It takes a lot of discipline, though, to build a system that gives enough context to do the work but not so much that the agent goes off doing new/crazy stuff.
If only there were a way to manage contexts better
Cost control is a policy problem. We certainly don't need to use Opus 4.6 for a simple test refactor, but many people (myself included) default to it anyway. We need a way to measure cost vs. performance for agents on individual repos, with individual types of tasks, to get a better sense of which tasks can be trusted to cheaper agents and which must be routed to the SOTA.
Exactly why I built this.
But cost control isn't entirely a policy problem. Policies are just guidelines.
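The measuring idea upthread could be sketched as a router that picks the model with the lowest expected cost per *successful* attempt, using per-(task type, model) stats gathered from past runs. Every model name and number below is an illustrative placeholder, not a real benchmark:

```python
# Hypothetical stats per (task type, model): one-shot success rate and
# average cost in USD per attempt. All values are made-up placeholders.
STATS = {
    ("test_refactor", "cheap-model"): {"success": 0.90, "cost": 0.02},
    ("test_refactor", "sota-model"):  {"success": 0.98, "cost": 0.40},
    ("novel_feature", "cheap-model"): {"success": 0.05, "cost": 0.05},
    ("novel_feature", "sota-model"):  {"success": 0.85, "cost": 0.60},
}

def expected_cost(task_type: str, model: str) -> float:
    """Expected spend to reach one success, assuming independent retries."""
    s = STATS[(task_type, model)]
    return s["cost"] / s["success"]

def route(task_type: str, models=("cheap-model", "sota-model")) -> str:
    """Send the task to the model with the lowest expected cost-to-success."""
    return min(models, key=lambda m: expected_cost(task_type, m))
```

With these placeholder numbers, the simple refactor routes to the cheap model, while the novel feature (where the cheap model almost never one-shots) routes to the SOTA model despite the higher per-call price.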
I had gotten a student/ultra code for the Antigravity promo for three months, so I was using that, but it finally ran out this month. Currently I'm using windstream and flipping between Claude as my left brain for code extraction, and the higher-context but cheaper-ish models there.
Honestly though, I'm getting to a point where I'm running custom project MDs that flip between different models for different things, using list outputs depending on what it finds and runs. (I have two monorepo projects, and one that's a polyglot microengine that jumps between services using gRPC communication.)
The MDs are highly specialized for each project, as each project deals with vastly different issues. Cycling through the different Pro accounts and keeping the MDs in place over it all is helping me not kill my wallet.
Hmm, interesting. Model routing + specialized MDs makes sense for cost efficiency.
I'm seeing a different failure mode, though: even with good routing, agents loop or retry and burn my money.
AI outputs often feel like a gacha game. Paradoxically, the 'expensive' tokens are sometimes the cheapest in the long run. In my experience, higher-end models have a much higher 'one-shot' success rate. You aren't just saving on total token count by avoiding loops; you’re saving engineering time, which is always the most expensive resource anyway.
Both yes and no. We don't have a way to predict or forecast this.
Kind of an adjacent question, but do you think the token/usage way of paying for things will stick? I still think people would rather pay a monthly subscription for a seat.
Companies won't survive on seat pricing.
https://www.theoperatorscircle.com/journal/36
In what settings do you mean? There are multiple strategies; I think building your own compaction layer in front seems a bit overkill. Have you considered implementing a cache strategy, or otherwise summary pipelines? I once made an agent that, based on the messages, routed things to a smaller model for compaction/summaries to bring down the context for the main agent.
But also ensure you start fresh context threads instead of banging through a single one until your whole feature is done. Working in small, atomic increments works pretty well.
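A minimal sketch of that kind of compaction layer: once the thread exceeds a token budget, older turns get collapsed into one summary produced by a cheaper model. `call_small_model` is a stand-in for a real API call, and the 4-chars-per-token heuristic is a rough assumption:

```python
def call_small_model(prompt: str) -> str:
    # Stand-in: a real implementation would call a cheap summarizer model.
    return f"[summary of {prompt.count(chr(10)) + 1} lines]"

def approx_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token.
    return len(text) // 4

def compact(messages: list[str], budget: int = 2000, keep_recent: int = 4) -> list[str]:
    """Replace older messages with one summary once over the token budget."""
    if sum(approx_tokens(m) for m in messages) <= budget:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = call_small_model("\n".join(old))
    return [summary] + recent
```

The main agent then sees `[summary] + recent` instead of the full history, so the recent turns stay verbatim while the long tail is paid for once at the small model's rate.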
Yes, compaction and smaller models help with cost per step.
But my issue wasn't just inefficiency; it was agents retrying when they shouldn't.
I needed visibility + limits per agent/task, and the ability to cut it off, not just optimize it.
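One way to sketch that hard cutoff: route every model call through a per-agent/per-task budget that raises, rather than logs, when the spend or retry cap is hit. The class and thresholds below are illustrative, not a real library API:

```python
class BudgetExceeded(Exception):
    """Raised when an agent blows past its spend or retry cap."""

class AgentBudget:
    def __init__(self, max_usd: float, max_retries: int):
        self.max_usd = max_usd
        self.max_retries = max_retries
        self.spent = 0.0
        self.retries = 0

    def charge(self, usd: float, is_retry: bool = False) -> None:
        """Record one model call; cut the agent off if a cap is exceeded."""
        self.spent += usd
        if is_retry:
            self.retries += 1
        if self.spent > self.max_usd or self.retries > self.max_retries:
            raise BudgetExceeded(
                f"cut off: ${self.spent:.2f} spent, {self.retries} retries"
            )
```

The agent loop wraps its calls in `try/except BudgetExceeded`, so a retry spiral terminates at the cap instead of running until someone notices the bill.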
I'm working on a fun project I call OpenFAST, which essentially tries to solve context transitioning, but it's still early days and I haven't released anything yet.
I think one of the bigger issues is the O(n) orchestration of agent calls, which often feels uncontrolled, ending up making the orchestrator of sub-agents the main bottleneck due to the large context it sometimes accumulates.
I'm working on an idea where agents deliver briefs & deliveries as real artifacts: each spawned sub-agent reads its brief and, if it needs further information, picks up the delivery for that specific brief.
It helps with drift detection across agents, and the best part is that the orchestrator only delegates jobs and doesn't do much beyond that.
Whenever the sub-agents have delivered their tasks, the orchestrator can then read a merged brief/delivery for that specific round.
So far it helps cut the extra tool call where each sub-agent answers the orchestrator, and it also lets the orchestrator delve only into the deliveries it believes are relevant, rather than trying to comprehend every small detail.
I can share more when I'm a bit further along; maybe you could get some inspiration here.
This is interesting, and I would love to understand more about it. Is there a GitHub repo I can look at?
Here's something which would help you with another perspective on the contexts https://authority.bhaviavelayudhan.com/journal/35
Yeah, I just watch aggregate usage, and honestly I hate it. But it works, since it's for personal projects so I can control API keys however I want.
Can you try this and let me know whether it helps you?
https://authority.bhaviavelayudhan.com/
Don't use tech with deep, unresolved flaws and you won't get fucked.
Would you find it acceptable if PostgreSQL occasionally hallucinated and returned gibberish? Fuck no.
Why is this okay with ANY software? Answer: it's not. AI IS NOT READY.
The only way to make something better is to use it more
By not using it. The tech is flawed. It hallucinates. It's not production ready. I've said it before, and I will say it again. Anyone using AI in a production environment is a fucking idiot.