Really appreciate the pointer to the Chroma research; the context-rot framing matches what I've been seeing. Even with large context windows, the signal-to-noise ratio drops quickly, especially when tool schemas are included upfront but never actually used.
With ARK, I'm trying to treat context as a constrained working set rather than something static. It starts minimal and only expands when there's a signal (failures, ambiguity, etc.), so the model isn't reasoning over stale context.
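To make the "constrained working set" idea concrete, here's a minimal sketch of what I mean. All names here are illustrative, not ARK's actual API: the core context is fixed and tiny, and extra material only enters when a turn produces a signal like a failure or ambiguity.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingSet:
    core: list[str]  # always-included minimal context
    expansions: list[str] = field(default_factory=list)  # added only on signal

    def on_signal(self, signal: str, extra_context: str) -> None:
        """Expand the working set only when a turn produced a signal."""
        if signal in ("failure", "ambiguity"):
            self.expansions.append(extra_context)

    def assemble(self) -> str:
        """Build the prompt context from the current working set."""
        return "\n".join(self.core + self.expansions)


ws = WorkingSet(core=["task: fix failing test in parser.py"])

# a turn fails -> expand with the traceback, and nothing else
ws.on_signal("failure", "traceback: KeyError in parse_header")
```

The point is that expansion is gated on signal, so stale context never accumulates by default.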
Curious about your approach — are you leaning more toward:
- restructuring how context is stored/retrieved (external memory, RAG, etc.), or
- dynamically controlling what actually enters the prompt at each step?
Feels like a fundamental bottleneck for production agent systems, so would love to compare how you're thinking about the latency vs accuracy tradeoff.
To directly answer this bit:
> Feels like a fundamental bottleneck for production agent systems, so would love to compare how you're thinking about the latency vs accuracy tradeoff.
I'm really not focusing on latency right now. My short term goal is to prove the thesis that `ail` can improve same-model performance on SWEBench Pro vs. their own published results.
Can I run SWE-Bench Pro with GLM-4.6 and beat their published `68.20` (https://www.swebench.com/)?
The argument is that latency just isn't the part we should worry about right now. If we're reducing the time to code something from ~6 weeks to 1 hour, does it really matter that we add another 30 minutes of tool calls if we get it 100% right instead of 80% right?
Make it work -> Make it right -> make it fast.
I'm still on the first one tbh :rofl-emoji:
So, my approach is still being built and I'm still very hand-wavy about how it's going to come together, but effectively I'm building pipelines of prompts. Rather than running our LLM sequences as long-running sessions where the entire context gets loaded on every turn (a recipe for rot), we unlock the ability to introduce a thinking layer between each step of the process.
So before each turn is sent to the LLM, we (potentially) run a local process to assemble a bespoke context containing only what's required for that specific turn.
If a tool isn't going to be needed on a given turn, we don't include its schema in the system prompt for that round.
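Roughly, the per-turn assembly step looks like this. This is an illustrative sketch, not the actual `ail` spec: the tool names, schemas, and keyword heuristic are all made up, standing in for whatever local "thinking layer" decides what a turn actually needs.

```python
# Hypothetical tool schemas; in practice these would be real JSON schemas.
TOOL_SCHEMAS = {
    "read_file": '{"name": "read_file", "params": {"path": "string"}}',
    "run_tests": '{"name": "run_tests", "params": {}}',
    "web_search": '{"name": "web_search", "params": {"query": "string"}}',
}


def tools_for_turn(turn_goal: str) -> list[str]:
    """Cheap local heuristic standing in for the 'thinking layer' that
    decides which tools this specific turn actually needs."""
    needed = []
    if "read" in turn_goal or "inspect" in turn_goal:
        needed.append("read_file")
    if "test" in turn_goal:
        needed.append("run_tests")
    return needed


def build_system_prompt(turn_goal: str) -> str:
    """Assemble a bespoke system prompt containing only required schemas."""
    schemas = [TOOL_SCHEMAS[t] for t in tools_for_turn(turn_goal)]
    return "\n".join(["You are a coding agent.", *schemas, f"Goal: {turn_goal}"])


prompt = build_system_prompt("run the failing test and inspect the traceback")
```

Here `web_search` never ships because this turn doesn't need it, so it can't contribute to rot.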
I'm still formalizing the spec at the moment and think I'm about six months to a year out from having a full, human-ready UI running.
This is the foundational paper I'm basing the tool on: https://github.com/AlexChesser/ail/blob/main/docs/blog/the-y... while the spec starts here: https://github.com/AlexChesser/ail/blob/main/spec/core/s01-p...
Essentially I'm trying to build an artificial neocortex and frontal lobe to provide a complete layer of Executive Function that operates on top of our agents - like Claude Code (or whatever else).
I'm basing the roadmap on about 100 years of cognitive science. We've legitimately had names for all these failure modes (in humans) since the 1960s, and we have observations of what we're witnessing in agents going back to 1848.
We have the roadmap from Psychology.
This is a pretty important piece, and the research backs you up. Moving that context out of your system prompt dynamically will help reduce the lost-in-the-middle effect; context rots almost immediately. I've got a project being built to address this directly as well, but I'm still very early days.
Keep it up! You're on the right track.
Hong, K., & Chroma Research Team. (2025). Context rot: How increasing input tokens impacts LLM performance. Chroma Research. https://research.trychroma.com/context-rot
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638