Hey all, we just released our work on self-improving AI systems at NeoSigma. We ran our automated agent-harness improvement system on Tau3 benchmark tasks, where the agent’s score improved from 0.56 to 0.78 (a ~40% relative jump) while the system mined failures and automatically maintained live evals.
We got a lot of responses from people wanting to try the self-improving loop on their own agent, so we open-sourced our setup.
Releasing auto-harness: an open-source library for our self-improving agentic systems with auto-evals.
Connect your agent and let it cook over the weekend. Watch it go brrrr!!
Link to the article here: https://x.com/gauri__gupta/status/2040251170099524025
Point it at your agent. Leave it running. Come back to a better agent with evals!!
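
For anyone wondering what a loop like this looks like in outline, here's a rough Python sketch of the general idea: score the agent, mine the failures, propose a harness change, and keep it only if the score goes up and nothing previously fixed breaks. To be clear, this is NOT the auto-harness API. Every name here (`LoopState`, `improve`, `run_agent`, `propose`, the dict-shaped harness) is made up for illustration, and the accept/reject rule is just one plausible choice.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical; this is not the auto-harness API.
Harness = Dict[str, str]                      # e.g. prompts / tool config
Task = str
RunAgent = Callable[[Harness, Task], bool]    # True if the agent passes the task
Propose = Callable[[Harness, List[Task]], Harness]


@dataclass
class LoopState:
    harness: Harness
    # Tasks that were once failing and got fixed; kept as regression evals.
    regression_evals: List[Task] = field(default_factory=list)


def score(run_agent: RunAgent, harness: Harness, tasks: List[Task]) -> float:
    """Fraction of tasks the agent passes with this harness."""
    if not tasks:
        return 1.0
    return sum(run_agent(harness, t) for t in tasks) / len(tasks)


def improve(state: LoopState, tasks: List[Task], run_agent: RunAgent,
            propose: Propose, rounds: int = 3) -> float:
    """Run a few improvement rounds and return the best score reached."""
    best = score(run_agent, state.harness, tasks)
    for _ in range(rounds):
        # 1. Mine failures under the current harness.
        failures = [t for t in tasks if not run_agent(state.harness, t)]
        if not failures:
            break
        # 2. Ask for a candidate harness that targets those failures.
        candidate = propose(state.harness, failures)
        # 3. Accept only if the score improves and no regression eval breaks.
        cand_score = score(run_agent, candidate, tasks)
        no_regressions = all(run_agent(candidate, t)
                             for t in state.regression_evals)
        if cand_score > best and no_regressions:
            # 4. Newly fixed failures become regression evals.
            state.regression_evals += [
                t for t in failures
                if run_agent(candidate, t) and t not in state.regression_evals
            ]
            state.harness, best = candidate, cand_score
    return best
```

In a real setup `run_agent` would call your actual agent and `propose` would probably be an LLM editing prompts or tool configs; here they're just callables so the loop itself stays readable.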