SlopCodeBench: Benchmarking How Coding Agents Degrade over Long-Horizon Tasks

(arxiv.org)

1 point | by FiberBundle 14 hours ago

1 comment