LoCoMo AI Benchmark: 6.4% of answer key wrong, judge accepts 63% of fake answers

(github.com)

2 points | by dial481 6 hours ago

2 comments