Recursive models like TRM/CTM/UT have created a lot of buzz lately. But they're rarely used outside of static, toy domains - and especially not for language.
Back in 2018, "Universal Transformers" tried exactly this. However, follow-up work revealed that simple RLMs (recursive LMs) don't yield substantial performance gains relative to the FLOPs spent.
In this work, we argue that a few simple tricks unlock large performance gains and let RLMs outperform both iso-param and iso-FLOP baselines.
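For readers new to the idea: a recursive LM reuses one weight-tied block across depth, so compute (FLOPs) grows with the number of recursions while the parameter count stays fixed. Here's a minimal PyTorch sketch of that structure; names like `RecursiveLM` and `n_recur` are ours for illustration, not from any of the papers above:

```python
import torch
import torch.nn as nn

class RecursiveLM(nn.Module):
    """Illustrative weight-tied (recursive) transformer LM:
    one shared block applied n_recur times."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_recur=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # A single shared layer stands in for the whole stack.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.n_recur = n_recur
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h = self.embed(tokens)
        # Causal mask so the sketch behaves like a language model.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        # Same weights reused at every "layer": parameters stay constant
        # while FLOPs scale linearly with n_recur.
        for _ in range(self.n_recur):
            h = self.shared_block(h, src_mask=mask)
        return self.lm_head(h)

model = RecursiveLM()
logits = model(torch.randint(0, 32000, (2, 16)))  # (batch, seq) -> (2, 16, vocab)
```

Weight tying is what makes the two comparisons meaningful: against an iso-param baseline the recursive model spends more FLOPs with the same parameters, and against an iso-FLOP baseline it matches compute with far fewer parameters.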