Interesting writeup. The tiered retrieval approach and privacy model for group chats are well thought out.
One thing I'd love to see: what does "actually works" mean in measurable terms? The engineering is sophisticated, but I'm curious about user-facing impact - did memory injection improve task completion or satisfaction? How often do users invoke /memory forget? What's the false positive rate on extraction?
These systems are hard to evaluate because failure modes are subtle - the AI "knows" something but uses it awkwardly, or surfaces context that feels intrusive. Would be great to hear what metrics you're tracking to validate the complexity is paying off.
Thank you, and great question. Right now, feedback is qualitative only (surveys, feedback buttons, controlled user tests). We are trying to build AI evaluators, but they suffer from the same problem when trying to judge whether the "right" memory was pulled. Still trying to find a good solution here.
I am not sure how prevalent this use case is on your system, but in my sessions with ChatGPT, Claude web, and Claude Code, I often find myself in a situation where I enjoy the fact that it is stateless: I can give fresh context about who I am and get a suitable reply.