Hi everyone, I'm one of the creators of ACP.
We originally built this out of pure frustration. While working on our own product (Emitta), we realized that having an LLM 'look' at a screen via vision and guess where to click was ridiculously slow, unreliable, and expensive.
We looked at MCP, but that's strictly data/tools. We looked at AG-UI and A2UI, but they require building net-new components. We just wanted the agent to operate the clunky, existing UI we already had. So we wrote a protocol that basically gives the agent a structured 'map' of the live DOM, and lets it send back native execution commands (like set_field, click).
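To make the snapshot-in, commands-out loop concrete, here's a hypothetical sketch (the actual @acprotocol/server message shapes may differ; element ids, field names, and the toy executor below are invented for illustration):

```javascript
// Hypothetical illustration only: the real @acprotocol/server wire format
// may differ. This sketches the two halves described above: a structured
// "map" of the live DOM sent to the agent, and native execution commands
// (set_field, click) the agent sends back -- no pixels, no guessing.

// A minimal snapshot: stable ids plus just enough semantics for the
// agent to pick a target.
const snapshot = {
  url: "https://app.example.com/invoices/new",
  elements: [
    { id: "e1", role: "textbox", label: "Customer email", value: "" },
    { id: "e2", role: "button", label: "Save invoice" },
  ],
};

// The agent replies with native commands rather than screen coordinates.
const commands = [
  { action: "set_field", target: "e1", value: "ada@example.com" },
  { action: "click", target: "e2" },
];

// A toy executor: resolve each command's target in the snapshot and
// apply it, returning a trace of what was done.
function execute(snapshot, commands) {
  const byId = new Map(snapshot.elements.map((el) => [el.id, el]));
  const trace = [];
  for (const cmd of commands) {
    const el = byId.get(cmd.target);
    if (!el) throw new Error(`unknown target ${cmd.target}`);
    if (cmd.action === "set_field") el.value = cmd.value;
    trace.push(`${cmd.action} -> ${el.label}`);
  }
  return trace;
}

const trace = execute(snapshot, commands);
console.log(trace);
```

The point of the structure is that targeting is by stable id against a semantic map, so the agent never reasons about layout or pixels.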
The reference server is up on npm (@acprotocol/server). I'm around all day and would love to hear your thoughts on the architecture, whether the action set (8 actions) makes sense to you, and whether you think the native UI-control approach is the right path forward.