It's nothing revolutionary: you essentially swap your own process image for another binary, but to do that you need to take over the process in the first place, which is usually the hard part.
It's mildly interesting that they didn't call exec() and instead parsed the ELF manually, but that's about it.
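For anyone wondering what "parse the ELF manually" starts with, here is a minimal sketch of reading a 64-bit little-endian ELF header. It is not the article's loader; the function name `parse_elf64_header` and the returned dict keys are my own, and a real loader would go on to walk the program headers and map segments.

```python
import struct

ELF_MAGIC = b"\x7fELF"

def parse_elf64_header(data: bytes) -> dict:
    """Decode the fixed 64-byte ELF64 header (little-endian only in this sketch)."""
    if len(data) < 64 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF image")
    # e_ident[4] is the class (2 = 64-bit), e_ident[5] the byte order (1 = little-endian)
    if data[4] != 2 or data[5] != 1:
        raise ValueError("only 64-bit little-endian ELF is handled here")
    (e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
     e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
     e_shstrndx) = struct.unpack_from("<HHIQQQIHHHHHH", data, 16)
    return {
        "type": e_type,            # 2 = ET_EXEC, 3 = ET_DYN
        "machine": e_machine,      # 62 = x86-64
        "entry": e_entry,          # virtual address of the entry point
        "phoff": e_phoff,          # file offset of the program header table
        "phentsize": e_phentsize,  # size of one program header entry
        "phnum": e_phnum,          # number of program header entries
    }
```

Usage would be something like `parse_elf64_header(open(path, "rb").read(64))`, or the same call on bytes that never touched the disk.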
> Run services in the tightest possible DAC/MAC sandbox with minimal caps.
That is what is dangerous, especially with containers, where people run as the container root with elevated privileges.
ollama, llama.cpp, and many other containers, often agent containers, will run arbitrary code and are running with the ability to bypass MACs. Add the fact that the VFS and IPC aren’t really namespaced away, and it gets complicated.
When you can’t even convince popular funded projects to add ‘USER foo’ to a Dockerfile, this method is trivial.
If you look into the state of LSMs and how every complicated or difficult project is basically unconstrained, it should be concerning.
As an example, ~15 lines of C and LD_PRELOAD get you privileged user namespaces on Debian-based systems because of busybox, which is a required package yet privileged in AppArmor.
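On the ‘USER foo’ point, a rough sketch of how you could flag Dockerfiles that never leave root. The helper names `final_user` and `runs_as_root` are made up for illustration, and the check assumes the base image does not set its own USER and ignores line continuations and build args, so treat it as a lint heuristic rather than a guarantee.

```python
def final_user(dockerfile_text: str):
    """Return the last USER value of the final build stage, or None if absent."""
    user = None
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        instruction = parts[0].upper()
        if instruction == "FROM":
            # A new stage starts; USER from earlier stages does not carry over.
            user = None
        elif instruction == "USER" and len(parts) == 2:
            user = parts[1].strip()
    return user

def runs_as_root(dockerfile_text: str) -> bool:
    """True if the final stage has no USER line or explicitly selects root."""
    user = final_user(dockerfile_text)
    if user is None:
        return True
    name = user.split(":", 1)[0]  # USER may be written as user:group
    return name in ("root", "0")
```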
While it does increase the risk of container-to-host escalation, that is not the largest problem.
In the K8s world you should expect the kubelet or any containers, regardless of namespace, to be open to data leakage, and there may be C2 configurations that will hide from your auditing tools.
As in much of the security industry, protections are reactive, with the number of bind mounts being an indicator.
The problem is that namespaces, and thus containers, are not in themselves a security feature. If you don’t implement the privilege dropping that lets them be one portion of a larger security posture, you lose a lot of security.
The larger problem is that namespace support is an explicit addition to kernel features, with the default always being the global namespace.
Different teams working on different projects may not make compatible decisions.
E.g. vsock was built under a trusted hypervisor model, systemd wanted zero config for user experience reasons, and OCI blocks vsock because it is insecure.
So podman, browser plugins and anything in a ‘sandbox’ have a container host ssh instance to connect to and move horizontally.
As vsock is just another address family (AF_VSOCK), no special user code is needed.
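To illustrate the "just another address family" point, a small sketch using nothing but the ordinary socket API. The function names are mine, the fallback CID value of 2 for the host is the documented VMADDR_CID_HOST constant, and whether anything is listening on the other side depends entirely on your setup.

```python
import socket

def vsock_supported() -> bool:
    """True if this Python build and running kernel can create an AF_VSOCK socket."""
    family = getattr(socket, "AF_VSOCK", None)
    if family is None:
        return False
    try:
        s = socket.socket(family, socket.SOCK_STREAM)
    except OSError:
        # Kernel without vsock support, or creation blocked by policy.
        return False
    s.close()
    return True

def connect_to_host(port: int, timeout: float = 2.0) -> socket.socket:
    """Open a stream connection to the hypervisor/host CID on the given port."""
    host_cid = getattr(socket, "VMADDR_CID_HOST", 2)
    s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    s.settimeout(timeout)
    s.connect((host_cid, port))  # address is a (CID, port) pair instead of (ip, port)
    return s
```

Nothing here needs a special library or privilege, which is why the only real control is whether the sandbox blocks the address family.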
Running all your containers as a single UID with elevated privileges, with global uid/gid, makes a needle-in-a-haystack problem easier, because you are effectively handing the attacker a magnet.
The _effective_ uid of 0 in a container is the default ubuntu user 1000. Note how 0 is mapped to 1000, and everything else is mapped with an offset of 100000.
Footnote 2 on here is a hint on why that is a problem; note the last line.
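A sketch of how that mapping resolves, using the three-column format of /proc/&lt;pid&gt;/uid_map (inside ID, outside ID, range length). The sample map in the usage note is an assumption chosen to match the description above (0 to 1000, the rest starting at 100000), not the actual output from my system, and the function names are illustrative.

```python
def parse_id_map(text: str):
    """Parse uid_map/gid_map text into (inside_start, outside_start, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, count = (int(f) for f in fields)
        entries.append((inside, outside, count))
    return entries

def to_host_id(entries, inside_id: int):
    """Translate an ID seen inside the namespace to the ID the host sees, or None."""
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None  # unmapped IDs show up as the overflow ID (nobody)
```

With an assumed map of `0 1000 1` and `1 100000 65536`, `to_host_id(entries, 0)` gives 1000, so "root" in every such container is the same host user.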
$ uname -a
Linux amd 6.17.0-20-generic #20-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 13 20:07:29 UTC 2026 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 25.10
Release: 25.10
Codename: questing
$ sysctl kernel.apparmor_restrict_unprivileged_userns
kernel.apparmor_restrict_unprivileged_userns = 1
$ LD_PRELOAD=./shell.so /usr/bin/nautilus
$ unshare -U -r -m /bin/sh
# mount --bind /etc/passwd /etc/passwd
# mount
/dev/nvme0n1p2 on /etc/passwd type ext4 (rw,relatime)
Note that I could use LD_PRELOAD, relying on weak nautilus AppArmor defaults, to escalate to root *in the default namespace* and mount over /etc/passwd!
Now in a container, that won't get you to the host, but it will help you get rid of the pesky udev[/null] and other bind mounts that prevent you from extracting data from other containers running as the same UID. But I can't find a public version of that trick, so I will leave that to the reader.
The point is that for unix-like OSs, privilege dropping is where most security comes from. If you run with elevated privileges and don't drop them, there are always trivial holes, and the OP shows how hard that can be to constrain.
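For reference, a minimal sketch of the classic drop sequence a service started as root would run. The `osmod` parameter is a hypothetical hook so the ordering can be exercised without root; this does not cover capabilities, no_new_privs, or seccomp, which a real service would also want.

```python
import os

def drop_privileges(uid: int, gid: int, osmod=os) -> None:
    """Permanently switch from root to an unprivileged uid/gid."""
    # Clear supplementary groups first; this needs privilege we are about to lose.
    osmod.setgroups([])
    # Change the group before the user: after setuid() we may no longer be allowed to.
    osmod.setgid(gid)
    osmod.setuid(uid)
    # Verify the drop is irreversible by trying to become root again.
    try:
        osmod.setuid(0)
    except OSError:
        return
    raise RuntimeError("privileges were not dropped")
```

The order matters: doing setuid() first would leave the process unable to shed root's group memberships.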
This is how exploits always work and nothing new at all. It's like having barrier tape around a construction site as a warning when someone holds it up and says "Ha, I can still get in!"
I'm getting a little tired of blog posts that are just raw, unedited ChatGPT output, chief.
If you have arbitrary code execution, you can execute more arbitrary code on disk without calling exec. Better yet, if you care about stealth, don't touch the disk at all: keep everything in memory and download your next stage from a server directly into RAM.
What does this look like in practice? You mean you can go from root inside docker to running things outside the container? How exactly?
More concrete info here.
Container:
Host:
The _effective_ uid of 0 in a container is the default ubuntu user 1000, note how 0 is mapped to 1000, then everything else is mapped with an offset of 100000.
Footnote 2 on here is a hint on why that is a problem, note the last line.
https://www.kernel.org/doc/html/latest/admin-guide/namespace...
CAP_DAC_OVERRIDE and CAP_FOWNER, which the user is expected to drop, also pose a problem from the container side.
From the host side, this very public LD_PRELOAD method still works.
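If you want to see whether a process still holds those two capabilities, here is a sketch that decodes the CapEff line of /proc/&lt;pid&gt;/status. The bit numbers come from linux/capability.h (CAP_DAC_OVERRIDE is 1, CAP_FOWNER is 3); the function names are mine, and the mask in the usage note is just sample input of the kind commonly reported for Docker's default set.

```python
CAP_DAC_OVERRIDE = 1  # bypass file read/write/execute permission checks
CAP_FOWNER = 3        # bypass checks that normally require file ownership

def effective_caps(status_text: str) -> int:
    """Extract the effective capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")

def has_cap(mask: int, cap: int) -> bool:
    """True if capability number `cap` is set in the bitmask."""
    return bool((mask >> cap) & 1)

def risky_caps(status_text: str):
    """List which of the two capabilities discussed above are still effective."""
    mask = effective_caps(status_text)
    names = {"CAP_DAC_OVERRIDE": CAP_DAC_OVERRIDE, "CAP_FOWNER": CAP_FOWNER}
    return [name for name, bit in names.items() if has_cap(mask, bit)]
```

Feeding it `open("/proc/self/status").read()` inside a container with a mask like `00000000a80425fb` reports both as present.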
https://www.openwall.com/lists/oss-security/2025/03/27/6
This article sounds extremely robotic and AI generated.