+ Eric

On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
> Hi everyone
>
> I recently ran into this problem again, and so I figured I'd ask if
> anyone has any good idea how to solve it:
>
> When running a command through 'ip netns exec', iproute2 will
> "helpfully" create a new mount namespace and remount /sys inside it,
> AFAICT to make sure /sys/class/net/* refers to the right devices inside
> the namespace. This makes sense, but unfortunately it has the side
> effect that no mount commands executed inside the ns persist. In
> particular, this makes it difficult to work with bpffs; even when
> mounting a bpffs inside the ns, it will disappear along with the
> namespace as soon as the process exits.
>
> To illustrate:
>
> # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
> # ip netns exec <nsname> ls /sys/fs/bpf
> <nothing>
>
> This happens because namespaces are cleaned up as soon as they have no
> processes, unless they are persisted by some other means. For the
> network namespace itself, iproute2 will bind mount /proc/self/ns/net to
> /var/run/netns/<nsname> (in the root mount namespace) to persist the
> namespace. I tried implementing something similar for the mount
> namespace, but that doesn't work; I can't manually bind mount the 'mnt'
> ns reference either:
>
> # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
> mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
> dmesg(1) may have more information after failed mount system call.
>
> When running strace on that mount command, it seems the move_mount()
> syscall returns EINVAL, which, AFAICT, is because the mount namespace
> file references itself as its namespace, which means it can't be
> bind-mounted into the containing mount namespace.
>
> So, my question is, how to overcome this limitation? I know it's
> possible to get a reference to the namespace of a running process, but
> there is no guarantee there are any processes running inside the
> namespace (hence the persisting bind mount for the netns). So is there
> some other way to persist the mount namespace reference, so we can pick
> it back up on the next 'ip netns' invocation?
>
> Hoping someone has a good idea :)

We ran into similar problems. The only solution we found was to use
nsenter instead of 'ip netns exec'.

To be able to bind mount a mount namespace onto a file, the directory
containing that file must be on a private (not shared) mount. For example:

  mkdir -p /run/foo
  mount --make-rshared /
  mount --bind /run/foo /run/foo
  mount --make-private /run/foo
  touch /run/foo/ns
  unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
      read -r pid &&
      mount --bind /proc/$pid/ns/mnt /run/foo/ns
  }
  nsenter --mount=/run/foo/ns ls /

But this doesn't work under 'ip netns exec'.

Regards,
Nicolas
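
P.S. To make the nsenter route a bit more concrete, here is a minimal
sketch of a wrapper that enters both the persisted network namespace and
the persisted mount namespace. The function name netns_mnt_exec and the
DRY_RUN and MNT_NS_FILE variables are made up for illustration; it assumes
the netns was created with 'ip netns add' (so it lives under
/var/run/netns/) and that the mount namespace was persisted at /run/foo/ns
as in the example above. Unlike 'ip netns exec', nothing remounts /sys for
you here; that would presumably have to be done once inside the persisted
mount namespace.

```shell
#!/bin/sh
# Hypothetical helper, not part of iproute2 or util-linux.
# Usage: netns_mnt_exec <nsname> <command> [args...]
netns_mnt_exec() {
    nsname=$1
    shift
    # With DRY_RUN set, only print the nsenter command line
    # (actually running it requires root).
    # MNT_NS_FILE defaults to the bind mount created in the example above.
    ${DRY_RUN:+echo} nsenter \
        --net="/var/run/netns/$nsname" \
        --mount="${MNT_NS_FILE:-/run/foo/ns}" \
        -- "$@"
}
```

For example, 'netns_mnt_exec testns ls /sys/fs/bpf' would run ls with
both namespaces entered, so anything mounted there earlier should still be
visible as long as the bind mount on /run/foo/ns exists.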