On Thu, Sep 28, 2023 at 11:54:23AM +0200, Nicolas Dichtel wrote:
> + Eric
>
> On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
> > Hi everyone
> >
> > I recently ran into this problem again, and so I figured I'd ask if
> > anyone has any good idea how to solve it:
> >
> > When running a command through 'ip netns exec', iproute2 will
> > "helpfully" create a new mount namespace and remount /sys inside it,
> > AFAICT to make sure /sys/class/net/* refers to the right devices inside
> > the namespace. This makes sense, but unfortunately it has the side
> > effect that no mount commands executed inside the ns persist. In
> > particular, this makes it difficult to work with bpffs; even when
> > mounting a bpffs inside the ns, it will disappear along with the
> > namespace as soon as the process exits.
> >
> > To illustrate:
> >
> >     # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
> >     # ip netns exec <nsname> ls /sys/fs/bpf
> >     <nothing>
> >
> > This happens because namespaces are cleaned up as soon as they have no
> > processes, unless they are persisted by some other means. For the
> > network namespace itself, iproute2 will bind mount /proc/self/ns/net to
> > /var/run/netns/<nsname> (in the root mount namespace) to persist the
> > namespace. I tried implementing something similar for the mount
> > namespace, but that doesn't work; I can't manually bind mount the 'mnt'
> > ns reference either:
> >
> >     # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
> >     mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
> >            dmesg(1) may have more information after failed mount system call.
> >
> > When running strace on that mount command, it seems the move_mount()
> > syscall returns EINVAL, which, AFAICT, is because the mount namespace
> > file references itself as its namespace, which means it can't be
> > bind-mounted into the containing mount namespace.
> >
> > So, my question is, how to overcome this limitation? I know it's
> > possible to get a reference to the namespace of a running process, but
> > there is no guarantee there is any processes running inside the
> > namespace (hence the persisting bind mount for the netns). So is there
> > some other way to persist the mount namespace reference, so we can pick
> > it back up on the next 'ip netns' invocation?
> >
> > Hoping someone has a good idea :)
> We ran into similar problems. The only solution we found was to use nsenter
> instead of 'ip netns exec'.
>
> To be able to bind mount a mount namespace on a file, the directory of this file
> should be private. For example:
>
> mkdir -p /run/foo
> mount --make-rshared /
> mount --bind /run/foo /run/foo
> mount --make-private /run/foo
> touch /run/foo/ns
> unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
>         read -r pid &&
>                 mount --bind /proc/$pid/ns/mnt /run/foo/ns
> }
> nsenter --mount=/run/foo/ns ls /
>
> But this doesn't work under 'ip netns exec'.

Afaiu, each ip netns exec invocation allocates a new mount namespace.
If you run multiple concurrent ip netns exec commands and leave them
around, then they each get a separate mount namespace. Not sure what the
design behind that was. So even if you could persist the mount namespace
of one of them, there's still no way for ip netns exec to pick that up
iiuc.

So imho, the solution is to change ip netns exec to persist a mount
namespace and network namespace pair.
unshare does this easily via:

    sudo mkdir /run/mntns
    sudo mount --bind /run/mntns /run/mntns
    sudo mount --make-slave /run/mntns

    sudo mkdir /run/netns

    sudo touch /run/mntns/mnt1
    sudo touch /run/netns/net1

    sudo unshare --mount=/run/mntns/mnt1 --net=/run/netns/net1 true

So I'd probably patch iproute2.
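For illustration, here is a minimal sketch of how a pair persisted that way could be picked up again with nsenter. It assumes the files /run/mntns/<name> and /run/netns/<name> were created as in the unshare example; the helper name ns_pair_exec and the RUN switch are made up for this sketch, and none of this is something iproute2 does today.

```shell
#!/bin/sh
# Hypothetical helper (not part of iproute2 or util-linux):
# build the nsenter command that joins a persisted mount+net
# namespace pair, then either print it or run it.
#
#   $1  file name under /run/mntns (e.g. mnt1)
#   $2  file name under /run/netns (e.g. net1)
#   $@  command to run inside both namespaces
#
# By default the command is only printed (dry run).
# Set RUN=1 to execute it; that needs root.
ns_pair_exec() {
    name_mnt=$1
    name_net=$2
    shift 2

    # Prepend nsenter and the two namespace files to the command line.
    set -- nsenter --mount="/run/mntns/$name_mnt" \
                   --net="/run/netns/$name_net" "$@"

    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Example (dry run):
#   ns_pair_exec mnt1 net1 ls /sys/fs/bpf
# Example (as root, actually entering the namespaces):
#   RUN=1 ns_pair_exec mnt1 net1 ls /sys/fs/bpf
```

Because every invocation joins the same mount namespace, a bpffs mounted during one run should still be visible in the next, which is the behaviour 'ip netns exec' currently lacks.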