Christian Brauner <brauner@xxxxxxxxxx> writes:

> On Thu, Sep 28, 2023 at 11:54:23AM +0200, Nicolas Dichtel wrote:
>> + Eric
>>
>> On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
>> > Hi everyone
>> >
>> > I recently ran into this problem again, and so I figured I'd ask if
>> > anyone has any good idea how to solve it:
>> >
>> > When running a command through 'ip netns exec', iproute2 will
>> > "helpfully" create a new mount namespace and remount /sys inside it,
>> > AFAICT to make sure /sys/class/net/* refers to the right devices inside
>> > the namespace. This makes sense, but unfortunately it has the side
>> > effect that no mount commands executed inside the ns persist. In
>> > particular, this makes it difficult to work with bpffs; even when
>> > mounting a bpffs inside the ns, it will disappear along with the
>> > namespace as soon as the process exits.
>> >
>> > To illustrate:
>> >
>> > # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
>> > # ip netns exec <nsname> ls /sys/fs/bpf
>> > <nothing>
>> >
>> > This happens because namespaces are cleaned up as soon as they have no
>> > processes, unless they are persisted by some other means. For the
>> > network namespace itself, iproute2 will bind mount /proc/self/ns/net to
>> > /var/run/netns/<nsname> (in the root mount namespace) to persist the
>> > namespace. I tried implementing something similar for the mount
>> > namespace, but that doesn't work; I can't manually bind mount the 'mnt'
>> > ns reference either:
>> >
>> > # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
>> > mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
>> > dmesg(1) may have more information after failed mount system call.
>> >
>> > When running strace on that mount command, it seems the move_mount()
>> > syscall returns EINVAL, which, AFAICT, is because the mount namespace
>> > file references itself as its namespace, which means it can't be
>> > bind-mounted into the containing mount namespace.
>> >
>> > So, my question is, how to overcome this limitation? I know it's
>> > possible to get a reference to the namespace of a running process, but
>> > there is no guarantee there are any processes running inside the
>> > namespace (hence the persisting bind mount for the netns). So is there
>> > some other way to persist the mount namespace reference, so we can pick
>> > it back up on the next 'ip netns' invocation?
>> >
>> > Hoping someone has a good idea :)
>> We ran into similar problems. The only solution we found was to use nsenter
>> instead of 'ip netns exec'.
>>
>> To be able to bind mount a mount namespace on a file, the directory of this file
>> should be private. For example:
>>
>> mkdir -p /run/foo
>> mount --make-rshared /
>> mount --bind /run/foo /run/foo
>> mount --make-private /run/foo
>> touch /run/foo/ns
>> unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
>>   read -r pid &&
>>   mount --bind /proc/$pid/ns/mnt /run/foo/ns
>> }
>> nsenter --mount=/run/foo/ns ls /
>>
>> But this doesn't work under 'ip netns exec'.
>
> Afaiu, each ip netns exec invocation allocates a new mount namespace.
> If you run multiple concurrent ip netns exec commands and leave them
> around then they all get a separate mount namespace. Not sure what the
> design behind that was. So even if you could persist the mount namespace
> of one there's still no way for ip netns exec to pick that up iiuc.
>
> So imho, the solution is to change ip netns exec to persist a mount
> namespace and netns namespace pair.
> unshare does this easily via:
>
> sudo mkdir /run/mntns
> sudo mount --bind /run/mntns /run/mntns
> sudo mount --make-slave /run/mntns
>
> sudo mkdir /run/netns
>
> sudo touch /run/mntns/mnt1
> sudo touch /run/netns/net1
>
> sudo unshare --mount=/run/mntns/mnt1 --net=/run/netns/net1 true
>
> So I'd probably patch iproute2.

Patching iproute2 is what I'm trying to do - sorry if that wasn't clear
:) However, I couldn't get it to work. I think it's probably because I
was missing the bind-to-self/--make-slave dance on the containing
folder, as Nicolas pointed out. Will play around with that a bit more,
thanks for the pointers both of you!

-Toke
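
For reference, the sequence discussed in this thread can be sketched as a
small shell fragment. This is only an illustration under assumptions, not
what iproute2 does or will do: the function names, the run() helper, the
DRY_RUN switch and the /run/mntns location are all hypothetical. By default
it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: persist a mount namespace and a network namespace as a pair,
# then re-enter both. DRY_RUN, run(), the function names and MNTNS_DIR
# are illustrative assumptions, not part of iproute2 or util-linux.
DRY_RUN=${DRY_RUN:-1}
MNTNS_DIR=/run/mntns   # hypothetical home for mount ns references
NETNS_DIR=/run/netns

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

persist_ns_pair() {
    name=$1
    run mkdir -p "$MNTNS_DIR" "$NETNS_DIR"
    # Bind the directory onto itself and stop it being shared, so that a
    # mount namespace file can be bind-mounted inside it. Nicolas's
    # example uses --make-private, Christian's uses --make-slave.
    run mount --bind "$MNTNS_DIR" "$MNTNS_DIR"
    run mount --make-private "$MNTNS_DIR"
    run touch "$MNTNS_DIR/$name" "$NETNS_DIR/$name"
    # unshare bind-mounts the new namespaces onto the given files, so
    # they outlive the 'true' process.
    run unshare --mount="$MNTNS_DIR/$name" --net="$NETNS_DIR/$name" true
}

exec_in_ns_pair() {
    name=$1; shift
    run nsenter --mount="$MNTNS_DIR/$name" --net="$NETNS_DIR/$name" "$@"
}
```

After sourcing the fragment, 'persist_ns_pair testns' followed by
'exec_in_ns_pair testns ls /sys/fs/bpf' prints the commands; with DRY_RUN=0
and root privileges they are executed instead. Calling persist_ns_pair
repeatedly would stack bind mounts on the directory, so a real
implementation would need to check whether it is already a mount point.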