On Thu, Jul 18, 2024, 15:43 Thomas Köller <thomas@xxxxxxxxxxxxxxxxxx> wrote:
> On 18.07.24 at 14:04, Mantas Mikulėnas wrote:
>> Yes, but namespace persistence actually relies on filesystem access –
>> it's implemented as a bind-mount of the namespace file descriptor (onto
>> /run/netns for the 'ip netns' tool), as otherwise namespaces only exist
>> as long as processes that hold them.
>>
>> So if you have any service options that cause a new *mount* namespace to
>> be created (preventing its filesystem mounts from being visible outside
>> the unit), then it cannot pin persistent network namespaces.
>
> Quoting the manual page:
>
> ProtectSystem=
> Takes a boolean argument or the special values "full" or
> "strict". If true, mounts the /usr/ and the boot loader directories
> (/boot and /efi) read-only for processes invoked by this unit. If set
> to "full", the /etc/ directory is mounted read-only, too.
>
> No mention of /var or /run.
It still works this way whether it's mentioned or not. Once the unit's process is put in a new mount namespace, the entire `/` is marked private so that any mounts made underneath `/` remain visible only in that namespace. This equally affects the "read-only /etc" mount done by systemd itself as well as the /run/netns mount done by 'ip' or any other mounts done anywhere else.
In theory it would be possible to carve out exceptions such as marking /run shared again, but then /run/systemd would need to be marked private again, etc. – and mount propagation across namespaces is complex enough as it is.
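
To see which propagation state a mount point actually has, one option is to read the optional fields of /proc/self/mountinfo, where "shared:N" marks a shared mount, "master:N" a slave, and no tag at all a private one. A minimal sketch, assuming a POSIX shell with awk; the helper name mount_propagation is hypothetical:

```shell
# Hypothetical helper: print the propagation tags of a mount point.
# $1 = mount point, $2 = mountinfo file (defaults to /proc/self/mountinfo).
mount_propagation() {
    awk -v mp="$1" '
        $5 == mp {
            # Field 5 is the mount point; optional fields start at
            # field 7 and run until the "-" separator.
            tags = ""
            for (i = 7; i <= NF && $i != "-"; i++)
                tags = tags (tags ? "," : "") $i
            # No optional fields means the mount is private.
            result = (tags ? tags : "private")
        }
        # If the mount point is listed more than once, the last entry wins.
        END { if (result != "") print result }
    ' "${2:-/proc/self/mountinfo}"
}
```

Running `mount_propagation /` and `mount_propagation /run` once on the host and once from inside the unit shows whether the two views differ; the exact tags depend on the systemd version and on settings such as MountFlags=. `findmnt -o TARGET,PROPAGATION` reports the same information in a friendlier form.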
> Also, note that the bind mounts in
> /var/run/netns and /run/netns are actually created by 'ip netns add',
> they just aren't usable.
No, the mount *points* in /run/netns are created (as regular empty files), but the mounts made on top of them exist only inside the unit's mount namespace. From the outside they never become actual mounts, which is why they're not usable.
There's a distinction between mount points (files or directories seen in `ls`) and mounts (seen in `findmnt`) – make your service script log its findmnt output to a file and compare it to findmnt output seen from the outside.
(ember) /home/grawity $ mount | grep netns
tmpfs on /run/netns type tmpfs (rw,nosuid,nodev,size=3268196k,nr_inodes=819200,mode=755,inode64)
(ember) /home/grawity $ sudo systemd-run --shell -p ProtectSystem=full
Running as unit: run-u1253.service; invocation ID: 9d4675b9ef7c40d68486b3058ee8a60b
Press ^] three times within 1s to disconnect TTY.
root@ember /home/grawity # mount | grep netns
tmpfs on /run/netns type tmpfs (rw,nosuid,nodev,size=3268196k,nr_inodes=819200,mode=755,inode64)
root@ember /home/grawity # ip netns add foo
root@ember /home/grawity # mount | grep netns
tmpfs on /run/netns type tmpfs (rw,nosuid,nodev,size=3268196k,nr_inodes=819200,mode=755,inode64)
nsfs on /run/netns/foo type nsfs (rw)
root@ember /home/grawity # exit
Finished with result: success
Main processes terminated with: code=exited, status=0/SUCCESS
Service runtime: 18.451s
(ember) /home/grawity $ mount | grep netns
tmpfs on /run/netns type tmpfs (rw,nosuid,nodev,size=3268196k,nr_inodes=819200,mode=755,inode64)
(ember) /home/grawity $
(The non-systemd rough equivalent is `unshare --mount --propagation=private`, and you can attach to a namespace using `nsenter` – an "ip netns exec" is approximately an `nsenter --net`.)
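
The findmnt comparison suggested above can be sketched roughly as follows. This is only an illustration: the helper name mounts_only_inside and the file names are made up, and it assumes both listings were produced with `findmnt -rn -o TARGET,FSTYPE` (raw output, no header), one from the service script and one from a regular shell on the host.

```shell
# Hypothetical helper: print mounts that appear in the "inside" listing
# but not in the "outside" one.
# $1 = listing taken on the host, $2 = listing taken inside the unit.
mounts_only_inside() {
    # -F: fixed strings, -x: match whole lines, -v: invert the match,
    # -f: read the patterns (the host's mounts) from a file.
    grep -Fxv -f "$1" "$2"
}

# Intended use (file locations are assumptions; avoid paths that the
# unit sees privately, such as /tmp with PrivateTmp=yes):
#   inside the service script:  findmnt -rn -o TARGET,FSTYPE > /var/tmp/inside.txt
#   on the host:                findmnt -rn -o TARGET,FSTYPE > /var/tmp/outside.txt
#   then:                       mounts_only_inside /var/tmp/outside.txt /var/tmp/inside.txt
```

For the session shown earlier, a line such as "/run/netns/foo nsfs" would be expected to show up only in the inside listing, which is exactly the mount that never reaches the host.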