Hi, I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS share with a different user namespace. fsopen() is done in the container namespaces (user, mnt and net namespaces) while fsconfig(), fsmount() and move_mount() are done on the host namespaces. The mount on the host is available in the container via mount propagation from the host mount. With this, the files on the NFS server with uid 0 are available in the container with uid 0. On the host, they are available with uid 4294967294 (make_kuid(&init_user_ns, -2)). The code to reproduce my test is available at: https://github.com/kinvolk/nfs-mount-in-userns And the results and traces are attached at the end. While the basic feature works, I have some thoughts. First, doing the fsopen() in the container namespaces implements two features in one step: 1. Selection of the userns for the id mapping translation. 2. Selection of the netns for the connection to the NFS server. I was wondering if this only considers the scenario where the user wants to make the connection to the NFS server from the network namespace of the container. I think there is another valid use case to use the userns of the container but the netns of the host or a third-party netns. We can use the correct set of setns() to do the fsopen() in the container userns but in the host netns, but then we’d be in a netns that does not belong to the current userns, so we would not have any capability over it. In my tests, that seems to work fine when the netns and the userns of the fs_context are not related. Still, I would find the API cleaner if the userns and netns were selected explicitly with something like: sfd = fsopen("nfs4", FSOPEN_CLOEXEC); usernsfd = pidfd_open(...); or usernsfd = open(“/proc/pid/ns/user”) fsconfig(sfd, FSCONFIG_SET_FD, "userns", NULL, usernsfd); netnsfd = pidfd_open(...); or netnsfd = open(“/proc/pid/ns/net”) fsconfig(sfd, FSCONFIG_SET_FD, "netns", NULL, netnsfd); This would avoid the need for fd passing after the fsopen(). This would require fsconfig() (possibly in nfs_fs_context_parse_param()) to do the capability check but making it more explicit sounds better to me. Second, the capability check in fsopen() is the following: if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)) This means that we cannot just create a temporary userns with the desired id mapping, but we additionally need to enter a mntns owned by the userns. However the code in fsopen() does not seem to do anything with the mntns (The new mount will only be associated with the current mntns at move_mount() time), so we could just create a temporary userns + mntns. It seems weird to me that the capability check is done in relation to the current mntns even though the code does not do anything with it. In Kubernetes, the NFS mount is done before creating the user namespace. pkg/kubelet/kubelet.go Kubelet's syncPod() will do the following in this order: 1. Mount the volumes with CSI or other volume implementations: WaitForAttachAndMount() line 1667 2. Call the CRI's createPodSandbox via kl.containerRuntime.SyncPod() line 1678 to create the user namespace and network namespace. This means that at the time of the NFS mount, we have not yet created the user namespace or the network namespace, and even less configured it with the CNI plugin. With this API where the id mapping for the NFS mount is decided at the superblock level, we would need to refactor the Kubelet code to be able to call the CSI mount after the creation of the sandbox, and after the configuration with CNI. This will be more complicated to integrate in Kubernetes than the idmapped mounts patch set where the id mapping is set at the bind mount level (https://lists.linuxfoundation.org/pipermail/containers/2020-October/042477.html). However, it is less invasive. This approach works for NFS volumes in Kubernetes but would not work with other volumes like hostPath (bind mount from the host) where we don’t have a new superblock. Lastly, I checked the implementation of nfs_compare_super() and it seems fine. In Kubernetes, we want to be able to create several Kubernetes pods with different userns and mount the same NFS share in several pods. The kernel will have to create different NFS superblocks for that scenario and it does that correctly in nfs_compare_super() by comparing the userns and comparing the netns as well. ----- Running ./nfs-mount-in-userns strace: Process 4022 attached [pid 4022] fsopen("nfs4", FSOPEN_CLOEXEC) = 6 [pid 4022] +++ exited with 0 +++ --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=4022, si_uid=0, si_status=0, si_utime=0, si_stime=0} --- fsconfig(7, FSCONFIG_SET_STRING, "source", "127.0.0.1:/server", 0) = 0 fsconfig(7, FSCONFIG_SET_STRING, "vers", "4.2", 0) = 0 fsconfig(7, FSCONFIG_SET_STRING, "addr", "127.0.0.1", 0) = 0 fsconfig(7, FSCONFIG_SET_STRING, "clientaddr", "127.0.0.1", 0) = 0 fsconfig(7, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(7, FSMOUNT_CLOEXEC, 0) = 6 move_mount(6, "", AT_FDCWD, "/mnt/nfs", MOVE_MOUNT_F_EMPTY_PATH) = 0 +++ exited with 0 +++ ./nfs-mount-in-userns returned 0 last dmesg line about nfs4_create_server [55258.702256] nfs4_create_server: Using creds from non-init userns 459 55 0:40 / /mnt/nfs rw,relatime shared:187 - nfs4 127.0.0.1:/server rw,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 + : 'Files on the NFS server:' + ls -lani /server/ total 20 1048578 drwxr-xr-x. 5 0 0 4096 Nov 10 09:19 . 2 dr-xr-xr-x. 21 0 0 4096 Nov 9 14:25 .. 1048582 drwx------. 2 0 0 4096 Nov 10 09:19 dir-0 1048583 drwx------. 2 1000 1000 4096 Nov 10 09:19 dir-1000 1048584 drwx------. 2 3000 3000 4096 Nov 10 09:19 dir-3000 1048579 -rw-------. 1 0 0 0 Nov 10 09:19 file-0 1048580 -rw-------. 1 1000 1000 0 Nov 10 09:19 file-1000 1048581 -rw-------. 1 3000 3000 0 Nov 10 09:19 file-3000 + : 'Files on the NFS mountpoint (from container PoV):' + nsenter -U -m -n -t 4002 sh -c 'ls -lani /mnt/nfs' total 20 1048578 drwxr-xr-x. 5 0 0 4096 Nov 10 09:19 . 786433 drwxr-xr-x. 3 65534 65534 4096 May 16 16:08 .. 1048582 drwx------. 2 0 0 4096 Nov 10 09:19 dir-0 1048583 drwx------. 2 65534 65534 4096 Nov 10 09:19 dir-1000 1048584 drwx------. 2 65534 65534 4096 Nov 10 09:19 dir-3000 1048579 -rw-------. 1 0 0 0 Nov 10 09:19 file-0 1048580 -rw-------. 1 65534 65534 0 Nov 10 09:19 file-1000 1048581 -rw-------. 1 65534 65534 0 Nov 10 09:19 file-3000 + : 'Files on the NFS mountpoint (from host PoV):' + ls -lani /mnt/nfs/ total 20 1048578 drwxr-xr-x. 5 1000 1000 4096 Nov 10 09:19 . 786433 drwxr-xr-x. 3 0 0 4096 May 16 16:08 .. 1048582 drwx------. 2 1000 1000 4096 Nov 10 09:19 dir-0 1048583 drwx------. 2 4294967294 4294967294 4096 Nov 10 09:19 dir-1000 1048584 drwx------. 2 4294967294 4294967294 4096 Nov 10 09:19 dir-3000 1048579 -rw-------. 1 1000 1000 0 Nov 10 09:19 file-0 1048580 -rw-------. 1 4294967294 4294967294 0 Nov 10 09:19 file-1000 1048581 -rw-------. 1 4294967294 4294967294 0 Nov 10 09:19 file-3000 Alban On Mon, 2 Nov 2020 at 18:48, Sargun Dhillon <sargun@xxxxxxxxx> wrote: > > This is effectively a resend, but re-based atop Anna's current tree. I can > add the samples back in an another patchset. > > Right now, it is possible to mount NFS with an non-matching super block > user ns, and NFS sunrpc user ns. This (for the user) results in an awkward > set of interactions if using anything other than auth_null, where the UIDs > being sent to the server are different than the local UIDs being checked. > This can cause "breakage", where if you try to communicate with the NFS > server with any other set of mappings, it breaks. > > This is after the initial v5.10 merge window, so hopefully this patchset > can be reconsidered, and maybe we can make forward progress? I think that > it takes a relatively conservative approach in enabling user namespaces, > and it prevents the case where someone is using auth_gss (for now), as the > mappings are non-trivial. > > Changes since v3: > * Rebase atop Anna's tree > Changes since v2: > * Removed samples > * Split out NFSv2/v3 patchset from NFSv4 patchset > * Added restrictions around use > Changes since v1: > * Added samples > > Sargun Dhillon (2): > NFS: NFSv2/NFSv3: Use cred from fs_context during mount > NFSv4: Refactor NFS to use user namespaces > > fs/nfs/client.c | 10 ++++++++-- > fs/nfs/nfs4client.c | 27 ++++++++++++++++++++++++++- > fs/nfs/nfs4idmap.c | 2 +- > fs/nfs/nfs4idmap.h | 3 ++- > 4 files changed, 37 insertions(+), 5 deletions(-) > > > base-commit: 8c39076c276be0b31982e44654e2c2357473258a > -- > 2.25.1 >