On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > Hi,
> >
> > I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS
> > share with a different user namespace. fsopen() is done in the
> > container namespaces (user, mnt and net namespaces) while fsconfig(),
> > fsmount() and move_mount() are done on the host namespaces. The mount
> > on the host is available in the container via mount propagation from
> > the host mount.
> >
> > With this, the files on the NFS server with uid 0 are available in the
> > container with uid 0. On the host, they are available with uid
> > 4294967294 (make_kuid(&init_user_ns, -2)).
> >
>
> Can someone please tell me what is broken with the _current_ design
> before we start trying to push "fixes" that clearly break it?

Currently, the mechanism for mounting nfs4 in a user namespace is as
follows (a rough C sketch of this sequence is included below):

Parent: fork()
Child: setns(userns)
C: fsopen("nfs4") = 3
C->P: Send FD 3
P: fsconfig()...
P: fsmount()... (This is where the CAP_SYS_ADMIN check happens)

Right now, when you mount an NFS filesystem in a non-init user namespace,
and you have UIDs / GIDs on, the UIDs / GIDs which are sent to the server
are not the UIDs from the mounting namespace; instead they are the UIDs
from the init user ns. The reason for this is that you can call
fsopen("nfs4") in the unprivileged namespace, and that configures
fs_context with all the right information for that user namespace, but we
currently require CAP_SYS_ADMIN in the init user namespace to call
fsmount(). This means that the superblock's user namespace is set
"correctly" to the container's, but there is no way for nfs4idmap to
consume an unprivileged user namespace.

This behaviour happens "the other way" as well, where the UID in the
container may be 0, but the corresponding kuid is 1000. When a response
from an NFS server comes in, we decode it according to the idmap
userns[1]. The userns used to create the idmap is captured at fsmount()
time, not at fsopen() time. So, even if the filesystem is in the user
namespace, and the server responds with UID 0, it will come up with an
unmapped UID. This is because we do:

Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS from_kuid(container_ns, 0) -> invalid uid

This is broken behaviour, in my humble opinion, as it makes it impossible
to use NFSv4 (and v3 for that matter) out of the box with unprivileged
user namespaces. At least in our environment, using usernames / GSS isn't
an option, so we have to rely on UIDs being set correctly [at least from
the container's perspective].

> The current design assumes that the user namespace being used is the
> one where the mount itself is performed. That means that the uids and
> gids or usernames and groupnames that go on the wire match the uids and
> gids of the container in which the mount occurred.
>

Right now, NFS does not have the ability for the fsmount() call to be
made in an unprivileged user namespace. We can change that behaviour
elsewhere if we want, but it's orthogonal to this.

> The assumption is that the server has authenticated that client as
> belonging to a domain that it recognises (either through strong
> RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
> addresses to a list of acceptable clients).
>

I added a rejection for upcalls because upcalls can happen in the init
namespaces. We can drop that restriction from the nfs4 patch if you'd
like.
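For concreteness, here is roughly what that sequence looks like from
userspace (a sketch only: the mount source, target path, and the
SCM_RIGHTS fd-passing are placeholders and error handling is elided; this
is not the actual reproducer):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>

/* Child: inside the container's user (and mnt/net) namespaces. */
static int container_fsopen(int userns_fd)
{
	setns(userns_fd, CLONE_NEWUSER);
	int fsfd = syscall(SYS_fsopen, "nfs4", 0);
	/* fsfd gets handed back to the parent over a unix socket
	 * (SCM_RIGHTS); that plumbing is elided here. */
	return fsfd;
}

/* Parent: in the init namespaces, with CAP_SYS_ADMIN. */
static int host_mount(int fsfd)
{
	syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source",
		"server:/export", 0);			/* placeholder */
	syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	int mntfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
	return syscall(SYS_move_mount, mntfd, "", AT_FDCWD, "/mnt",
		       MOVE_MOUNT_F_EMPTY_PATH);		/* placeholder */
}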
I *believe* (and I'm more than a little out of my depth here) that the
request-key handler gets called with the *network namespace* of the NFS
mount, but the userns is a privileged one, allowing for potential hazards.
The reason I added that block there is that I didn't imagine anyone was
running NFS in an unprivileged user namespace and relying on upcalls
(potentially into privileged namespaces) in order to do authz.

> If you go ahead and change the user namespace on the client without
> going through the mount process again to mount a different super block
> with a different user namespace, then you will now get the exact same
> behaviour as if you do that with any other filesystem.

Not exactly, because other filesystems *only* use the s_user_ns for
conversion of UIDs, whereas NFS uses the current_cred() acquired at mount
time, which doesn't match s_user_ns. This leads to:

1. Mistranslated UIDs when encoding RPCs
2. UIDs / GIDs exposed to the VFS that do not match the user ns

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>

Thanks,
Sargun

[1]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4idmap.c#L782
[2]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4client.c#L1154
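P.S. To make the translation mismatch concrete, the decode path
effectively ends up doing something like the following (a sketch with
stand-in names, not the literal kernel call chain; container_userns here
stands for the container's user namespace / the superblock's s_user_ns):

/* The idmap was created against the namespace captured at fsmount()
 * time, which today is the init user namespace. */
kuid_t kuid = make_kuid(&init_user_ns, 0);	/* server replied with uid 0 */
uid_t uid = from_kuid(container_userns, kuid);	/* what the container sees  */
/* kuid 0 has no mapping in the container's userns, so uid == (uid_t)-1. */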