On Wed, Nov 11, 2020 at 02:38:11PM +0000, Trond Myklebust wrote: > On Wed, 2020-11-11 at 11:12 +0000, Sargun Dhillon wrote: > > On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote: > > > On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote: > > > > Hi, > > > > > > > > I tested the patches on top of 5.10.0-rc3+ and I could mount an > > > > NFS > > > > share with a different user namespace. fsopen() is done in the > > > > container namespaces (user, mnt and net namespaces) while > > > > fsconfig(), > > > > fsmount() and move_mount() are done on the host namespaces. The > > > > mount > > > > on the host is available in the container via mount propagation > > > > from > > > > the host mount. > > > > > > > > With this, the files on the NFS server with uid 0 are available > > > > in > > > > the > > > > container with uid 0. On the host, they are available with uid > > > > 4294967294 (make_kuid(&init_user_ns, -2)). > > > > > > > > > > Can someone please tell me what is broken with the _current_ design > > > before we start trying to push "fixes" that clearly break it? > > Currently the mechanism of mounting nfs4 in a user namespace is as > > follows: > > > > Parent: fork() > > Child: setns(userns) > > C: fsopen("nfs4") = 3 > > C->P: Send FD 3 > > P: FSConfig... > > P: fsmount... (This is where the CAP_SYS_ADMIN check happens)) > > > > > > Right now, when you mount an NFS filesystem in a non-init user > > namespace, and you have UIDs / GIDs on, the UIDs / GIDs which > > are sent to the server are not the UIDs from the mounting namespace, > > instead they are the UIDs from the init user ns. > > > > The reason for this is that you can call fsopen("nfs4") in the > > unprivileged > > namespace, and that configures fs_context with all the right > > information for > > that user namespace, but we currently require CAP_SYS_ADMIN in the > > init user > > namespace to call fsmount. This means that the superblock's user > > namespace is > > set "correctly" to the container, but there's absolutely no way > > nfs4uidmap > > to consume an unprivileged user namespace. > > > > This behaviour happens "the other way" as well, where the UID in the > > container > > may be 0, but the corresponding kuid is 1000. When a response from an > > NFS > > server comes in we decode it according to the idmap userns[1]. The > > userns > > used to get create idmap is generated at fsmount time, and not as > > fsopen > > time. So, even if the filesystem is in the user namespace, and the > > server > > responds with UID 0, it'll come up with an unmapped UID. > > > > This is because we do > > Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS > > from_kuid(container_ns, 0) -> invalid uid > > > > This is broken behaviour, in my humble opinion as is it makes it > > impossible to > > use NFSv4 (and v3 for that matter) out of the box with unprivileged > > user > > namespaces. At least in our environment, using usernames / GSS isn't > > an option, > > so we have to rely on UIDs being set correctly [at least from the > > container's > > perspective]. > > > > The current code for setting server->cred was developed independently > of fsopen() (and predates it actually). I'm fine with the change to > have server->cred be the cred of the user that called fsopen(). That's > in line with what we used to do for sys_mount(). > Just curious, without FS_USERNS, how were you mounting NFSv4 in an unprivileged user ns? > However all the other stuff to throw errors when the user namespace is > not init_user_ns introduces massive regressions. > I can remove that and respin the patch. How do you feel about that? I would still like to keep the log lines though because it is a uapi change. I am worried that someone might exercise this path with GSS and allow for upcalls into the main namespaces by accident -- or be confused of why they're seeing upcalls "in a different namespace". Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from fs_context during mount") without any changes? I can respin ("NFSv4: Refactor NFS to use user namespaces") without: /* * nfs4idmap is not fully isolated by user namespaces. It is currently * only network namespace aware. If upcalls never happen, we do not * need to worry as nfs_client instances aren't shared between * user namespaces. */ if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns && !(server->caps & NFS_CAP_UIDGID_NOMAP)) { error = -EINVAL; errorf(fc, "Mount credentials are from non init user namespace and ID mapping is enabled. This is not allowed."); goto error; } (and making it so we can call idmap_userns) > > > > > > The current design assumes that the user namespace being used is > > > the one where > > > the mount itself is performed. That means that the uids and gids or > > > usernames > > > and groupnames that go on the wire match the uids and gids of the > > > container in > > > which the mount occurred. > > > > > > > Right now, NFS does not have the ability for the fsmount() call to be > > called in an unprivileged user namespace. We can change that > > behaviour > > elsewhere if we want, but it's orthogonal to this. > > > > > The assumption is that the server has authenticated that client as > > > belonging to a domain that it recognises (either through strong > > > RPCSEC_GSS/krb5 authentication, or through weaker matching of IP > > > addresses to a list of acceptable clients). > > > > > I added a rejection for upcalls because upcalls can happen in the > > init > > namespaces. We can drop that restriction from the nfs4 patch if you'd > > like. I > > *believe* (and I'm not a little out of my depth) that the request-key > > handler gets called with the *network namespace* of the NFS mount, > > but the userns is a privileged one, allowing for potential hazards. > > > > The idmapper already rejects upcalls to the keyring '/sbin/request-key' > utility if you're running with your own user namespace. > > Quite frankly, switching to using the keyring was a mistake which I'd > undo if I could. Aside from not supporting containers, it is horribly > slow due to requiring a full process startup/teardown for every upcall, > so it scales poorly to large numbers of identities (particularly with > an operation like readdir() in which you're doing serial upcalls). > > However nothing stops you from using the old NFSv4 idmapper daemon > (a.k.a. rpc.idmapd) in the context of the container that called > fsopen() so that it can translate identities correctly using whatever > userspace tools (ldap, sssd, winbind...) that the container has > configured. > 1. We see this as a potential security risk [this being upcalls] into the unconfined portion of the system. Although, I'm sure that the userspace handlers are written perfectly well, it allows for information leakage to occur. 2. Is there a way to do this for NFSv3? 3. Can rpc.idmapd get the user namespace that the call is from (and is the keyring per-userns?). In general, I think that this change follows the principal of least surprise. > > The reason I added that block there is that I didn't imagine anyone > > was running > > NFS in an unprivileged user namespace, and relying on upcalls > > (potentially into > > privileged namespaces) in order to do authz. > > > > > > > If you go ahead and change the user namespace on the client without > > > going through the mount process again to mount a different super > > > block > > > with a different user namespace, then you will now get the exact > > > same > > > behaviour as if you do that with any other filesystem. > > > > Not exactly, because other filesystems *only* use the s_user_ns for > > conversion > > of UIDs, whereas NFS uses the currend_cred() acquired at mount time, > > which > > doesn't match s_user_ns, leading to this behaviour. > > > > 1. Mistranslated UIDs in encoding RPCs > > 2. The UID / GID exposed to VFS do not match the user ns. > > > > > > > > -- > > > Trond Myklebust > > > Linux NFS client maintainer, Hammerspace > > > trond.myklebust@xxxxxxxxxxxxxxx > > > > > > > > -Thanks, > > Sargun > > > > [1]: > > https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4idmap.c#L782 > > [2]: > > https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4client.c#L1154 > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@xxxxxxxxxxxxxxx > >