On Mon, Feb 17, 2020 at 10:58 PM James Bottomley
<James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> This implementation reverse shifts according to the user_ns belonging
> to the mnt_ns. So if the vfsmount has the newly introduced flag
> MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
> then we shift back using the user_ns and an optional mnt_userns (which
> belongs to the struct mount) before committing to the underlying
> filesystem.
>
> For example, if a user_ns is created where interior (fake root, uid 0)
> is mapped to kernel uid 100000 then writes from interior root normally
> go to the filesystem at the kernel uid. However, if MNT_SHIFT is set,
> they will be shifted back to write at uid 0, meaning we can bind mount
> real image filesystems to user_ns protected faker root.
>
> In essence there are several things which have to be done for this to
> occur safely. Firstly for all operations on the filesystem, new
> credentials have to be installed where fsuid and fsgid are set to the
> *interior* values. Next all inodes used from the filesystem have to
> have i_uid and i_gid shifted back to the kernel values and attributes
> set from user space have to have ia_uid and ia_gid shifted from the
> kernel values to the interior values. The capability checks have to
> be done using ns_capable against the kernel values, but the inode
> capability checks have to be done against the shifted ids.
>
> Since creating a new credential is a reasonably expensive proposition
> and we have to shift and unshift many times during path walking, a
> cached copy of the shifted credential is saved to a newly created
> place in the task structure. This serves the dual purpose of allowing
> us to use a pre-prepared copy of the shifted credentials and also
> allows us to recognise whenever the shift is actually in effect (the
> cached shifted credential pointer being equal to the current_cred()
> pointer).
>
> To get this all to work, we have a check for the vfsmount flag and the
> user_ns gating a shifting of the credentials over all user space
> entries to filesystem functions. In theory the path has to be present
> everywhere we do this, so we can check the vfsmount flags. However,
> for lower level functions we can cheat this path check of vfsmount
> simply to check whether a shifted credential is in effect or not to
> gate things like the inode permission check, which means the path
> doesn't have to be threaded all the way through the permission
> checking functions. if the credential is shifted check passes, we can
> also be sure that the current user_ns is the same as the mnt->user_ns,
> so we can use it and thus have no need of the struct mount at the
> point of the shift.
>
> Although the shift can be effected simply by executing
> do_reconfigure_mnt with MNT_SHIFT in the flags, this patch only
> contains the shifting mechanisms. The follow on patch wires up the
> user visible API for turning the flag on.
>
> Signed-off-by: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
>
> ---
[...]
> @@ -3828,6 +3884,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>         if (IS_ERR(dentry))
>                 return PTR_ERR(dentry);
>
> +       cred = change_userns_creds(&path);
>         if (!IS_POSIXACL(path.dentry->d_inode))
>                 mode &= ~current_umask();
>         error = security_path_mknod(&path, dentry, mode, dev);
[...]
> +       cred = change_userns_creds(&path);
>         if (!IS_POSIXACL(path.dentry->d_inode))
>                 mode &= ~current_umask();
>         error = security_path_mkdir(&path, dentry, mode);
[...]
> +       cred = change_userns_creds(&path);
>         error = security_path_symlink(&path, dentry, from->name);

I see a pattern above. Perhaps change_userns_creds() should be inside
the security_path_XXX hooks?

Perhaps the auto-shifting bind mount should be implemented by an LSM?
After all, "gating" access to the filesystem is part of what LSMs do,
and uid (or fsuid) shifting is a sort of "gating".
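
To make the first suggestion concrete, here is a rough user-space sketch
in C. It is not kernel code and does not use any kernel API: every name
in it (model_mount, model_cred, model_path_hook and so on) is made up
for illustration. It models only the uid arithmetic from the example in
the patch description (interior uid 0 mapped to kernel uid 100000), and
it assumes, as a simplification, that ids outside the mapped range are
left untouched. The point is only that the shift lives in one hook-like
function instead of being repeated in each caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these are kernel types. */
struct model_mount {
    bool shift;             /* stands in for MNT_SHIFT */
    unsigned int ns_base;   /* kernel uid that interior uid 0 maps to */
    unsigned int ns_count;  /* number of ids in the mapped range */
};

struct model_cred {
    unsigned int fsuid;     /* id used for filesystem writes */
};

/* Map a kernel uid back to the interior uid when the mount asks for it. */
unsigned int model_shift_uid(const struct model_mount *m, unsigned int kuid)
{
    assert(m != NULL);
    if (!m->shift)
        return kuid;
    /* Simplifying assumption: ids outside the range are not shifted. */
    if (kuid < m->ns_base || kuid - m->ns_base >= m->ns_count)
        return kuid;
    return kuid - m->ns_base;
}

/*
 * Plays the role of a security_path_XXX hook in this model: the
 * credential shift happens here once, so callers do not repeat it.
 */
struct model_cred model_path_hook(const struct model_mount *m,
                                  struct model_cred cred)
{
    cred.fsuid = model_shift_uid(m, cred.fsuid);
    return cred;
}

/* Each operation only calls the hook; it returns the fsuid it would use. */
unsigned int model_mkdir(const struct model_mount *m, struct model_cred cred)
{
    return model_path_hook(m, cred).fsuid;
}

unsigned int model_symlink(const struct model_mount *m, struct model_cred cred)
{
    return model_path_hook(m, cred).fsuid;
}
```

With ns_base set to 100000, model_mkdir() called with fsuid 100000
yields 0 when shift is set and 100000 when it is not, which matches the
behaviour the patch description gives for MNT_SHIFT.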

Heck, there should already be a way to attach a security context to a
mount, right? So you don't even need a new UAPI to configure the
auto-shifting LSM. And you could use standard security.* xattrs for
persistent configuration of the auto-shifting filesystem sections,
which is something that you wanted to solve anyway, right?

Apologies if my suggestions are flawed by a misunderstanding of the
feature.

Thanks,
Amir.