On Thu, Dec 02, 2021 at 11:50:43AM -0600, Seth Forshee wrote: > On Tue, Nov 30, 2021 at 01:10:32PM +0100, Christian Brauner wrote: > > From: Christian Brauner <christian.brauner@xxxxxxxxxx> > > > > In previous patches we added new and modified existing helpers to handle > > idmapped mounts of filesystems mounted with an idmapping. In this final > > patch we convert all relevant places in the vfs to actually pass the > > filesystem's idmapping into these helpers. > > > > With this the vfs is in shape to handle idmapped mounts of filesystems > > mounted with an idmapping. Note that this is just the generic > > infrastructure. Actually adding support for idmapped mounts to a > > filesystem mountable with an idmapping is follow-up work. > > > > In this patch we extend the definition of an idmapped mount from a mount > > that that has the initial idmapping attached to it to a mount that has > > an idmapping attached to it which is not the same as the idmapping the > > filesystem was mounted with. > > > > As before we do not allow the initial idmapping to be attached to a > > mount. In addition this patch prevents that the idmapping the filesystem > > was mounted with can be attached to a mount created based on this > > filesystem. > > > > This has multiple reasons and advantages. First, attaching the initial > > idmapping or the filesystem's idmapping doesn't make much sense as in > > both cases the values of the i_{g,u}id and other places where k{g,u}ids > > are used do not change. Second, a user that really wants to do this for > > whatever reason can just create a separate dedicated identical idmapping > > to attach to the mount. Third, we can continue to use the initial > > idmapping as an indicator that a mount is not idmapped allowing us to > > continue to keep passing the initial idmapping into the mapping helpers > > to tell them that something isn't an idmapped mount even if the > > filesystem is mounted with an idmapping. > > > > Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@xxxxxxxxxx (v1) > > Cc: Seth Forshee <sforshee@xxxxxxxxxxxxxxxx> > > Cc: Amir Goldstein <amir73il@xxxxxxxxx> > > Cc: Christoph Hellwig <hch@xxxxxx> > > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > > CC: linux-fsdevel@xxxxxxxxxxxxxxx > > Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx> > > --- > > /* v2 */ > > - Amir Goldstein <amir73il@xxxxxxxxx>: > > - Continue passing down the initial idmapping to xfs. Since xfs cannot > > and is not planned to support idmapped superblocks don't mislead > > readers in the code by passing down the filesystem idmapping. It > > will be the initial idmapping always anyway. > > - Include mnt_idmapping.h header into fs/namespace.c > > --- > > fs/namespace.c | 37 +++++++++++++++++++++++++------------ > > fs/open.c | 7 ++++--- > > fs/posix_acl.c | 8 ++++---- > > include/linux/fs.h | 17 +++++++++-------- > > security/commoncap.c | 9 ++++----- > > 5 files changed, 46 insertions(+), 32 deletions(-) > > > > diff --git a/fs/namespace.c b/fs/namespace.c > > index 4994b816a74c..ccefede4ba1b 100644 > > --- a/fs/namespace.c > > +++ b/fs/namespace.c > > @@ -31,6 +31,7 @@ > > #include <uapi/linux/mount.h> > > #include <linux/fs_context.h> > > #include <linux/shmem_fs.h> > > +#include <linux/mnt_idmapping.h> > > > > #include "pnode.h" > > #include "internal.h" > > @@ -561,7 +562,7 @@ static void free_vfsmnt(struct mount *mnt) > > struct user_namespace *mnt_userns; > > > > mnt_userns = mnt_user_ns(&mnt->mnt); > > - if (mnt_userns != &init_user_ns) > > + if (!initial_mapping(mnt_userns)) > > put_user_ns(mnt_userns); > > kfree_const(mnt->mnt_devname); > > #ifdef CONFIG_SMP > > @@ -965,6 +966,7 @@ static struct mount *skip_mnt_tree(struct mount *p) > > struct vfsmount *vfs_create_mount(struct fs_context *fc) > > { > > struct mount *mnt; > > + struct user_namespace *fs_userns; > > > > if (!fc->root) > > return ERR_PTR(-EINVAL); > > @@ -982,6 +984,10 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc) > > mnt->mnt_mountpoint = mnt->mnt.mnt_root; > > mnt->mnt_parent = mnt; > > > > + fs_userns = mnt->mnt.mnt_sb->s_user_ns; > > + if (!initial_mapping(fs_userns)) > > + mnt->mnt.mnt_userns = get_user_ns(fs_userns); > > + > > Won't this get be leaked if mnt_userns is overwritten by > do_idmap_mount()? Yep, good catch. Thanks for spotting this! I fixed that up in the tree. We need a put in do_idmap_mount() as you've mentioned for the !initial_mapping() case. > > > lock_mount_hash(); > > list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts); > > unlock_mount_hash(); > > @@ -1072,7 +1078,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, > > > > atomic_inc(&sb->s_active); > > mnt->mnt.mnt_userns = mnt_user_ns(&old->mnt); > > - if (mnt->mnt.mnt_userns != &init_user_ns) > > + if (!initial_mapping(mnt->mnt.mnt_userns)) > > mnt->mnt.mnt_userns = get_user_ns(mnt->mnt.mnt_userns); > > mnt->mnt.mnt_sb = sb; > > mnt->mnt.mnt_root = dget(root); > > @@ -3927,10 +3933,18 @@ static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt) > > static int can_idmap_mount(const struct mount_kattr *kattr, struct mount *mnt) > > { > > struct vfsmount *m = &mnt->mnt; > > + struct user_namespace *fs_userns = m->mnt_sb->s_user_ns; > > > > if (!kattr->mnt_userns) > > return 0; > > > > + /* > > + * Creating an idmapped mount with the filesystem wide idmapping > > + * doesn't make sense so block that. We don't allow mushy semantics. > > + */ > > + if (kattr->mnt_userns == fs_userns) > > + return -EINVAL; > > + > > /* > > * Once a mount has been idmapped we don't allow it to change its > > * mapping. It makes things simpler and callers can just create > > @@ -3943,12 +3957,8 @@ static int can_idmap_mount(const struct mount_kattr *kattr, struct mount *mnt) > > if (!(m->mnt_sb->s_type->fs_flags & FS_ALLOW_IDMAP)) > > return -EINVAL; > > > > - /* Don't yet support filesystem mountable in user namespaces. */ > > - if (m->mnt_sb->s_user_ns != &init_user_ns) > > - return -EINVAL; > > - > > /* We're not controlling the superblock. */ > > - if (!capable(CAP_SYS_ADMIN)) > > + if (!ns_capable(fs_userns, CAP_SYS_ADMIN)) > > return -EPERM; > > > > /* Mount has already been visible in the filesystem hierarchy. */ > > @@ -4133,13 +4143,16 @@ static int build_mount_idmapped(const struct mount_attr *attr, size_t usize, > > } > > > > /* > > - * The init_user_ns is used to indicate that a vfsmount is not idmapped. > > - * This is simpler than just having to treat NULL as unmapped. Users > > - * wanting to idmap a mount to init_user_ns can just use a namespace > > - * with an identity mapping. > > + * The initial idmapping cannot be used to create an idmapped > > + * mount. Attaching the initial idmapping doesn't make much sense > > + * as it is an identity mapping. A user can just create a dedicated > > + * identity mapping to achieve the same result. We also use the > > + * initial idmapping as an indicator of a mount that is not > > + * idmapped. It can simply be passed into helpers that are aware of > > + * idmapped mounts as a convenient shortcut. > > */ > > This isn't completely accurate, is it? If sb->s_user_ns != init_user_ns, > an idmap to init_user_ns isn't an idendity mapping. Yep, that's correct. Let me reword that. Thank you for the reviews! Christian