On Thu, Jul 07, 2022 at 09:58:47AM +0200, Miklos Szeredi wrote: > On Wed, 6 Jul 2022 at 15:59, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > This cycle we added support for mounting overlayfs on top of idmapped mounts. > > Recently I've started looking into potential corner cases when trying to add > > additional tests and I noticed that reporting for POSIX ACLs is currently wrong > > when using idmapped layers with overlayfs mounted on top of it. > > > > I have sent out an patch that fixes this and makes POSIX ACLs work correctly > > but the patch is a bit bigger and we're already at -rc5 so I recommend we > > simply don't raise SB_POSIXACL when idmapped layers are used. Then we can fix > > the VFS part described below for the next merge window so we can have good > > exposure in -next. > > > > I'm going to give a rather detailed explanation to both the origin of the > > problem and mention the solution so people know what's going on. > > > > Let's assume the user creates the following directory layout and they have a > > rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would > > expect files on your host system to be owned. For example, ~/.bashrc for your > > regular user would be owned by 1000:1000 and /root/.bashrc would be owned by > > 0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs > > filesystem. > > > > The user chooses to set POSIX ACLs using the setfacl binary granting the user > > with uid 4 read, write, and execute permissions for their .bashrc file: > > > > setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc > > > > Now they to expose the whole rootfs to a container using an idmapped mount. So > > they first create: > > > > mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap} > > mkdir -pv /vol/contpool/ctrover/{over,work} > > chown 10000000:10000000 /vol/contpool/ctrover/{over,work} > > > > The user now creates an idmapped mount for the rootfs: > > > > mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \ > > /var/lib/lxc/c2/rootfs \ > > /vol/contpool/lowermap > > > > This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc > > which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at > > /vol/contpool/lowermap/home/ubuntu/.bashrc. > > > > Assume the user wants to expose these idmapped mounts through an overlayfs > > mount to a container. > > > > mount -t overlay overlay \ > > -o lowerdir=/vol/contpool/lowermap, \ > > upperdir=/vol/contpool/overmap/over, \ > > workdir=/vol/contpool/overmap/work \ > > /vol/contpool/merge > > > > The user can do this in two ways: > > > > (1) Mount overlayfs in the initial user namespace and expose it to the > > container. > > (2) Mount overlayfs on top of the idmapped mounts inside of the container's > > user namespace. > > > > Let's assume the user chooses the (1) option and mounts overlayfs on the host > > and then changes into a container which uses the idmapping 0:10000000:65536 > > which is the same used for the two idmapped mounts. > > > > Now the user tries to retrieve the POSIX ACLs using the getfacl command > > > > getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc > > > > and to their surprise they see: > > > > # file: vol/contpool/merge/home/ubuntu/.bashrc > > # owner: 1000 > > # group: 1000 > > user::rw- > > user:4294967295:rwx > > group::r-- > > mask::rwx > > other::r-- > > > > indicating the the uid wasn't correctly translated according to the idmapped > > mount. The problem is how we currently translate POSIX ACLs. Let's inspect the > > callchain in this example: > > > > idmapped mount /vol/contpool/merge: 0:10000000:65536 > > caller's idmapping: 0:10000000:65536 > > overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ > > > > sys_getxattr() > > -> path_getxattr() > > -> getxattr() > > -> do_getxattr() > > |> vfs_getxattr() > > | -> __vfs_getxattr() > > | -> handler->get == ovl_posix_acl_xattr_get() > > | -> ovl_xattr_get() > > | -> vfs_getxattr() > > | -> __vfs_getxattr() > > | -> handler->get() /* lower filesystem callback */ > > |> posix_acl_fix_xattr_to_user() > > { > > 4 = make_kuid(&init_user_ns, 4); > > 4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4); > > /* FAILURE */ > > -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4); > > } > > > > If the user chooses to use option (2) and mounts overlayfs on top of idmapped > > mounts inside the container things don't look that much better: > > > > idmapped mount /vol/contpool/merge: 0:10000000:65536 > > caller's idmapping: 0:10000000:65536 > > overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 > > > > sys_getxattr() > > -> path_getxattr() > > -> getxattr() > > -> do_getxattr() > > |> vfs_getxattr() > > | -> __vfs_getxattr() > > | -> handler->get == ovl_posix_acl_xattr_get() > > | -> ovl_xattr_get() > > | -> vfs_getxattr() > > | -> __vfs_getxattr() > > | -> handler->get() /* lower filesystem callback */ > > |> posix_acl_fix_xattr_to_user() > > { > > 4 = make_kuid(&init_user_ns, 4); > > 4 = mapped_kuid_fs(&init_user_ns, 4); > > /* FAILURE */ > > -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4); > > } > > > > As is easily seen the problem arises because the idmapping of the lower mount > > isn't taken into account as all of this happens in do_gexattr(). But > > do_getxattr() is always called on an overlayfs mount and inode and thus cannot > > possible take the idmapping of the lower layers into account. > > > > This problem is similar for fscaps but there the translation happens as part of > > vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain: > > > > setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc > > > > The expected outcome here is that we'll receive the cap_net_raw capability as > > we are able to map the uid associated with the fscap to 0 within our container. > > IOW, we want to see 0 as the result of the idmapping translations. > > > > If the user chooses option (1) we get the following callchain for fscaps: > > > > idmapped mount /vol/contpool/merge: 0:10000000:65536 > > caller's idmapping: 0:10000000:65536 > > overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ > > > > sys_getxattr() > > -> path_getxattr() > > -> getxattr() > > -> do_getxattr() > > -> vfs_getxattr() > > -> xattr_getsecurity() > > -> security_inode_getsecurity() ________________________________ > > -> cap_inode_getsecurity() | | > > { V | > > 10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); | > > 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); | > > /* Expected result is 0 and thus that we own the fscap. */ | > > 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); | > > } | > > -> vfs_getxattr_alloc() | > > -> handler->get == ovl_other_xattr_get() | > > -> vfs_getxattr() | > > -> xattr_getsecurity() | > > -> security_inode_getsecurity() | > > -> cap_inode_getsecurity() | > > { | > > 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); | > > 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); | > > 10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); | > > |____________________________________________________________________| > > } > > -> vfs_getxattr_alloc() > > -> handler->get == /* lower filesystem callback */ > > > > And if the user chooses option (2) we get: > > > > idmapped mount /vol/contpool/merge: 0:10000000:65536 > > caller's idmapping: 0:10000000:65536 > > overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 > > > > sys_getxattr() > > -> path_getxattr() > > -> getxattr() > > -> do_getxattr() > > -> vfs_getxattr() > > -> xattr_getsecurity() > > -> security_inode_getsecurity() _______________________________ > > -> cap_inode_getsecurity() | | > > { V | > > 10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); | > > 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); | > > /* Expected result is 0 and thus that we own the fscap. */ | > > 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); | > > } | > > -> vfs_getxattr_alloc() | > > -> handler->get == ovl_other_xattr_get() | > > |-> vfs_getxattr() | > > -> xattr_getsecurity() | > > -> security_inode_getsecurity() | > > -> cap_inode_getsecurity() | > > { | > > 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); | > > 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); | > > 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); | > > |____________________________________________________________________| > > } > > -> vfs_getxattr_alloc() > > -> handler->get == /* lower filesystem callback */ > > > > We can see how the translation happens correctly in those cases as the > > conversion happens within the vfs_getxattr() helper. > > > > For POSIX ACLs we need to do something similar. However, in contrast to fscaps > > we cannot apply the fix directly to the kernel internal posix acl data > > structure as this would alter the cached values and would also require a rework > > of how we currently deal with POSIX ACLs in general which almost never take the > > filesystem idmapping into account (the noteable exception being FUSE but even > > there the implementation is special) and instead retrieve the raw values based > > on the initial idmapping. > > > > The correct values are then generated right before returning to userspace. The > > fix for this is to move taking the mount's idmapping into account directly in > > vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user(). > > > > To this end we simply move the idmapped mount translation into a separate step > > performed in vfs_{g,s}etxattr() instead of in > > posix_acl_fix_xattr_{from,to}_user(). > > > > To see how this fixes things let's go back to the original example. Assume the > > user chose option (1) and mounted overlayfs on top of idmapped mounts on the > > host: > > > > idmapped mount /vol/contpool/merge: 0:10000000:65536 > > caller's idmapping: 0:10000000:65536 > > overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ > > > > sys_getxattr() > > -> path_getxattr() > > -> getxattr() > > -> do_getxattr() > > |> vfs_getxattr() > > | |> __vfs_getxattr() > > | | -> handler->get == ovl_posix_acl_xattr_get() > > | | -> ovl_xattr_get() > > | | -> vfs_getxattr() > > | | |> __vfs_getxattr() > > | | | -> handler->get() /* lower filesystem callback */ > > | | |> posix_acl_getxattr_idmapped_mnt() > > | | { > > | | 4 = make_kuid(&init_user_ns, 4); > > | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4); > > | | 10000004 = from_kuid(&init_user_ns, 10000004); > > | | |_______________________ > > | | } | > > | | | > > | |> posix_acl_getxattr_idmapped_mnt() | > > | { | > > | V > > | 10000004 = make_kuid(&init_user_ns, 10000004); > > | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004); > > | 10000004 = from_kuid(&init_user_ns, 10000004); > > | } |_________________________________________________ > > | | > > | | > > |> posix_acl_fix_xattr_to_user() | > > { V > > 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004); > > /* SUCCESS */ > > 4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004); > > } > > > > And similarly if the user chooses option (1) and mounted overayfs on top of > > idmapped mounts inside the container: > > > > idmapped mount /vol/contpool/merge: 0:10000000:65536 > > caller's idmapping: 0:10000000:65536 > > overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 > > > > sys_getxattr() > > -> path_getxattr() > > -> getxattr() > > -> do_getxattr() > > |> vfs_getxattr() > > | |> __vfs_getxattr() > > | | -> handler->get == ovl_posix_acl_xattr_get() > > | | -> ovl_xattr_get() > > | | -> vfs_getxattr() > > | | |> __vfs_getxattr() > > | | | -> handler->get() /* lower filesystem callback */ > > | | |> posix_acl_getxattr_idmapped_mnt() > > | | { > > | | 4 = make_kuid(&init_user_ns, 4); > > | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4); > > | | 10000004 = from_kuid(&init_user_ns, 10000004); > > | | |_______________________ > > | | } | > > | | | > > | |> posix_acl_getxattr_idmapped_mnt() | > > | { V > > | 10000004 = make_kuid(&init_user_ns, 10000004); > > | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004); > > | 10000004 = from_kuid(0(&init_user_ns, 10000004); > > | |_________________________________________________ > > | } | > > | | > > |> posix_acl_fix_xattr_to_user() | > > { V > > 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004); > > /* SUCCESS */ > > 4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004); > > } > > > > The last remaining problem we need to fix here is ovl_get_acl(). During > > ovl_permission() overlayfs will call: > > > > ovl_permission() > > -> generic_permission() > > -> acl_permission_check() > > -> check_acl() > > -> get_acl() > > -> inode->i_op->get_acl() == ovl_get_acl() > > > get_acl() /* on the underlying filesystem) > > ->inode->i_op->get_acl() == /*lower filesystem callback */ > > -> posix_acl_permission() > > > > passing through the get_acl request to the underlying filesystem. This will > > retrieve the acls stored in the lower filesystem without taking the idmapping > > of the underlying mount into account as this would mean altering the cached > > values for the lower filesystem. The simple solution is to have ovl_get_acl() > > simply duplicate the ACLs, update the values according to the idmapped mount > > and return it to acl_permission_check() so it can be used in > > posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be released > > right after. > > > > Link: https://github.com/brauner/mount-idmapped/issues/9 > > Cc: Seth Forshee <sforshee@xxxxxxxxxxxxxxxx> > > Cc: Amir Goldstein <amir73il@xxxxxxxxx> > > Cc: Vivek Goyal <vgoyal@xxxxxxxxxx> > > Cc: Christoph Hellwig <hch@xxxxxx> > > Cc: Aleksa Sarai <cyphar@xxxxxxxxxx> > > Cc: Miklos Szeredi <mszeredi@xxxxxxxxxx> > > Cc: linux-unionfs@xxxxxxxxxxxxxxx > > Signed-off-by: Christian Brauner (Microsoft) <brauner@xxxxxxxxxx> > > --- > > Hey Miklos, > > > > I describe in detail how I'm going to fix this with the series I intend > > to get ready for the next merge window in the commit message. > > > > I would just turn off POSIX ACLs until then. Would you be ok with that > > and route this to Linus this or next week? > > Sounds good. > > However I don't think clearing SB_POSIXACL will do that. > > Maybe denying the operation in ovl_posix_acl_xattr_{get,set}() is the > right way to achieve the above? Hm, removing SB_POSIXACL in my tests fixed that completely. But we can add an additional check: if (!IS_POSIXACL(inode)) return -EOPNOTSUPP; to both helpers additionally? Can you do that when you apply or do you want me to send a version with that added? Thanks! Christian