Re: [PATCH] ovl: turn of SB_POSIXACL with idmapped layers temporarily

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jul 07, 2022 at 09:58:47AM +0200, Miklos Szeredi wrote:
> On Wed, 6 Jul 2022 at 15:59, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >
> > This cycle we added support for mounting overlayfs on top of idmapped mounts.
> > Recently I've started looking into potential corner cases when trying to add
> > additional tests and I noticed that reporting for POSIX ACLs is currently wrong
> > when using idmapped layers with overlayfs mounted on top of it.
> >
> > I have sent out an patch that fixes this and makes POSIX ACLs work correctly
> > but the patch is a bit bigger and we're already at -rc5 so I recommend we
> > simply don't raise SB_POSIXACL when idmapped layers are used. Then we can fix
> > the VFS part described below for the next merge window so we can have good
> > exposure in -next.
> >
> > I'm going to give a rather detailed explanation to both the origin of the
> > problem and mention the solution so people know what's going on.
> >
> > Let's assume the user creates the following directory layout and they have a
> > rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would
> > expect files on your host system to be owned. For example, ~/.bashrc for your
> > regular user would be owned by 1000:1000 and /root/.bashrc would be owned by
> > 0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs
> > filesystem.
> >
> > The user chooses to set POSIX ACLs using the setfacl binary granting the user
> > with uid 4 read, write, and execute permissions for their .bashrc file:
> >
> >         setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
> >
> > Now they to expose the whole rootfs to a container using an idmapped mount. So
> > they first create:
> >
> >         mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
> >         mkdir -pv /vol/contpool/ctrover/{over,work}
> >         chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
> >
> > The user now creates an idmapped mount for the rootfs:
> >
> >         mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
> >                                       /var/lib/lxc/c2/rootfs \
> >                                       /vol/contpool/lowermap
> >
> > This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
> > which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at
> > /vol/contpool/lowermap/home/ubuntu/.bashrc.
> >
> > Assume the user wants to expose these idmapped mounts through an overlayfs
> > mount to a container.
> >
> >        mount -t overlay overlay                      \
> >              -o lowerdir=/vol/contpool/lowermap,     \
> >                 upperdir=/vol/contpool/overmap/over, \
> >                 workdir=/vol/contpool/overmap/work   \
> >              /vol/contpool/merge
> >
> > The user can do this in two ways:
> >
> > (1) Mount overlayfs in the initial user namespace and expose it to the
> >     container.
> > (2) Mount overlayfs on top of the idmapped mounts inside of the container's
> >     user namespace.
> >
> > Let's assume the user chooses the (1) option and mounts overlayfs on the host
> > and then changes into a container which uses the idmapping 0:10000000:65536
> > which is the same used for the two idmapped mounts.
> >
> > Now the user tries to retrieve the POSIX ACLs using the getfacl command
> >
> >         getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc
> >
> > and to their surprise they see:
> >
> >         # file: vol/contpool/merge/home/ubuntu/.bashrc
> >         # owner: 1000
> >         # group: 1000
> >         user::rw-
> >         user:4294967295:rwx
> >         group::r--
> >         mask::rwx
> >         other::r--
> >
> > indicating the the uid wasn't correctly translated according to the idmapped
> > mount. The problem is how we currently translate POSIX ACLs. Let's inspect the
> > callchain in this example:
> >
> >         idmapped mount /vol/contpool/merge:      0:10000000:65536
> >         caller's idmapping:                      0:10000000:65536
> >         overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
> >
> >         sys_getxattr()
> >         -> path_getxattr()
> >            -> getxattr()
> >               -> do_getxattr()
> >                   |> vfs_getxattr()
> >                   |  -> __vfs_getxattr()
> >                   |     -> handler->get == ovl_posix_acl_xattr_get()
> >                   |        -> ovl_xattr_get()
> >                   |           -> vfs_getxattr()
> >                   |              -> __vfs_getxattr()
> >                   |                 -> handler->get() /* lower filesystem callback */
> >                   |> posix_acl_fix_xattr_to_user()
> >                      {
> >                               4 = make_kuid(&init_user_ns, 4);
> >                               4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
> >                               /* FAILURE */
> >                              -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
> >                      }
> >
> > If the user chooses to use option (2) and mounts overlayfs on top of idmapped
> > mounts inside the container things don't look that much better:
> >
> >         idmapped mount /vol/contpool/merge:      0:10000000:65536
> >         caller's idmapping:                      0:10000000:65536
> >         overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
> >
> >         sys_getxattr()
> >         -> path_getxattr()
> >            -> getxattr()
> >               -> do_getxattr()
> >                   |> vfs_getxattr()
> >                   |  -> __vfs_getxattr()
> >                   |     -> handler->get == ovl_posix_acl_xattr_get()
> >                   |        -> ovl_xattr_get()
> >                   |           -> vfs_getxattr()
> >                   |              -> __vfs_getxattr()
> >                   |                 -> handler->get() /* lower filesystem callback */
> >                   |> posix_acl_fix_xattr_to_user()
> >                      {
> >                               4 = make_kuid(&init_user_ns, 4);
> >                               4 = mapped_kuid_fs(&init_user_ns, 4);
> >                               /* FAILURE */
> >                              -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
> >                      }
> >
> > As is easily seen the problem arises because the idmapping of the lower mount
> > isn't taken into account as all of this happens in do_gexattr(). But
> > do_getxattr() is always called on an overlayfs mount and inode and thus cannot
> > possible take the idmapping of the lower layers into account.
> >
> > This problem is similar for fscaps but there the translation happens as part of
> > vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain:
> >
> >         setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
> >
> > The expected outcome here is that we'll receive the cap_net_raw capability as
> > we are able to map the uid associated with the fscap to 0 within our container.
> > IOW, we want to see 0 as the result of the idmapping translations.
> >
> > If the user chooses option (1) we get the following callchain for fscaps:
> >
> >         idmapped mount /vol/contpool/merge:      0:10000000:65536
> >         caller's idmapping:                      0:10000000:65536
> >         overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
> >
> >         sys_getxattr()
> >         -> path_getxattr()
> >            -> getxattr()
> >               -> do_getxattr()
> >                    -> vfs_getxattr()
> >                       -> xattr_getsecurity()
> >                          -> security_inode_getsecurity()                                       ________________________________
> >                             -> cap_inode_getsecurity()                                         |                              |
> >                                {                                                               V                              |
> >                                         10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
> >                                         10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
> >                                                /* Expected result is 0 and thus that we own the fscap. */                     |
> >                                                0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
> >                                }                                                                                              |
> >                                -> vfs_getxattr_alloc()                                                                        |
> >                                   -> handler->get == ovl_other_xattr_get()                                                    |
> >                                      -> vfs_getxattr()                                                                        |
> >                                         -> xattr_getsecurity()                                                                |
> >                                            -> security_inode_getsecurity()                                                    |
> >                                               -> cap_inode_getsecurity()                                                      |
> >                                                  {                                                                            |
> >                                                                 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
> >                                                          10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
> >                                                          10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
> >                                                          |____________________________________________________________________|
> >                                                  }
> >                                                  -> vfs_getxattr_alloc()
> >                                                     -> handler->get == /* lower filesystem callback */
> >
> > And if the user chooses option (2) we get:
> >
> >         idmapped mount /vol/contpool/merge:      0:10000000:65536
> >         caller's idmapping:                      0:10000000:65536
> >         overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
> >
> >         sys_getxattr()
> >         -> path_getxattr()
> >            -> getxattr()
> >               -> do_getxattr()
> >                    -> vfs_getxattr()
> >                       -> xattr_getsecurity()
> >                          -> security_inode_getsecurity()                                                _______________________________
> >                             -> cap_inode_getsecurity()                                                  |                             |
> >                                {                                                                        V                             |
> >                                        10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
> >                                        10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
> >                                                /* Expected result is 0 and thus that we own the fscap. */                             |
> >                                               0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
> >                                }                                                                                                      |
> >                                -> vfs_getxattr_alloc()                                                                                |
> >                                   -> handler->get == ovl_other_xattr_get()                                                            |
> >                                     |-> vfs_getxattr()                                                                                |
> >                                         -> xattr_getsecurity()                                                                        |
> >                                            -> security_inode_getsecurity()                                                            |
> >                                               -> cap_inode_getsecurity()                                                              |
> >                                                  {                                                                                    |
> >                                                                  0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
> >                                                           10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
> >                                                                  0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
> >                                                                  |____________________________________________________________________|
> >                                                  }
> >                                                  -> vfs_getxattr_alloc()
> >                                                     -> handler->get == /* lower filesystem callback */
> >
> > We can see how the translation happens correctly in those cases as the
> > conversion happens within the vfs_getxattr() helper.
> >
> > For POSIX ACLs we need to do something similar. However, in contrast to fscaps
> > we cannot apply the fix directly to the kernel internal posix acl data
> > structure as this would alter the cached values and would also require a rework
> > of how we currently deal with POSIX ACLs in general which almost never take the
> > filesystem idmapping into account (the noteable exception being FUSE but even
> > there the implementation is special) and instead retrieve the raw values based
> > on the initial idmapping.
> >
> > The correct values are then generated right before returning to userspace. The
> > fix for this is to move taking the mount's idmapping into account directly in
> > vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user().
> >
> > To this end we simply move the idmapped mount translation into a separate step
> > performed in vfs_{g,s}etxattr() instead of in
> > posix_acl_fix_xattr_{from,to}_user().
> >
> > To see how this fixes things let's go back to the original example. Assume the
> > user chose option (1) and mounted overlayfs on top of idmapped mounts on the
> > host:
> >
> >         idmapped mount /vol/contpool/merge:      0:10000000:65536
> >         caller's idmapping:                      0:10000000:65536
> >         overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
> >
> >         sys_getxattr()
> >         -> path_getxattr()
> >            -> getxattr()
> >               -> do_getxattr()
> >                   |> vfs_getxattr()
> >                   |  |> __vfs_getxattr()
> >                   |  |  -> handler->get == ovl_posix_acl_xattr_get()
> >                   |  |     -> ovl_xattr_get()
> >                   |  |        -> vfs_getxattr()
> >                   |  |           |> __vfs_getxattr()
> >                   |  |           |  -> handler->get() /* lower filesystem callback */
> >                   |  |           |> posix_acl_getxattr_idmapped_mnt()
> >                   |  |              {
> >                   |  |                              4 = make_kuid(&init_user_ns, 4);
> >                   |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
> >                   |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
> >                   |  |                       |_______________________
> >                   |  |              }                               |
> >                   |  |                                              |
> >                   |  |> posix_acl_getxattr_idmapped_mnt()           |
> >                   |     {                                           |
> >                   |                                                 V
> >                   |             10000004 = make_kuid(&init_user_ns, 10000004);
> >                   |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
> >                   |             10000004 = from_kuid(&init_user_ns, 10000004);
> >                   |     }       |_________________________________________________
> >                   |                                                              |
> >                   |                                                              |
> >                   |> posix_acl_fix_xattr_to_user()                               |
> >                      {                                                           V
> >                                  10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
> >                                         /* SUCCESS */
> >                                         4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
> >                      }
> >
> > And similarly if the user chooses option (1) and mounted overayfs on top of
> > idmapped mounts inside the container:
> >
> >         idmapped mount /vol/contpool/merge:      0:10000000:65536
> >         caller's idmapping:                      0:10000000:65536
> >         overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
> >
> >         sys_getxattr()
> >         -> path_getxattr()
> >            -> getxattr()
> >               -> do_getxattr()
> >                   |> vfs_getxattr()
> >                   |  |> __vfs_getxattr()
> >                   |  |  -> handler->get == ovl_posix_acl_xattr_get()
> >                   |  |     -> ovl_xattr_get()
> >                   |  |        -> vfs_getxattr()
> >                   |  |           |> __vfs_getxattr()
> >                   |  |           |  -> handler->get() /* lower filesystem callback */
> >                   |  |           |> posix_acl_getxattr_idmapped_mnt()
> >                   |  |              {
> >                   |  |                              4 = make_kuid(&init_user_ns, 4);
> >                   |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
> >                   |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
> >                   |  |                       |_______________________
> >                   |  |              }                               |
> >                   |  |                                              |
> >                   |  |> posix_acl_getxattr_idmapped_mnt()           |
> >                   |     {                                           V
> >                   |             10000004 = make_kuid(&init_user_ns, 10000004);
> >                   |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
> >                   |             10000004 = from_kuid(0(&init_user_ns, 10000004);
> >                   |             |_________________________________________________
> >                   |     }                                                        |
> >                   |                                                              |
> >                   |> posix_acl_fix_xattr_to_user()                               |
> >                      {                                                           V
> >                                  10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
> >                                         /* SUCCESS */
> >                                         4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
> >                      }
> >
> > The last remaining problem we need to fix here is ovl_get_acl(). During
> > ovl_permission() overlayfs will call:
> >
> >         ovl_permission()
> >         -> generic_permission()
> >            -> acl_permission_check()
> >               -> check_acl()
> >                  -> get_acl()
> >                     -> inode->i_op->get_acl() == ovl_get_acl()
> >                         > get_acl() /* on the underlying filesystem)
> >                           ->inode->i_op->get_acl() == /*lower filesystem callback */
> >                  -> posix_acl_permission()
> >
> > passing through the get_acl request to the underlying filesystem. This will
> > retrieve the acls stored in the lower filesystem without taking the idmapping
> > of the underlying mount into account as this would mean altering the cached
> > values for the lower filesystem. The simple solution is to have ovl_get_acl()
> > simply duplicate the ACLs, update the values according to the idmapped mount
> > and return it to acl_permission_check() so it can be used in
> > posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be released
> > right after.
> >
> > Link: https://github.com/brauner/mount-idmapped/issues/9
> > Cc: Seth Forshee <sforshee@xxxxxxxxxxxxxxxx>
> > Cc: Amir Goldstein <amir73il@xxxxxxxxx>
> > Cc: Vivek Goyal <vgoyal@xxxxxxxxxx>
> > Cc: Christoph Hellwig <hch@xxxxxx>
> > Cc: Aleksa Sarai <cyphar@xxxxxxxxxx>
> > Cc: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> > Cc: linux-unionfs@xxxxxxxxxxxxxxx
> > Signed-off-by: Christian Brauner (Microsoft) <brauner@xxxxxxxxxx>
> > ---
> > Hey Miklos,
> >
> > I describe in detail how I'm going to fix this with the series I intend
> > to get ready for the next merge window in the commit message.
> >
> > I would just turn off POSIX ACLs until then. Would you be ok with that
> > and route this to Linus this or next week?
> 
> Sounds good.
> 
> However I don't think clearing SB_POSIXACL will do that.
> 
> Maybe denying the operation in ovl_posix_acl_xattr_{get,set}() is the
> right way to achieve the above?

Hm, removing SB_POSIXACL in my tests fixed that completely. But we can
add an additional check:

	if (!IS_POSIXACL(inode))
		return -EOPNOTSUPP;

to both helpers additionally? Can you do that when you apply or do you
want me to send a version with that added?

Thanks!
Christian



[Index of Archives]     [Linux Filesystems Devel]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux