Re: [PATCH 2/2] ovl: remove privs in ovl_fallocate()

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 4 Oct 2022 16:12:59 +0200

On Tue, Oct 04, 2022 at 05:01:06PM +0300, Amir Goldstein wrote:
> On Tue, Oct 4, 2022 at 1:59 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >
> > On Mon, Oct 03, 2022 at 03:30:40PM +0300, Amir Goldstein wrote:
> > > Underlying fs doesn't remove privs because fallocate is called with
> > > privileged mounter credentials.
> > >
> > > This fixes some failure in fstests generic/683..687.
> > >
> > > Fixes: aab8848cee5e ("ovl: add ovl_fallocate()")
> > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > > ---
> > >  fs/overlayfs/file.c | 12 +++++++++++-
> > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > index c8308da8909a..e90ac5376456 100644
> > > --- a/fs/overlayfs/file.c
> > > +++ b/fs/overlayfs/file.c
> > > @@ -517,9 +517,16 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
> > >       const struct cred *old_cred;
> > >       int ret;
> > >
> > > +     inode_lock(inode);
> > > +     /* Update mode */
> > > +     ovl_copyattr(inode);
> > > +     ret = file_remove_privs(file);
> >
> > First, thank you for picking this up!
> >
> > Let me analyze generic/683 failure of Test1 to see why you still see
> > failures in this test:
> >
> > echo "Test 1 - qa_user, non-exec file $verb"
> > setup_testfile
> > chmod a+rws $junk_file
> > commit_and_check "$qa_user" "$verb" 64k 64k
> >
> > So this creates a file with 6666 permissions. While the file has the
> > S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. This is
> > important in a little bit.
> >
> > On a regular filesystem like xfs what will happen is:
> >
> > sys_fallocate()
> > -> vfs_fallocate()
> >    -> xfs_file_fallocate()
> >       -> file_modified()
> >          -> __file_remove_privs()
> >             -> dentry_needs_remove_privs()
> >                -> should_remove_suid()
> >             -> __remove_privs()
> >                newattrs.ia_valid = ATTR_FORCE | kill;
> >                -> notify_change()
> >
> > In should_remove_suid() we can see that ATTR_KILL_SUID is raised
> > unconditionally because the file in the test has S_ISUID set.
> >
> > But we also see that ATTR_KILL_SGID won't be set because while the file
> > is S_ISGID it is not S_IXGRP (see above) which is a condition for
> > ATTR_KILL_SGID being raised.
> >
> > So by the time we call notify_change() we have attr->ia_valid set to
> > ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
> > ATTR_KILL_SUID is set and does:
> >
> > ia_valid = attr->ia_valid |= ATTR_MODE
> > attr->ia_mode = (inode->i_mode & ~S_ISUID);
> >
> > which means that when we call setattr_copy() later we will definitely
> > update inode->i_mode. Note that attr->ia_mode still contain S_ISGID.
> >
> > Now we call into the filesystem's ->setattr() inode operation which will end up
> > calling setattr_copy(). Since ATTR_MODE is set we will hit:
> >
> > if (ia_valid & ATTR_MODE) {
> >         umode_t mode = attr->ia_mode;
> >         vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
> >         if (!vfsgid_in_group_p(vfsgid) &&
> >             !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
> >                 mode &= ~S_ISGID;
> >         inode->i_mode = mode;
> > }
> >
> 
> Can you think of a reason why the above should not be done
> in notify_change() before even calling to ->setattr()?
> 
> Although, it wouldn't help because ovl_setattr() does:
> 
>     if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID))
>         attr->ia_valid &= ~ATTR_MODE;
> 
> > and since the caller in the test is neither capable nor in the group of the
> > inode the S_ISGID bit is stripped.
> >
> > But now contrast this with overlayfs even after your changes. When
> > ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs()
> > and calls ovl_do_notify_change() then we are doing this under the
> > mounter's creds and so the S_ISGID bit is retained:
> >
> > sys_fallocate()
> > -> vfs_fallocate()
> >    -> ovl_fallocate()
> >       -> file_remove_privs()
> >          -> dentry_needs_remove_privs()
> >             -> should_remove_suid()
> >          -> __remove_privs()
> >             newattrs.ia_valid = attr_force | kill;
> >             -> notify_change()
> >                -> ovl_setattr()
> >                   // TAKE ON MOUNTER'S CREDS
> >                   -> ovl_do_notify_change()
> >                   // GIVE UP MOUNTER'S CREDS
> >      // TAKE ON MOUNTER'S CREDS
> >      -> vfs_fallocate()
> >         -> xfs_file_fallocate()
> >            -> file_modified()
> >               -> __file_remove_privs()
> >                  -> dentry_needs_remove_privs()
> >                     -> should_remove_suid()
> >                  -> __remove_privs()
> >                     newattrs.ia_valid = attr_force | kill;
> >                     -> notify_change()
> 
> The model in overlayfs is that security is checked twice
> once on overlay inode with caller creds and once again
> on xfs inode with mounter creds. Either of these checks
> could result in clearing SUID/SGID bits.

Yep.

> 
> In the call stack above, the outer should_remove_suid()
> with caller creds sets ATTR_KILL_SUID and then the outer
> notify_change() clears SUID and sets ATTR_MODE,

Yes.

> but ovl_setattr() clears ATTR_MODE and then the inner
> notify_change() re-clears SUID and sets ATTR_MODE again.

Yes.

> 
> If the outer notify_change() would have checked the in_group_p()
> condition, clear SGID and set a flag ATTR_KILL_SGID_FORCE
> then the inner notify_change() would see this flag and re-clear
> SGID bit, just the same as it does with SUID bit in the stack stace
> above.
> 
> Is this making any sense?

What I kept thinking was sm along the lines of:

diff --git a/fs/inode.c b/fs/inode.c
index ba1de23c13c1..e62a564201b7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1968,8 +1968,12 @@ int should_remove_suid(struct dentry *dentry)
         * sgid without any exec bits is just a mandatory locking mark; leave
         * it alone.  If some exec bits are set, it's a real sgid; kill it.
         */
-       if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
-               kill |= ATTR_KILL_SGID;
+       if (unlikely(mode & S_ISGID)) {
+               if ((mode & S_IXGRP) ||
+                   (!vfsgid_in_group_p(vfsgid) &&
+                    !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)))
+                       kill |= ATTR_KILL_SGID;
+       }

        if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
                return kill;

mandatory locks have been removed as well so that remark seems pointless
as well?