Re: [RFC RESEND] xfs: fix up non-directory creation in SGID directories when umask S_IXGRP

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 22 Mar 2022 20:23:22 +1100

On Tue, Mar 22, 2022 at 07:43:20PM +1100, Dave Chinner wrote:
> On Tue, Mar 22, 2022 at 04:04:17PM +0800, Yang Xu wrote:
> > Petr reported a problem that S_ISGID bit was not clean when testing ltp
> > case create09[1] by using umask(077).
> 
> Ok. So what is the failure message from the test?
> 
> When did the test start failing? Is this a recent failure or
> something that has been around for years? If it's recent, what
> commit broke it?

Ok, I went and looked at the test, and it confirmed my suspicion.  I
can't find the patch that introduced this change on lore.kernel.org.
Looks like one of those silent security fixes that nobody gets told
about, gets no real review, has no test cases written for it, etc.

And nobody wrote a test for until August 2021 and that's when people
started to notice broken filesystems.

This is the commit that failed to fix several filesystems:

commit 0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date:   Tue Jul 3 17:10:19 2018 -0700

    Fix up non-directory creation in SGID directories
    
    sgid directories have special semantics, making newly created files in
    the directory belong to the group of the directory, and newly created
    subdirectories will also become sgid.  This is historically used for
    group-shared directories.
    
    But group directories writable by non-group members should not imply
    that such non-group members can magically join the group, so make sure
    to clear the sgid bit on non-directories for non-members (but remember
    that sgid without group execute means "mandatory locking", just to
    confuse things even more).
    
    Reported-by: Jann Horn <jannh@xxxxxxxxxx>
    Cc: Andy Lutomirski <luto@xxxxxxxxxx>
    Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
    Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>

diff --git a/fs/inode.c b/fs/inode.c
index 2c300e981796..8c86c809ca17 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1999,8 +1999,14 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
        inode->i_uid = current_fsuid();
        if (dir && dir->i_mode & S_ISGID) {
                inode->i_gid = dir->i_gid;
+
+               /* Directories are special, and always inherit S_ISGID */
                if (S_ISDIR(mode))
                        mode |= S_ISGID;
+               else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
+                        !in_group_p(inode->i_gid) &&
+                        !capable_wrt_inode_uidgid(dir, CAP_FSETID))
+                       mode &= ~S_ISGID;
        } else
                inode->i_gid = current_fsgid();
        inode->i_mode = mode;

The problem is it takes away mode bits that the VFS passed to us
deep in the VFS inode initialisation done during on-disk inode
initialisation, and it's hidden well away from sight of the
filesystems.

Oh, what a mess - this in_group_p() && capable_wrt_inode_uidgid()
check is splattered all over filesystems in random places to clear
SGID bits. e.g ceph_finish_async_create() is an open coded
inode_init_owner() call. There's special case
code in fuse_set_acl() to clear SGID. There's a special case in
ovl_posix_acl_xattr_set() for ACL xattrs to clear SGID. And so on.

No consistency anywhere - shouldn't the VFS just be stripping the
SGID bit before it passes the mode down to filesystems? It has all
the info it needs - it doesn't need the filesystems to do everything
correctly with the mode and ensuring that they order things like
posix acl setup functions correctly with inode_init_owner() to strip
the SGID bit...

Just strip the SGID bit at the VFS, and then the filesystems can't
get it wrong...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx