On Tue, Mar 22, 2022 at 07:43:20PM +1100, Dave Chinner wrote: > On Tue, Mar 22, 2022 at 04:04:17PM +0800, Yang Xu wrote: > > Petr reported a problem that S_ISGID bit was not clean when testing ltp > > case create09[1] by using umask(077). > > Ok. So what is the failure message from the test? > > When did the test start failing? Is this a recent failure or > something that has been around for years? If it's recent, what > commit broke it? Ok, I went and looked at the test, and it confirmed my suspicion. I can't find the patch that introduced this change on lore.kernel.org. Looks like one of those silent security fixes that nobody gets told about, gets no real review, has no test cases written for it, etc. And nobody wrote a test for until August 2021 and that's when people started to notice broken filesystems. This is the commit that failed to fix several filesystems: commit 0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7 Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> Date: Tue Jul 3 17:10:19 2018 -0700 Fix up non-directory creation in SGID directories sgid directories have special semantics, making newly created files in the directory belong to the group of the directory, and newly created subdirectories will also become sgid. This is historically used for group-shared directories. But group directories writable by non-group members should not imply that such non-group members can magically join the group, so make sure to clear the sgid bit on non-directories for non-members (but remember that sgid without group execute means "mandatory locking", just to confuse things even more). Reported-by: Jann Horn <jannh@xxxxxxxxxx> Cc: Andy Lutomirski <luto@xxxxxxxxxx> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> diff --git a/fs/inode.c b/fs/inode.c index 2c300e981796..8c86c809ca17 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1999,8 +1999,14 @@ void inode_init_owner(struct inode *inode, const struct inode *dir, inode->i_uid = current_fsuid(); if (dir && dir->i_mode & S_ISGID) { inode->i_gid = dir->i_gid; + + /* Directories are special, and always inherit S_ISGID */ if (S_ISDIR(mode)) mode |= S_ISGID; + else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) && + !in_group_p(inode->i_gid) && + !capable_wrt_inode_uidgid(dir, CAP_FSETID)) + mode &= ~S_ISGID; } else inode->i_gid = current_fsgid(); inode->i_mode = mode; The problem is it takes away mode bits that the VFS passed to us deep in the VFS inode initialisation done during on-disk inode initialisation, and it's hidden well away from sight of the filesystems. Oh, what a mess - this in_group_p() && capable_wrt_inode_uidgid() check is splattered all over filesystems in random places to clear SGID bits. e.g ceph_finish_async_create() is an open coded inode_init_owner() call. There's special case code in fuse_set_acl() to clear SGID. There's a special case in ovl_posix_acl_xattr_set() for ACL xattrs to clear SGID. And so on. No consistency anywhere - shouldn't the VFS just be stripping the SGID bit before it passes the mode down to filesystems? It has all the info it needs - it doesn't need the filesystems to do everything correctly with the mode and ensuring that they order things like posix acl setup functions correctly with inode_init_owner() to strip the SGID bit... Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx