on 2022/3/22 17:23, Dave Chinner wrote: > On Tue, Mar 22, 2022 at 07:43:20PM +1100, Dave Chinner wrote: >> On Tue, Mar 22, 2022 at 04:04:17PM +0800, Yang Xu wrote: >>> Petr reported a problem that S_ISGID bit was not clean when testing ltp >>> case create09[1] by using umask(077). >> >> Ok. So what is the failure message from the test? >> >> When did the test start failing? Is this a recent failure or >> something that has been around for years? If it's recent, what >> commit broke it? > > Ok, I went and looked at the test, and it confirmed my suspicion. I > can't find the patch that introduced this change on lore.kernel.org. > Looks like one of those silent security fixes that nobody gets told > about, gets no real review, has no test cases written for it, etc. > > And nobody wrote a test for until August 2021 and that's when people > started to notice broken filesystems. > > This is the commit that failed to fix several filesystems: > > commit 0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7 > Author: Linus Torvalds<torvalds@xxxxxxxxxxxxxxxxxxxx> > Date: Tue Jul 3 17:10:19 2018 -0700 > > Fix up non-directory creation in SGID directories > > sgid directories have special semantics, making newly created files in > the directory belong to the group of the directory, and newly created > subdirectories will also become sgid. This is historically used for > group-shared directories. > > But group directories writable by non-group members should not imply > that such non-group members can magically join the group, so make sure > to clear the sgid bit on non-directories for non-members (but remember > that sgid without group execute means "mandatory locking", just to > confuse things even more). > > Reported-by: Jann Horn<jannh@xxxxxxxxxx> > Cc: Andy Lutomirski<luto@xxxxxxxxxx> > Cc: Al Viro<viro@xxxxxxxxxxxxxxxxxx> > Signed-off-by: Linus Torvalds<torvalds@xxxxxxxxxxxxxxxxxxxx> > > diff --git a/fs/inode.c b/fs/inode.c > index 2c300e981796..8c86c809ca17 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1999,8 +1999,14 @@ void inode_init_owner(struct inode *inode, const struct inode *dir, > inode->i_uid = current_fsuid(); > if (dir&& dir->i_mode& S_ISGID) { > inode->i_gid = dir->i_gid; > + > + /* Directories are special, and always inherit S_ISGID */ > if (S_ISDIR(mode)) > mode |= S_ISGID; > + else if ((mode& (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&& > + !in_group_p(inode->i_gid)&& > + !capable_wrt_inode_uidgid(dir, CAP_FSETID)) > + mode&= ~S_ISGID; > } else > inode->i_gid = current_fsgid(); > inode->i_mode = mode; > > The problem is it takes away mode bits that the VFS passed to us > deep in the VFS inode initialisation done during on-disk inode > initialisation, and it's hidden well away from sight of the > filesystems. > > Oh, what a mess - this in_group_p()&& capable_wrt_inode_uidgid() > check is splattered all over filesystems in random places to clear > SGID bits. e.g ceph_finish_async_create() is an open coded > inode_init_owner() call. There's special case > code in fuse_set_acl() to clear SGID. There's a special case in > ovl_posix_acl_xattr_set() for ACL xattrs to clear SGID. And so on. > > No consistency anywhere - shouldn't the VFS just be stripping the > SGID bit before it passes the mode down to filesystems? It has all > the info it needs - it doesn't need the filesystems to do everything > correctly with the mode and ensuring that they order things like > posix acl setup functions correctly with inode_init_owner() to strip > the SGID bit... > > Just strip the SGID bit at the VFS, and then the filesystems can't > get it wrong... Thanks for your reply. Sound reasonable, but I am not sure I have the ability to do this. Will try ... Best Regards Yang Xu > > Cheers, > > Dave.