On Wed, Mar 30, 2022 at 12:44:19PM +0200, Christian Brauner wrote: > On Wed, Mar 30, 2022 at 09:10:59AM +1100, Dave Chinner wrote: > > On Tue, Mar 29, 2022 at 07:12:11AM -0400, Jeff Layton wrote: > > > On Mon, 2022-03-28 at 17:56 +0800, Yang Xu wrote: > > > > Currently, vfs only passes mode argument to filesystem, then use inode_init_owner() > > > > to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner > > > > firstly, then posxi acl setup, but xfs uses the contrary order. It will affect > > > > S_ISGID clear especially umask with S_IXGRP. > > > > > > > > Vfs has all the info it needs - it doesn't need the filesystems to do everything > > > > correctly with the mode and ensuring that they order things like posix acl setup > > > > functions correctly with inode_init_owner() to strip the SGID bit. > > > > > > > > Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong. > > > > > > > > Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because > > > > this api may change mode by using umask but S_ISGID clear isn't related to > > > > SB_POSIXACL flag. > > > > > > > > Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx> > > > > Signed-off-by: Yang Xu <xuyang2018.jy@xxxxxxxxxxx> > > > > --- > > > > fs/inode.c | 4 ---- > > > > fs/namei.c | 7 +++++-- > > > > 2 files changed, 5 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > > index 1f964e7f9698..a2dd71c2437e 100644 > > > > --- a/fs/inode.c > > > > +++ b/fs/inode.c > > > > @@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode, > > > > /* Directories are special, and always inherit S_ISGID */ > > > > if (S_ISDIR(mode)) > > > > mode |= S_ISGID; > > > > - else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) && > > > > - !in_group_p(i_gid_into_mnt(mnt_userns, dir)) && > > > > - !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID)) > > > > - mode &= ~S_ISGID; > > > > } else > > > > inode_fsgid_set(inode, mnt_userns); > > > > inode->i_mode = mode; > > > > diff --git a/fs/namei.c b/fs/namei.c > > > > index 3f1829b3ab5b..e68a99e0ac96 100644 > > > > --- a/fs/namei.c > > > > +++ b/fs/namei.c > > > > @@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > > > > if (open_flag & O_CREAT) { > > > > if (open_flag & O_EXCL) > > > > open_flag &= ~O_TRUNC; > > > > + inode_sgid_strip(mnt_userns, dir->d_inode, &mode); > > > > if (!IS_POSIXACL(dir->d_inode)) > > > > mode &= ~current_umask(); > > > > if (likely(got_write)) > > > > @@ -3521,6 +3522,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns, > > > > child = d_alloc(dentry, &slash_name); > > > > if (unlikely(!child)) > > > > goto out_err; > > > > + inode_sgid_strip(mnt_userns, dir, &mode); > > > > + > > > > error = dir->i_op->tmpfile(mnt_userns, dir, child, mode); > > > > if (error) > > > > goto out_err; > > > > @@ -3849,14 +3852,14 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode, > > > > error = PTR_ERR(dentry); > > > > if (IS_ERR(dentry)) > > > > goto out1; > > > > - > > > > + mnt_userns = mnt_user_ns(path.mnt); > > > > + inode_sgid_strip(mnt_userns, path.dentry->d_inode, &mode); > > > > if (!IS_POSIXACL(path.dentry->d_inode)) > > > > mode &= ~current_umask(); > > > > error = security_path_mknod(&path, dentry, mode, dev); > > > > if (error) > > > > goto out2; > > > > > > > > - mnt_userns = mnt_user_ns(path.mnt); > > > > switch (mode & S_IFMT) { > > > > case 0: case S_IFREG: > > > > error = vfs_create(mnt_userns, path.dentry->d_inode, > > > > > > I haven't gone over this in detail, but have you tested this with NFS at > > > all? > > > > > > IIRC, NFS has to leave setuid/gid stripping to the server, so I wonder > > > if this may end up running afoul of that by forcing the client to try > > > and strip these bits. > > > > All it means is that the mode passed to the NFS server for the > > create already has the SGID bit stripped from it. It means the > > client is no longer reliant on the server behaving correctly to > > close this security hole. > > > > That is, failing to strip the SGID bit appropriately in the local > > context is a security issue. Hence local machine security requires > > that the NFS client should try to strip the SGID to defend against > > buggy/unfixed servers that fail to strip it appropriately and > > thereby continute to expose the local machine to this SGID security > > issue. > > > > That's the problem here - the SGID stripping in inode_init_owner() > > is not documented, wasn't reviewed, doesn't work correctly > > across all filesystems and leaves nasty security landmines when the VFS > > create mode and the stripped inode mode differ. > > > > Various filesystems have workarounds, partial fixes or no fixes for > > these issues and landmines. Hence we have a situation where we are > > playing whack-a-mole to discover and slap band-aids over all the > > places that inode_init_owner() based stripping does not work > > correctly. > > > > In XFS, this meant the problem was not orginally fixed by the > > silent, unreviewed change to inode_init_owner() in 2018 > > because it didn't call inode_init_owner() at all. So 4 years after > > the bug was "fixed" and the CVE released, we are still exposed to > > the bug because *no filesystem people knew about it* and *nobody wrote a > > regression test* to check that the probelm was fixed and stayed > > fixed. > > > > And now that XFS does call inode_init_owner(), we've subsequently > > discovered that XFS still fail when default acls are enabled because > > we create the ACL from the mode passed from the VFS, not the > > stripped mode that results from inode_init_owner() being called. > > > > See what I mean about landmines? > > > > The fact is this: regardless of which filesystem is in use, failure > > to strip the SGID correctly is considered a security failure that > > needs to be fixed. The current VFS infrastructure requires the > > filesystem to do everything right and not step on any landmines to > > strip the SGID bit, when in fact it can easily be done at the VFS > > and the filesystems then don't even need to be aware that the SGID > > needs to be (or has been stripped) by the operation the user asked > > to be done. > > > > We need the architecture to be *secure by design*, not tacked onto > > the side like it is now. We need to stop trying to dance around > > these landmines - it is *not working* and we are blowing our own > > feet off repeatedly. This hurts a lot (especially in distro land) > > so we need to take the responsibility for stripping SGID properly > > away from the filesystems and put it where it belongs: in the VFS. > > I agree. When I added tests for set*id stripping to xfstests for the > sake of getting complete vfs coverage of idmapped mounts in generic/633 > I immediately found bugs. Once I made the testsuite useable by all > filesystems we started seeing more. > > I think we should add and use the new proposed stripping helper in the > vfs - albeit with a slightly changed api and also use it in > inode_init_owner(). While it is a delicate change in the worst case we > end up removing additional privileges that's an acceptable regression > risk to take. And if it's not too much trouble, can we add an fstest to encode our current expectations about how setgid inheritance works? I would really like to reduce the need for historic setgid behavior spelunking. ;) --D