Re: [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem

"xuyang2018.jy@xxxxxxxxxxx" <xuyang2018.jy@xxxxxxxxxxx> · Thu, 31 Mar 2022 09:30:17 +0000



on 2022/3/31 0:44, Darrick J. Wong wrote:
> On Wed, Mar 30, 2022 at 12:44:19PM +0200, Christian Brauner wrote:
>> On Wed, Mar 30, 2022 at 09:10:59AM +1100, Dave Chinner wrote:
>>> On Tue, Mar 29, 2022 at 07:12:11AM -0400, Jeff Layton wrote:
>>>> On Mon, 2022-03-28 at 17:56 +0800, Yang Xu wrote:
>>>>> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
>>>>> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
>>>>> firstly, then posxi acl setup, but xfs uses the contrary order. It will affect
>>>>> S_ISGID clear especially umask with S_IXGRP.
>>>>>
>>>>> Vfs has all the info it needs - it doesn't need the filesystems to do everything
>>>>> correctly with the mode and ensuring that they order things like posix acl setup
>>>>> functions correctly with inode_init_owner() to strip the SGID bit.
>>>>>
>>>>> Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong.
>>>>>
>>>>> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
>>>>> this api may change mode by using umask but S_ISGID clear isn't related to
>>>>> SB_POSIXACL flag.
>>>>>
>>>>> Suggested-by: Dave Chinner<david@xxxxxxxxxxxxx>
>>>>> Signed-off-by: Yang Xu<xuyang2018.jy@xxxxxxxxxxx>
>>>>> ---
>>>>>   fs/inode.c | 4 ----
>>>>>   fs/namei.c | 7 +++++--
>>>>>   2 files changed, 5 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/fs/inode.c b/fs/inode.c
>>>>> index 1f964e7f9698..a2dd71c2437e 100644
>>>>> --- a/fs/inode.c
>>>>> +++ b/fs/inode.c
>>>>> @@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>>>>   		/* Directories are special, and always inherit S_ISGID */
>>>>>   		if (S_ISDIR(mode))
>>>>>   			mode |= S_ISGID;
>>>>> -		else if ((mode&  (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&&
>>>>> -			 !in_group_p(i_gid_into_mnt(mnt_userns, dir))&&
>>>>> -			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
>>>>> -			mode&= ~S_ISGID;
>>>>>   	} else
>>>>>   		inode_fsgid_set(inode, mnt_userns);
>>>>>   	inode->i_mode = mode;
>>>>> diff --git a/fs/namei.c b/fs/namei.c
>>>>> index 3f1829b3ab5b..e68a99e0ac96 100644
>>>>> --- a/fs/namei.c
>>>>> +++ b/fs/namei.c
>>>>> @@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>>>>>   	if (open_flag&  O_CREAT) {
>>>>>   		if (open_flag&  O_EXCL)
>>>>>   			open_flag&= ~O_TRUNC;
>>>>> +		inode_sgid_strip(mnt_userns, dir->d_inode,&mode);
>>>>>   		if (!IS_POSIXACL(dir->d_inode))
>>>>>   			mode&= ~current_umask();
>>>>>   		if (likely(got_write))
>>>>> @@ -3521,6 +3522,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>>>>>   	child = d_alloc(dentry,&slash_name);
>>>>>   	if (unlikely(!child))
>>>>>   		goto out_err;
>>>>> +	inode_sgid_strip(mnt_userns, dir,&mode);
>>>>> +
>>>>>   	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>>>>>   	if (error)
>>>>>   		goto out_err;
>>>>> @@ -3849,14 +3852,14 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>>>>>   	error = PTR_ERR(dentry);
>>>>>   	if (IS_ERR(dentry))
>>>>>   		goto out1;
>>>>> -
>>>>> +	mnt_userns = mnt_user_ns(path.mnt);
>>>>> +	inode_sgid_strip(mnt_userns, path.dentry->d_inode,&mode);
>>>>>   	if (!IS_POSIXACL(path.dentry->d_inode))
>>>>>   		mode&= ~current_umask();
>>>>>   	error = security_path_mknod(&path, dentry, mode, dev);
>>>>>   	if (error)
>>>>>   		goto out2;
>>>>>
>>>>> -	mnt_userns = mnt_user_ns(path.mnt);
>>>>>   	switch (mode&  S_IFMT) {
>>>>>   		case 0: case S_IFREG:
>>>>>   			error = vfs_create(mnt_userns, path.dentry->d_inode,
>>>>
>>>> I haven't gone over this in detail, but have you tested this with NFS at
>>>> all?
>>>>
>>>> IIRC, NFS has to leave setuid/gid stripping to the server, so I wonder
>>>> if this may end up running afoul of that by forcing the client to try
>>>> and strip these bits.
>>>
>>> All it means is that the mode passed to the NFS server for the
>>> create already has the SGID bit stripped from it. It means the
>>> client is no longer reliant on the server behaving correctly to
>>> close this security hole.
>>>
>>> That is, failing to strip the SGID bit appropriately in the local
>>> context is a security issue. Hence local machine security requires
>>> that the NFS client should try to strip the SGID to defend against
>>> buggy/unfixed servers that fail to strip it appropriately and
>>> thereby continute to expose the local machine to this SGID security
>>> issue.
>>>
>>> That's the problem here - the SGID stripping in inode_init_owner()
>>> is not documented, wasn't reviewed, doesn't work correctly
>>> across all filesystems and leaves nasty security landmines when the VFS
>>> create mode and the stripped inode mode differ.
>>>
>>> Various filesystems have workarounds, partial fixes or no fixes for
>>> these issues and landmines. Hence we have a situation where we are
>>> playing whack-a-mole to discover and slap band-aids over all the
>>> places that inode_init_owner() based stripping does not work
>>> correctly.
>>>
>>> In XFS, this meant the problem was not orginally fixed by the
>>> silent, unreviewed change to inode_init_owner() in 2018
>>> because it didn't call inode_init_owner() at all. So 4 years after
>>> the bug was "fixed" and the CVE released, we are still exposed to
>>> the bug because *no filesystem people knew about it* and *nobody wrote a
>>> regression test* to check that the probelm was fixed and stayed
>>> fixed.
>>>
>>> And now that XFS does call inode_init_owner(), we've subsequently
>>> discovered that XFS still fail when default acls are enabled because
>>> we create the ACL from the mode passed from the VFS, not the
>>> stripped mode that results from inode_init_owner() being called.
>>>
>>> See what I mean about landmines?
>>>
>>> The fact is this: regardless of which filesystem is in use, failure
>>> to strip the SGID correctly is considered a security failure that
>>> needs to be fixed. The current VFS infrastructure requires the
>>> filesystem to do everything right and not step on any landmines to
>>> strip the SGID bit, when in fact it can easily be done at the VFS
>>> and the filesystems then don't even need to be aware that the SGID
>>> needs to be (or has been stripped) by the operation the user asked
>>> to be done.
>>>
>>> We need the architecture to be *secure by design*, not tacked onto
>>> the side like it is now.  We need to stop trying to dance around
>>> these landmines - it is *not working* and we are blowing our own
>>> feet off repeatedly. This hurts a lot (especially in distro land)
>>> so we need to take the responsibility for stripping SGID properly
>>> away from the filesystems and put it where it belongs: in the VFS.
>>
>> I agree. When I added tests for set*id stripping to xfstests for the
>> sake of getting complete vfs coverage of idmapped mounts in generic/633
>> I immediately found bugs. Once I made the testsuite useable by all
>> filesystems we started seeing more.
>>
>> I think we should add and use the new proposed stripping helper in the
>> vfs - albeit with a slightly changed api and also use it in
>> inode_init_owner(). While it is a delicate change in the worst case we
>> end up removing additional privileges that's an acceptable regression
>> risk to take.
>
> And if it's not too much trouble, can we add an fstest to encode our
> current expectations about how setgid inheritance works?  I would really
> like to reduce the need for historic setgid behavior spelunking. ;)
I have sent two patches to increase the idmapped mounts coverage for 
S_ISGID in xfstests.

Best Regards
Yang Xu

>
> --D