Re: [PATCH v2 2/3] vfs: strip file's S_ISGID mode on vfs instead of on underlying filesystem

Christian Brauner <brauner@xxxxxxxxxx> · Fri, 15 Apr 2022 16:03:46 +0200

On Fri, Apr 15, 2022 at 09:06:25AM +0000, xuyang2018.jy@xxxxxxxxxxx wrote:
> on 2022/4/15 12:15, Yang Xu wrote:
> > on 2022/4/14 20:45, Christian Brauner wrote:
> >> On Thu, Apr 14, 2022 at 03:57:18PM +0800, Yang Xu wrote:
> >>> Currently, vfs only passes mode argument to filesystem, then use
> >>> inode_init_owner()
> >>> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call
> >>> inode_init_owner
> >>> firstly, then posxi acl setup, but xfs uses the contrary order. It
> >>> will affect
> >>> S_ISGID clear especially we filter S_IXGRP by umask or acl.
> >>>
> >>> Regardless of which filesystem is in use, failure to strip the SGID
> >>> correctly is
> >>> considered a security failure that needs to be fixed. The current VFS
> >>> infrastructure
> >>> requires the filesystem to do everything right and not step on any
> >>> landmines to
> >>> strip the SGID bit, when in fact it can easily be done at the VFS and
> >>> the filesystems
> >>> then don't even need to be aware that the SGID needs to be (or has
> >>> been stripped) by
> >>> the operation the user asked to be done.
> >>>
> >>> Vfs has all the info it needs - it doesn't need the filesystems to do
> >>> everything
> >>> correctly with the mode and ensuring that they order things like
> >>> posix acl setup
> >>> functions correctly with inode_init_owner() to strip the SGID bit.
> >>>
> >>> Just strip the SGID bit at the VFS, and then the filesystems can't
> >>> get it wrong.
> >>>
> >>> Also, the inode_sgid_strip() api should be used before IS_POSIXACL()
> >>> because
> >>> this api may change mode.
> >>>
> >>> Only the following places use inode_init_owner
> >>> "hugetlbfs/inode.c:846: inode_init_owner(&init_user_ns, inode, dir,
> >>> mode);
> >>> nilfs2/inode.c:354: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> zonefs/super.c:1289: inode_init_owner(&init_user_ns, inode, parent,
> >>> S_IFDIR | 0555);
> >>> reiserfs/namei.c:619: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> jfs/jfs_inode.c:67: inode_init_owner(&init_user_ns, inode, parent,
> >>> mode);
> >>> f2fs/namei.c:50: inode_init_owner(mnt_userns, inode, dir, mode);
> >>> ext2/ialloc.c:549: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> overlayfs/dir.c:643: inode_init_owner(&init_user_ns, inode,
> >>> dentry->d_parent->d_inode, mode);
> >>> ufs/ialloc.c:292: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> ntfs3/inode.c:1283: inode_init_owner(mnt_userns, inode, dir, mode);
> >>> ramfs/inode.c:64: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> 9p/vfs_inode.c:263: inode_init_owner(&init_user_ns, inode, NULL, mode);
> >>> btrfs/tests/btrfs-tests.c:65: inode_init_owner(&init_user_ns, inode,
> >>> NULL, S_IFREG);
> >>> btrfs/inode.c:6215: inode_init_owner(mnt_userns, inode, dir, mode);
> >>> sysv/ialloc.c:166: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> omfs/inode.c:51: inode_init_owner(&init_user_ns, inode, NULL, mode);
> >>> ubifs/dir.c:97: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> udf/ialloc.c:108: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> ext4/ialloc.c:979: inode_init_owner(mnt_userns, inode, dir, mode);
> >>> hfsplus/inode.c:393: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> xfs/xfs_inode.c:840: inode_init_owner(mnt_userns, inode, dir, mode);
> >>> ocfs2/dlmfs/dlmfs.c:331: inode_init_owner(&init_user_ns, inode, NULL,
> >>> mode);
> >>> ocfs2/dlmfs/dlmfs.c:354: inode_init_owner(&init_user_ns, inode,
> >>> parent, mode);
> >>> ocfs2/namei.c:200: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> minix/bitmap.c:255: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> bfs/dir.c:99: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>> "
> >>
> >> For completeness sake, there's also spufs which doesn't really go
> >> through the regular VFS callpath because it has separate system calls
> >> like:
> >>
> >> SYSCALL_DEFINE4(spu_create, const char __user *, name, unsigned int,
> >> flags,
> >> umode_t, mode, int, neighbor_fd)
> >>
> >> but looking through the code it only allows the creation of
> >> directories and only
> >> allows bits in 0777.
> > IMO, this fs also doesn't use inode_init_owner, so it should be not
> > affected. We add indo_sgid_strip into vfs, IMO, it only happen new sgid
> > strip situation and doesn't happen to remove old sgid strip situation.
> > So I think it is "safe".
> >>
> >>>
> >>> They are used in filesystem init new inode function and these init
> >>> inode functions are used
> >>> by following operations:
> >>> mkdir
> >>> symlink
> >>> mknod
> >>> create
> >>> tmpfile
> >>> rename
> >>>
> >>> We don't care about mkdir because we don't strip SGID bit for
> >>> directory except fs.xfs.irix_sgid_inherit.
> >>> symlink and rename only use valid mode that doesn't have SGID bit.
> >>>
> >>> We have added inode_sgid_strip api for the remaining operations.
> >>>
> >>> In addition to the above six operations, two filesystems has a little
> >>> difference
> >>> 1) btrfs has btrfs_create_subvol_root to create new inode but used
> >>> non SGID bit mode and can ignore
> >>> 2) ocfs2 reflink function should add inode_sgid_strip api manually
> >>> because we don't add it in vfs
> >>>
> >>> Last but not least, this patch also changed grpid behaviour for
> >>> ext4/xfs because the mode passed to
> >>> them may been changed by inode_sgid_strip.
> >>
> >> I think the patch itself is useful as it would move a security sensitive
> >> operation that is currently burried in individual filesystems into the
> >> vfs layer. But it has a decent regression potential since it might trip
> >> filesystems that have so far relied on getting the S_ISGID bit with a
> >> mode argument. The example being network filesystems that Jeff brought
> >> up earlier. So this needs a lot of testing and long exposure in -next
> >> for at least one full kernel cycle imho.
> > Agreed.
> >>
> >>>
> >>> Suggested-by: Dave Chinner<david@xxxxxxxxxxxxx>
> >>> Signed-off-by: Yang Xu<xuyang2018.jy@xxxxxxxxxxx>
> >>> ---
> >>> fs/inode.c | 4 ----
> >>> fs/namei.c | 5 ++++-
> >>> fs/ocfs2/namei.c | 1 +
> >>> 3 files changed, 5 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/fs/inode.c b/fs/inode.c
> >>> index d63264998855..b08bdd73e116 100644
> >>> --- a/fs/inode.c
> >>> +++ b/fs/inode.c
> >>> @@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace
> >>> *mnt_userns, struct inode *inode,
> >>> /* Directories are special, and always inherit S_ISGID */
> >>> if (S_ISDIR(mode))
> >>> mode |= S_ISGID;
> >>> - else if ((mode& (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&&
> >>> - !in_group_p(i_gid_into_mnt(mnt_userns, dir))&&
> >>> - !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> >>> - mode&= ~S_ISGID;
> >>> } else
> >>> inode_fsgid_set(inode, mnt_userns);
> >>> inode->i_mode = mode;
> >>> diff --git a/fs/namei.c b/fs/namei.c
> >>> index 3f1829b3ab5b..e03f7defdd30 100644
> >>> --- a/fs/namei.c
> >>> +++ b/fs/namei.c
> >>> @@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct
> >>> nameidata *nd, struct file *file,
> >>> if (open_flag& O_CREAT) {
> >>> if (open_flag& O_EXCL)
> >>> open_flag&= ~O_TRUNC;
> >>> + inode_sgid_strip(mnt_userns, dir->d_inode,&mode);
> >>> if (!IS_POSIXACL(dir->d_inode))
> >>> mode&= ~current_umask();
> >>> if (likely(got_write))
> >>> @@ -3521,6 +3522,7 @@ struct dentry *vfs_tmpfile(struct
> >>> user_namespace *mnt_userns,
> >>> child = d_alloc(dentry,&slash_name);
> >>> if (unlikely(!child))
> >>> goto out_err;
> >>> + inode_sgid_strip(mnt_userns, dir,&mode);
> >>
> >> Hm, an additional question: how is umask stripping currently handled in
> >> vfs_tmpfile()? I don't see it anywhere. That seems like a bug?
> > Yes, I think it is a bug.
> Since you found this bug and I have finished my v3 kernel patch set(also 
> fix this tmpfile umask problem and add your reported-by, also two 
> patches about other problem in xfs/nfs ), so do you will fix this kernel 
> bug or I send a v3 directly?

You can fix it as part of your series and I see you've already included
it in v3 anyway.