On Tue, Oct 04, 2022 at 04:13:05PM +0200, Christian Brauner wrote: > On Tue, Oct 04, 2022 at 05:01:06PM +0300, Amir Goldstein wrote: > > On Tue, Oct 4, 2022 at 1:59 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > > > On Mon, Oct 03, 2022 at 03:30:40PM +0300, Amir Goldstein wrote: > > > > Underlying fs doesn't remove privs because fallocate is called with > > > > privileged mounter credentials. > > > > > > > > This fixes some failure in fstests generic/683..687. > > > > > > > > Fixes: aab8848cee5e ("ovl: add ovl_fallocate()") > > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> > > > > --- > > > > fs/overlayfs/file.c | 12 +++++++++++- > > > > 1 file changed, 11 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > > > > index c8308da8909a..e90ac5376456 100644 > > > > --- a/fs/overlayfs/file.c > > > > +++ b/fs/overlayfs/file.c > > > > @@ -517,9 +517,16 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len > > > > const struct cred *old_cred; > > > > int ret; > > > > > > > > + inode_lock(inode); > > > > + /* Update mode */ > > > > + ovl_copyattr(inode); > > > > + ret = file_remove_privs(file); > > > > > > First, thank you for picking this up! > > > > > > Let me analyze generic/683 failure of Test1 to see why you still see > > > failures in this test: > > > > > > echo "Test 1 - qa_user, non-exec file $verb" > > > setup_testfile > > > chmod a+rws $junk_file > > > commit_and_check "$qa_user" "$verb" 64k 64k > > > > > > So this creates a file with 6666 permissions. While the file has the > > > S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. This is > > > important in a little bit. > > > > > > On a regular filesystem like xfs what will happen is: > > > > > > sys_fallocate() > > > -> vfs_fallocate() > > > -> xfs_file_fallocate() > > > -> file_modified() > > > -> __file_remove_privs() > > > -> dentry_needs_remove_privs() > > > -> should_remove_suid() > > > -> __remove_privs() > > > newattrs.ia_valid = ATTR_FORCE | kill; > > > -> notify_change() > > > > > > In should_remove_suid() we can see that ATTR_KILL_SUID is raised > > > unconditionally because the file in the test has S_ISUID set. > > > > > > But we also see that ATTR_KILL_SGID won't be set because while the file > > > is S_ISGID it is not S_IXGRP (see above) which is a condition for > > > ATTR_KILL_SGID being raised. > > > > > > So by the time we call notify_change() we have attr->ia_valid set to > > > ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that > > > ATTR_KILL_SUID is set and does: > > > > > > ia_valid = attr->ia_valid |= ATTR_MODE > > > attr->ia_mode = (inode->i_mode & ~S_ISUID); > > > > > > which means that when we call setattr_copy() later we will definitely > > > update inode->i_mode. Note that attr->ia_mode still contain S_ISGID. > > > > > > Now we call into the filesystem's ->setattr() inode operation which will end up > > > calling setattr_copy(). Since ATTR_MODE is set we will hit: > > > > > > if (ia_valid & ATTR_MODE) { > > > umode_t mode = attr->ia_mode; > > > vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); > > > if (!vfsgid_in_group_p(vfsgid) && > > > !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) > > > mode &= ~S_ISGID; > > > inode->i_mode = mode; > > > } > > > > > > > Can you think of a reason why the above should not be done > > in notify_change() before even calling to ->setattr()? > > > > Although, it wouldn't help because ovl_setattr() does: > > > > if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID)) > > attr->ia_valid &= ~ATTR_MODE; > > > > > and since the caller in the test is neither capable nor in the group of the > > > inode the S_ISGID bit is stripped. > > > > > > But now contrast this with overlayfs even after your changes. When > > > ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() > > > and calls ovl_do_notify_change() then we are doing this under the > > > mounter's creds and so the S_ISGID bit is retained: > > > > > > sys_fallocate() > > > -> vfs_fallocate() > > > -> ovl_fallocate() > > > -> file_remove_privs() > > > -> dentry_needs_remove_privs() > > > -> should_remove_suid() > > > -> __remove_privs() > > > newattrs.ia_valid = attr_force | kill; > > > -> notify_change() > > > -> ovl_setattr() > > > // TAKE ON MOUNTER'S CREDS > > > -> ovl_do_notify_change() > > > // GIVE UP MOUNTER'S CREDS > > > // TAKE ON MOUNTER'S CREDS > > > -> vfs_fallocate() > > > -> xfs_file_fallocate() > > > -> file_modified() > > > -> __file_remove_privs() > > > -> dentry_needs_remove_privs() > > > -> should_remove_suid() > > > -> __remove_privs() > > > newattrs.ia_valid = attr_force | kill; > > > -> notify_change() > > > > The model in overlayfs is that security is checked twice > > once on overlay inode with caller creds and once again > > on xfs inode with mounter creds. Either of these checks > > could result in clearing SUID/SGID bits. > > Yep. > > > > > In the call stack above, the outer should_remove_suid() > > with caller creds sets ATTR_KILL_SUID and then the outer > > notify_change() clears SUID and sets ATTR_MODE, > > Yes. > > > but ovl_setattr() clears ATTR_MODE and then the inner > > notify_change() re-clears SUID and sets ATTR_MODE again. > > Yes. > > > > > If the outer notify_change() would have checked the in_group_p() > > condition, clear SGID and set a flag ATTR_KILL_SGID_FORCE > > then the inner notify_change() would see this flag and re-clear > > SGID bit, just the same as it does with SUID bit in the stack stace > > above. > > > > Is this making any sense? > > What I kept thinking was sm along the lines of: > > diff --git a/fs/inode.c b/fs/inode.c > index ba1de23c13c1..e62a564201b7 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1968,8 +1968,12 @@ int should_remove_suid(struct dentry *dentry) > * sgid without any exec bits is just a mandatory locking mark; leave > * it alone. If some exec bits are set, it's a real sgid; kill it. > */ > - if (unlikely((mode & S_ISGID) && (mode & S_IXGRP))) > - kill |= ATTR_KILL_SGID; > + if (unlikely(mode & S_ISGID)) { > + if ((mode & S_IXGRP) || > + (!vfsgid_in_group_p(vfsgid) && > + !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))) > + kill |= ATTR_KILL_SGID; > + } > > if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) > return kill; > > mandatory locks have been removed as well so that remark seems pointless > as well? My feeling here is that both should_remove_suid() and notify_change() need to apply the same permission checks as setattr_prepare() and setattr_copy() instead of all this special casing them. I don't see a good reason to not require the same checks. So sm like (__completely untested__): >From 922f9f123ab6531c29bf05585ef88c17fe65dba3 Mon Sep 17 00:00:00 2001 From: Christian Brauner <brauner@xxxxxxxxxx> Date: Tue, 4 Oct 2022 16:13:34 +0200 Subject: [UNTESTED PATCH] attr: use consistent sgid stripping checks Require the same permissions as setattr_prepare() and setattr_copy() have instead of all these special cases. We can probably consolidate this even more... Signed-off-by: Christian Brauner (Microsoft) <brauner@xxxxxxxxxx> --- fs/attr.c | 2 +- fs/fuse/file.c | 2 +- fs/inode.c | 41 ++++++++++++++++++++++++----------------- fs/internal.h | 4 +++- fs/ocfs2/file.c | 4 ++-- fs/open.c | 2 +- include/linux/fs.h | 2 +- 7 files changed, 33 insertions(+), 24 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index 1552a5f23d6b..9262c6b31c26 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -375,7 +375,7 @@ int notify_change(struct user_namespace *mnt_userns, struct dentry *dentry, } } if (ia_valid & ATTR_KILL_SGID) { - if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) { + if (should_remove_sgid(mnt_userns, dentry)) { if (!(ia_valid & ATTR_MODE)) { ia_valid = attr->ia_valid |= ATTR_MODE; attr->ia_mode = inode->i_mode; diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 1a3afd469e3a..fccc2c7e88fd 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1313,7 +1313,7 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from) return err; if (fc->handle_killpriv_v2 && - should_remove_suid(file_dentry(file))) { + should_remove_suid(&init_user_ns, file_dentry(file))) { goto writethrough; } diff --git a/fs/inode.c b/fs/inode.c index ba1de23c13c1..c639aefe01c3 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1949,27 +1949,33 @@ void touch_atime(const struct path *path) } EXPORT_SYMBOL(touch_atime); -/* - * The logic we want is - * - * if suid or (sgid and xgrp) - * remove privs - */ -int should_remove_suid(struct dentry *dentry) +bool should_remove_sgid(struct user_namespace *mnt_userns, struct dentry *dentry) +{ + struct inode *inode = d_inode(dentry); + umode_t mode = inode->i_mode; + + if (unlikely(mode & S_ISGID)) { + if ((mode & S_IXGRP) || + (!vfsgid_in_group_p(i_gid_into_vfsgid(mnt_userns, inode)) && + !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))) + return true; + } + + return false; +} + +int should_remove_suid(struct user_namespace *mnt_userns, struct dentry *dentry) { - umode_t mode = d_inode(dentry)->i_mode; + struct inode *inode = d_inode(dentry); + umode_t mode = inode->i_mode; int kill = 0; /* suid always must be killed */ if (unlikely(mode & S_ISUID)) kill = ATTR_KILL_SUID; - /* - * sgid without any exec bits is just a mandatory locking mark; leave - * it alone. If some exec bits are set, it's a real sgid; kill it. - */ - if (unlikely((mode & S_ISGID) && (mode & S_IXGRP))) - kill |= ATTR_KILL_SGID; + if (should_remove_sgid(mnt_userns, dentry)) + kill = ATTR_KILL_SGID; if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) return kill; @@ -1983,7 +1989,8 @@ EXPORT_SYMBOL(should_remove_suid); * response to write or truncate. Return 0 if nothing has to be changed. * Negative value on error (change should be denied). */ -int dentry_needs_remove_privs(struct dentry *dentry) +int dentry_needs_remove_privs(struct user_namespace *mnt_userns, + struct dentry *dentry) { struct inode *inode = d_inode(dentry); int mask = 0; @@ -1992,7 +1999,7 @@ int dentry_needs_remove_privs(struct dentry *dentry) if (IS_NOSEC(inode)) return 0; - mask = should_remove_suid(dentry); + mask = should_remove_suid(mnt_userns, dentry); ret = security_inode_need_killpriv(dentry); if (ret < 0) return ret; @@ -2024,7 +2031,7 @@ static int __file_remove_privs(struct file *file, unsigned int flags) if (IS_NOSEC(inode) || !S_ISREG(inode->i_mode)) return 0; - kill = dentry_needs_remove_privs(dentry); + kill = dentry_needs_remove_privs(file_mnt_user_ns(file), dentry); if (kill < 0) return kill; diff --git a/fs/internal.h b/fs/internal.h index 1e67b4b9a4d1..ae152ded227c 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -139,7 +139,8 @@ extern int vfs_open(const struct path *, struct file *); * inode.c */ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); -extern int dentry_needs_remove_privs(struct dentry *dentry); +extern int dentry_needs_remove_privs(struct user_namespace *, + struct dentry *dentry); /* * fs-writeback.c @@ -226,3 +227,4 @@ int do_set_acl(struct user_namespace *mnt_userns, struct dentry *dentry, const char *acl_name, const void *kvalue, size_t size); ssize_t do_get_acl(struct user_namespace *mnt_userns, struct dentry *dentry, const char *acl_name, void *kvalue, size_t size); +int should_remove_sgid(struct user_namespace *mnt_userns, struct dentry *dentry); diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index af900aaa9275..0a33096a3852 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -1991,7 +1991,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, } } - if (file && should_remove_suid(file->f_path.dentry)) { + if (file && should_remove_suid(&init_user_ns, file->f_path.dentry)) { ret = __ocfs2_write_remove_suid(inode, di_bh); if (ret) { mlog_errno(ret); @@ -2279,7 +2279,7 @@ static int ocfs2_prepare_inode_for_write(struct file *file, * inode. There's also the dinode i_size state which * can be lost via setattr during extending writes (we * set inode->i_size at the end of a write. */ - if (should_remove_suid(dentry)) { + if (should_remove_suid(&init_user_ns, dentry)) { if (meta_level == 0) { ocfs2_inode_unlock_for_extent_tree(inode, &di_bh, diff --git a/fs/open.c b/fs/open.c index 8a813fa5ca56..ecb7b8a58275 100644 --- a/fs/open.c +++ b/fs/open.c @@ -54,7 +54,7 @@ int do_truncate(struct user_namespace *mnt_userns, struct dentry *dentry, } /* Remove suid, sgid, and file capabilities on truncate too */ - ret = dentry_needs_remove_privs(dentry); + ret = dentry_needs_remove_privs(mnt_userns, dentry); if (ret < 0) return ret; if (ret) diff --git a/include/linux/fs.h b/include/linux/fs.h index badff81b9dde..58347514421e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3092,7 +3092,7 @@ extern void __destroy_inode(struct inode *); extern struct inode *new_inode_pseudo(struct super_block *sb); extern struct inode *new_inode(struct super_block *sb); extern void free_inode_nonrcu(struct inode *inode); -extern int should_remove_suid(struct dentry *); +extern int should_remove_suid(struct user_namespace *, struct dentry *); extern int file_remove_privs(struct file *); /* -- 2.34.1