On Tue, Oct 04, 2022 at 05:01:06PM +0300, Amir Goldstein wrote: > On Tue, Oct 4, 2022 at 1:59 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > On Mon, Oct 03, 2022 at 03:30:40PM +0300, Amir Goldstein wrote: > > > Underlying fs doesn't remove privs because fallocate is called with > > > privileged mounter credentials. > > > > > > This fixes some failure in fstests generic/683..687. > > > > > > Fixes: aab8848cee5e ("ovl: add ovl_fallocate()") > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> > > > --- > > > fs/overlayfs/file.c | 12 +++++++++++- > > > 1 file changed, 11 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > > > index c8308da8909a..e90ac5376456 100644 > > > --- a/fs/overlayfs/file.c > > > +++ b/fs/overlayfs/file.c > > > @@ -517,9 +517,16 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len > > > const struct cred *old_cred; > > > int ret; > > > > > > + inode_lock(inode); > > > + /* Update mode */ > > > + ovl_copyattr(inode); > > > + ret = file_remove_privs(file); > > > > First, thank you for picking this up! > > > > Let me analyze generic/683 failure of Test1 to see why you still see > > failures in this test: > > > > echo "Test 1 - qa_user, non-exec file $verb" > > setup_testfile > > chmod a+rws $junk_file > > commit_and_check "$qa_user" "$verb" 64k 64k > > > > So this creates a file with 6666 permissions. While the file has the > > S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. This is > > important in a little bit. > > > > On a regular filesystem like xfs what will happen is: > > > > sys_fallocate() > > -> vfs_fallocate() > > -> xfs_file_fallocate() > > -> file_modified() > > -> __file_remove_privs() > > -> dentry_needs_remove_privs() > > -> should_remove_suid() > > -> __remove_privs() > > newattrs.ia_valid = ATTR_FORCE | kill; > > -> notify_change() > > > > In should_remove_suid() we can see that ATTR_KILL_SUID is raised > > unconditionally because the file in the test has S_ISUID set. > > > > But we also see that ATTR_KILL_SGID won't be set because while the file > > is S_ISGID it is not S_IXGRP (see above) which is a condition for > > ATTR_KILL_SGID being raised. > > > > So by the time we call notify_change() we have attr->ia_valid set to > > ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that > > ATTR_KILL_SUID is set and does: > > > > ia_valid = attr->ia_valid |= ATTR_MODE > > attr->ia_mode = (inode->i_mode & ~S_ISUID); > > > > which means that when we call setattr_copy() later we will definitely > > update inode->i_mode. Note that attr->ia_mode still contain S_ISGID. > > > > Now we call into the filesystem's ->setattr() inode operation which will end up > > calling setattr_copy(). Since ATTR_MODE is set we will hit: > > > > if (ia_valid & ATTR_MODE) { > > umode_t mode = attr->ia_mode; > > vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); > > if (!vfsgid_in_group_p(vfsgid) && > > !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) > > mode &= ~S_ISGID; > > inode->i_mode = mode; > > } > > > > Can you think of a reason why the above should not be done > in notify_change() before even calling to ->setattr()? > > Although, it wouldn't help because ovl_setattr() does: > > if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID)) > attr->ia_valid &= ~ATTR_MODE; > > > and since the caller in the test is neither capable nor in the group of the > > inode the S_ISGID bit is stripped. > > > > But now contrast this with overlayfs even after your changes. When > > ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() > > and calls ovl_do_notify_change() then we are doing this under the > > mounter's creds and so the S_ISGID bit is retained: > > > > sys_fallocate() > > -> vfs_fallocate() > > -> ovl_fallocate() > > -> file_remove_privs() > > -> dentry_needs_remove_privs() > > -> should_remove_suid() > > -> __remove_privs() > > newattrs.ia_valid = attr_force | kill; > > -> notify_change() > > -> ovl_setattr() > > // TAKE ON MOUNTER'S CREDS > > -> ovl_do_notify_change() > > // GIVE UP MOUNTER'S CREDS > > // TAKE ON MOUNTER'S CREDS > > -> vfs_fallocate() > > -> xfs_file_fallocate() > > -> file_modified() > > -> __file_remove_privs() > > -> dentry_needs_remove_privs() > > -> should_remove_suid() > > -> __remove_privs() > > newattrs.ia_valid = attr_force | kill; > > -> notify_change() > > The model in overlayfs is that security is checked twice > once on overlay inode with caller creds and once again > on xfs inode with mounter creds. Either of these checks > could result in clearing SUID/SGID bits. Yep. > > In the call stack above, the outer should_remove_suid() > with caller creds sets ATTR_KILL_SUID and then the outer > notify_change() clears SUID and sets ATTR_MODE, Yes. > but ovl_setattr() clears ATTR_MODE and then the inner > notify_change() re-clears SUID and sets ATTR_MODE again. Yes. > > If the outer notify_change() would have checked the in_group_p() > condition, clear SGID and set a flag ATTR_KILL_SGID_FORCE > then the inner notify_change() would see this flag and re-clear > SGID bit, just the same as it does with SUID bit in the stack stace > above. > > Is this making any sense? What I kept thinking was sm along the lines of: diff --git a/fs/inode.c b/fs/inode.c index ba1de23c13c1..e62a564201b7 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1968,8 +1968,12 @@ int should_remove_suid(struct dentry *dentry) * sgid without any exec bits is just a mandatory locking mark; leave * it alone. If some exec bits are set, it's a real sgid; kill it. */ - if (unlikely((mode & S_ISGID) && (mode & S_IXGRP))) - kill |= ATTR_KILL_SGID; + if (unlikely(mode & S_ISGID)) { + if ((mode & S_IXGRP) || + (!vfsgid_in_group_p(vfsgid) && + !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))) + kill |= ATTR_KILL_SGID; + } if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) return kill; mandatory locks have been removed as well so that remark seems pointless as well?