Re: [PATCH v3 0/5] Add a new fchmodat4() syscall

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 25 Jul 2023 14:05:43 +0200

On Tue, Jul 25, 2023 at 01:05:40PM +0200, Alexey Gladkov wrote:
> On Tue, Jul 11, 2023 at 05:14:24PM +0200, Christian Brauner wrote:
> > On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote:
> > > * Alexey Gladkov:
> > > 
> > > > This patch set adds fchmodat4(), a new syscall. The actual
> > > > implementation is super simple: essentially it's just the same as
> > > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> > > > I've attempted to make this match "man 2 fchmodat" as closely as
> > > > possible, which says EINVAL is returned for invalid flags (as opposed to
> > > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> > > > I have a sketch of a glibc patch that I haven't even compiled yet, but
> > > > seems fairly straight-forward:
> > > >
> > > >     diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c
> > > >     index 6d9cbc1ce9e0..b1beab76d56c 100644
> > > >     --- a/sysdeps/unix/sysv/linux/fchmodat.c
> > > >     +++ b/sysdeps/unix/sysv/linux/fchmodat.c
> > > >     @@ -29,12 +29,36 @@
> > > >      int
> > > >      fchmodat (int fd, const char *file, mode_t mode, int flag)
> > > >      {
> > > >     -  if (flag & ~AT_SYMLINK_NOFOLLOW)
> > > >     -    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
> > > >     -#ifndef __NR_lchmod		/* Linux so far has no lchmod syscall.  */
> > > >     +  /* There are four paths through this code:
> > > >     +      - The flags are zero.  In this case it's fine to call fchmodat.
> > > >     +      - The flags are non-zero and glibc doesn't have access to
> > > >     +	__NR_fchmodat4.  In this case all we can do is emulate the error codes
> > > >     +	defined by the glibc interface from userspace.
> > > >     +      - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has
> > > >     +	fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
> > > >     +	matches glibc's library interface so it can be called directly.
> > > >     +      - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does
> > > 
> > > If you define __NR_fchmodat4 on all architectures, we can use these
> > > constants directly in glibc.  We no longer depend on the UAPI
> > > definitions of those constants, to cut down the number of code variants,
> > > and to make glibc's system call profile independent of the kernel header
> > > version at build time.
> > > 
> > > Your version is based on 2.31, more recent versions have some reasonable
> > > emulation for fchmodat based on /proc/self/fd.  I even wrote a comment
> > > describing the same buggy behavior that you witnessed:
> > > 
> > > +      /* Some Linux versions with some file systems can actually
> > > +        change symbolic link permissions via /proc, but this is not
> > > +        intentional, and it gives inconsistent results (e.g., error
> > > +        return despite mode change).  The expected behavior is that
> > > +        symbolic link modes cannot be changed at all, and this check
> > > +        enforces that.  */
> > > +      if (S_ISLNK (st.st_mode))
> > > +       {
> > > +         __close_nocancel (pathfd);
> > > +         __set_errno (EOPNOTSUPP);
> > > +         return -1;
> > > +       }
> > > 
> > > I think there was some kernel discussion about that behavior before, but
> > > apparently, it hasn't led to fixes.
> > 
> > I think I've explained this somewhere else a couple of months ago but
> > just in case you weren't on that thread or don't remember and apologies
> > if you should already know.
> > 
> > A lot of filesystem will happily update the mode of a symlink. The VFS
> > doesn't do anything to prevent this from happening. This is filesystem
> > specific.
> > 
> > The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs.
> > Specifically it comes from filesystems that call posix_acl_chmod(),
> > e.g., btrfs via
> > 
> >         if (!err && attr->ia_valid & ATTR_MODE)
> >                 err = posix_acl_chmod(idmap, dentry, inode->i_mode);
> > 
> > Most filesystems don't implement i_op->set_acl() for POSIX ACLs.
> > So posix_acl_chmod() will report EOPNOTSUPP. By the time
> > posix_acl_chmod() is called, most filesystems will have finished
> > updating the inode. POSIX ACLs also often aren't integrated into
> > transactions so a rollback wouldn't even be possible on some
> > filesystems.
> > 
> > Any filesystem that doesn't implement POSIX ACLs at all will obviously
> > never fail unless it blocks mode changes on symlinks. Or filesystems
> > that do have a way to rollback failures from posix_acl_chmod(), or
> > filesystems that do return an error on chmod() on symlinks such as 9p,
> > ntfs, ocfs2.
> > 
> > > 
> > > I wonder if it makes sense to add a similar error return to the system
> > > call implementation?
> > 
> > Hm, blocking symlink mode changes is pretty regression prone. And just
> > blocking it through one interface seems weird and makes things even more
> > inconsistent.
> > 
> > So two options I see:
> > (1) minimally invasive:
> >     Filesystems that do call posix_acl_chmod() on symlinks need to be
> >     changed to stop doing that.
> > (2) might hit us on the head invasive:
> >     Try and block symlink mode changes in chmod_common().
> > 
> > Thoughts?
> > 
> 
> We have third option. We can choose not to call chmod_common and return an
> error right away:
> 
> diff --git a/fs/open.c b/fs/open.c
> index 39a7939f0d00..86a427a2a083 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -679,7 +679,9 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, int l
>  retry:
>         error = user_path_at(dfd, filename, lookup_flags, &path);
>         if (!error) {
> -               error = chmod_common(&path, mode);
> +               error = -EOPNOTSUPP;
> +               if (!(flags & AT_SYMLINK_NOFOLLOW) || !S_ISLNK(path.dentry->d_inode->i_mode))
> +                       error = chmod_common(&path, mode);
>                 path_put(&path);
>                 if (retry_estale(error, lookup_flags)) {
>                         lookup_flags |= LOOKUP_REVAL;
> 
> It doesn't seem to be invasive.

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=77b652535528770217186589d97261847f15f862