Re: [PATCH 03/11] vfs: no fallback for ->copy_file_range

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 7 Dec 2018 08:30:58 +1100

On Thu, Dec 06, 2018 at 06:16:46AM +0200, Amir Goldstein wrote:
> On Tue, Dec 4, 2018 at 1:02 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Dec 03, 2018 at 12:22:21PM +0200, Amir Goldstein wrote:
> > > On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > >
> > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > >
> > > > Now that we have generic_copy_file_range(), remove it as a fallback
> > > > case when offloads fail. This puts the responsibility for executing
> > > > fallbacks on the filesystems that implement ->copy_file_range and
> > > > allows us to add operational validity checks to
> > > > generic_copy_file_range().
> > > >
> > > > Rework vfs_copy_file_range() to call a new do_copy_file_range()
> > > > helper to exceute the copying callout, and move calls to
> > > > generic_file_copy_range() into filesystem methods where they
> > > > currently return failures.
> > > >
> > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > You may add
> > > Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > >
> > > After fixing the overlayfs issue below.
> > > ...
> > >
> > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > > index 84dd957efa24..68736e5d6a56 100644
> > > > --- a/fs/overlayfs/file.c
> > > > +++ b/fs/overlayfs/file.c
> > > > @@ -486,8 +486,15 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
> > > >                                    struct file *file_out, loff_t pos_out,
> > > >                                    size_t len, unsigned int flags)
> > > >  {
> > > > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > > > +       ssize_t ret;
> > > > +
> > > > +       ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > > >                             OVL_COPY);
> > > > +
> > > > +       if (ret == -EOPNOTSUPP)
> > > > +               ret = generic_copy_file_range(file_in, pos_in, file_out,
> > > > +                                       pos_out, len, flags);
> > > > +       return ret;
> > > >  }
> > > >
> > >
> > > This is unneeded, because ovl_copyfile(OVL_COPY) is implemented
> > > by calling vfs_copy_file_range() (on the underlying files) and it is
> > > not possible
> > > to get EOPNOTSUPP from vfs_copy_file_range().
> >
> > Except that it is possible. e.g. If the underlying filesystem tries
> > a copy offload, gets a "not supported" failure from the remote
> > server and then doesn't implement a fallback.
> >
> 
> I'm in the opinion that ovl_copy_file_range() and do_copy_file_range()
> are a like. If you choose not to fallback in the latter to
> generic_copy_file_range() for misbehaving filesystem and WARN_ON
> this case, there is no reason for overlayfs to cover up for the
> misbehaving underlying filesystem.
> 
> If you want to cover up for misbehaving filesystem, please do it
> in do_copy_file_range() and drop the WARN_ON_ONCE().
> Come to think about it, I understand your reasoning for pushing
> generic_copy_file_range() down to filesystems so they can fallback to
> it in several error conditions.
> I do not follow the reasoning of NOT falling back to
> generic_copy_file_range() in vfs if EOPNOTSUPP is returned from
> filesystem. IOW, if we want to cover up for misbehaving filesystem,
> this would have been a more robust code:

Since when have we defined a filesystem returning -EOPNOTSUPP as a
"misbehaving filesystem"? Userspace has to handle errors in
copy_file_range() with it's own fallback copy code (i.e. it cannot
rely on the kernel actually supporting copy_file_range at all).
Hence it's perfectly fine for a filesystem implementation to encode
"offload or fail entirely" semantics if they want.

Yes, I've been shouted at by developers quite recently who
*demanded* that copy_file_range (and other offloads like
fallocate(ZERO_RANGE)) *fail* if they cannot "offload" the operation
to make it "fast". The application developers want to use different
algorithms if the kernel offload isn't any faster than userspace
doing the dumb thing and phsyically pushing bytes around itself.

I've pushed back on this as much as I can, but it doesn't change the
fact that for many situations doing do_splice_direct() is exactly
the wrong thing to do (e.g. because copy_file_range() on a TB+ scale
file couldn't be offloaded by the filesystem because the server said
EOPNOTSUPP)

IOWs, for some filesystems or situations where it makes sense to
have fail-fast semantics and leave the decision of what to do next
in the hands of the userspace application that has the context
necessary to determine what the best action to take is.  And to do
that, we need to give control of the fallback to the filesystems.

Flexibility is what is needed here, not a dumb, hard coded "the VFS
always know what's right for you" policy that triggers when nobody
really wants it to.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx