Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file

"bfields@xxxxxxxxxxxx" <bfields@xxxxxxxxxxxx> · Wed, 8 Mar 2017 15:32:36 -0500

On Wed, Mar 08, 2017 at 08:18:31PM +0000, Trond Myklebust wrote:
> On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx>
> > > wrote:
> > > 
> > > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> > > > 
> > > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.org>
> > > > > wrote:
> > > > > 
> > > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields
> > > > > wrote:
> > > > > > Since copy isn't atomic that check is never going to be
> > > > > > reliable.
> > > > > 
> > > > > > That's true for everything that COPY does.  By that logic we should
> > > > > > not implement it at all (a logic that I'd fully support)
> > > > 
> > > > If you were to only keep CLONE then you’d lose a huge performance gain
> > > > you get from server-to-server COPY.
> > > 
> > > Yes.  Also, I think copy-like copy implementations have reasonable
> > > semantics that are basically the same as read:
> > > 
> > > 	- copy can return successfully with less copied than requested.
> > > 	- it's fine for the copied range to start and/or end past end of
> > > 	  file, it'll just return a short read.
> > > 	- A copy of more than 0 bytes returning 0 means you're at end of
> > > 	  file.
> > > 
> > > The particular problem here is that that doesn't fit how clone works at
> > > all.
> > > 
> > > It feels like what happened is that copy_file_range() was made mainly
> > > for the clone case, with the idea that copy might be reluctantly
> > > accepted as a second-class implementation.
> 
> Historically? No... Christoph added clone as a valid implementation of
> copy_file_range() almost a year after Zach and Anna defined the
> semantics of vfs_copy_file_range(). git blame is your friend...

Yeah, I know.  It still feels to me like the interface was originally
designed with clone in mind, but that's my vague impression from the man
pages and half-remembered conversations.

Though the lack of a "just copy the whole file regardless of size" case
is weird for clone.  All you can do is stat the file and then hope it
doesn't change before you issue the copy_file_range.  But I'd think it'd
be easy for an atomic clone implementation to handle, say, getting a
snapshot of a log file while it's getting continuously appended to.

> > > But the performance gain of copy offload is too big to just ignore, and
> > > in fact it's what copy_file_range does on every filesystem but btrfs and
> > > ocfs2 (and maybe cifs?), so I don't think we can just ignore it.
> > > 
> > > If we had separate copy_file_range and clone_file_range, I *think* it
> > > could all be made sensible.  Am I missing something?
> > > 
> > 
> > How would the application (cp) know when to call the clone_file_range
> > and when to call copy_file_range?
> 
> cp can probably call copy_file_range(), but any application that needs
> atomic semantics (i.e. a binary operation success/fail) must call
> clone_file_range().

I don't believe there's a clone_file_range().  I see the vfs interface,
but no system call.

And implementing a simple cp is harder than it should be when you don't
know whether it's implemented as copy or clone.  You have to stat for
the file size first, retry if you got it wrong, and also retry if you
get a short read.  The example in the copy_file_range() man page is
incomplete.
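
For comparison, here is a minimal sketch of what a simple cp loop could
look like if the call had the read-like semantics discussed above (short
copies allowed, 0 meaning end of file).  It uses Python's
os.copy_file_range() wrapper; the helper names (copy_whole_file,
_copy_some), the chunk size, and the choice of which errno values
trigger a plain read/write fallback are all assumptions made for
illustration, not anything taken from the kernel or the man page.

```python
import errno
import os

CHUNK = 1 << 20  # bytes requested per call; an arbitrary choice

# errno values treated here as "copy_file_range is not usable for this
# pair of files"; this set is an assumption of the sketch.
_FALLBACK_ERRNOS = (errno.ENOSYS, errno.EXDEV, errno.EINVAL, errno.EOPNOTSUPP)


def _copy_some(src_fd, dst_fd, count):
    """Copy up to count bytes between the fds' current positions.

    Returns the number of bytes copied; 0 means end of file.
    """
    try:
        # No explicit offsets, so the kernel advances both file positions.
        return os.copy_file_range(src_fd, dst_fd, count)
    except AttributeError:
        pass  # os.copy_file_range is missing on this platform/Python
    except OSError as e:
        if e.errno not in _FALLBACK_ERRNOS:
            raise
    # Fallback: an ordinary read/write of one chunk.
    buf = os.read(src_fd, count)
    view = memoryview(buf)
    while view:
        view = view[os.write(dst_fd, view):]
    return len(buf)


def copy_whole_file(src_path, dst_path):
    """Copy src_path to dst_path; return the total number of bytes copied."""
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            total = 0
            while True:
                n = _copy_some(src_fd, dst_fd, CHUNK)
                if n == 0:  # a nonzero request returning 0: end of file
                    return total
                total += n  # short copies are fine, just keep going
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Note there is no stat up front: the loop just keeps asking until it gets
0 back.  An implementation that rejects ranges extending past end of
file can't be driven this way, which is where the stat-and-retry
bookkeeping comes from.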

--b.