On Wed, 2017-03-08 at 15:32 -0500, bfields@xxxxxxxxxxxx wrote:
> On Wed, Mar 08, 2017 at 08:18:31PM +0000, Trond Myklebust wrote:
> > On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > > > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields
> > > > <bfields@fieldses.org> wrote:
> > > >
> > > > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> > > > >
> > > > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig
> > > > > > <hch@infradead.org> wrote:
> > > > > >
> > > > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
> > > > > > > Since copy isn't atomic that check is never going to be
> > > > > > > reliable.
> > > > > >
> > > > > > That's true for everything that COPY does. By that logic we
> > > > > > should not implement it at all (a logic that I'd fully support)
> > > > >
> > > > > If you were to only keep CLONE then you’d lose a huge performance
> > > > > gain you get from server-to-server COPY.
> > > >
> > > > Yes. Also, I think copy-like copy implementations have reasonable
> > > > semantics that are basically the same as read:
> > > >
> > > >   - copy can return successfully with less copied than requested.
> > > >   - it's fine for the copied range to start and/or end past end of
> > > >     file, it'll just return a short read.
> > > >   - A copy of more than 0 bytes returning 0 means you're at end of
> > > >     file.
> > > >
> > > > The particular problem here is that that doesn't fit how clone
> > > > works at all.
> > > >
> > > > It feels like what happened is that copy_file_range() was made
> > > > mainly for the clone case, with the idea that copy might be
> > > > reluctantly accepted as a second-class implementation.
> >
> > Historically? No...
> > Christoph added clone as a valid implementation of
> > copy_file_range() almost a year after Zach and Anna defined the
> > semantics of vfs_copy_file_range(). git blame is your friend...
>
> Yeah, I know. It still feels to me like the interface was originally
> designed with clone in mind, but that's my vague impression from the
> man pages and half-remembered conversations.
>
> Though the lack of a "just copy the whole file regardless of size"
> case is weird for clone. All you can do is stat the file and then hope
> it doesn't change before you issue the copy_file_range. But I'd think
> it'd be easy for an atomic clone implementation to handle, say,
> getting a snapshot of a log file while it's getting continuously
> appended to.

It really isn't that interesting in the continuously appended case (what
difference does it make if you only get data from just a few moments
ago), but I can see it being an issue in the case of random writes where
the file size is being extended. The thing is that in both those cases,
the copy_file_range() semantics are worse, since they don't even
guarantee a time-consistent copy.

> > > > But the performance gain of copy offload is too big to just
> > > > ignore, and in fact it's what copy_file_range does on every
> > > > filesystem but btrfs and ocfs2 (and maybe cifs?), so I don't
> > > > think we can just ignore it.
> > > >
> > > > If we had separate copy_file_range and clone_file_range, I
> > > > *think* it could all be made sensible. Am I missing something?
> > > >
> > >
> > > How would the application (cp) know when to call the
> > > clone_file_range and when to call copy_file_range?
> >
> > cp can probably call copy_file_range(), but any application that
> > needs atomic semantics (i.e. a binary operation success/fail) must
> > call clone_file_range().
>
> I don't believe there's a clone_file_range(). I see the vfs interface,
> but no system call.
There is a standard FICLONERANGE ioctl() that can be used on all
filesystems that support the vfs interface.

> And implementing a simple cp is harder than it should be when you
> don't know whether it's implemented as copy or clone. You have to stat
> for the file size first, retry if you got it wrong, and also retry if
> you get a short read. The example in the clone_file_range() man page
> is incomplete.

As I said, you shouldn't be using copy_file_range() either in the case
where the file is being modified.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx