On Tue, 2019-12-03 at 08:35 -0800, Darrick J. Wong wrote:
> On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> > On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > > > Hi all
> > > > >
> > > > > A quick question about clone_range() and guarantees around
> > > > > metadata stability.
> > > > >
> > > > > Are users required to call fsync/fsync_range() after calling
> > > > > clone_range() in order to guarantee that the cloned range
> > > > > metadata is persisted?
> > > >
> > > > Yes.
> > > >
> > > > > I'm assuming that it is required in order to guarantee that
> > > > > data is persisted.
> > > >
> > > > Data and metadata. XFS and ocfs2's reflink implementations will
> > > > flush the page cache before starting the remap, but they both
> > > > require fsync to force the log/journal to disk.
> > >
> > > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > > post vfs_clone_file_range() completion on the server side, yes?
> > >
> >
> > I chose to implement this using a full call to vfs_fsync_range(),
> > since we really do want to ensure data stability as well. Consider,
> > for instance, the case where client A is running an application, and
> > client B runs vfs_clone_file_range() in order to create a point in
> > time snapshot of the file for disaster recovery purposes...
>
> Seems reasonable, since (alas) we didn't define the ->remap_range api
> to guarantee that for you.
>
> > > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > > word for it.)
> > > >
> > > > > I'm asking because knfsd currently just does a call to
> > > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation.
> > > > > It does not call fsync()/fsync_range() on the destination
> > > > > file, and since the NFSv4.2 protocol does not require you to
> > > > > perform any other operation in order to persist
> > > > > data/metadata, I'm worried that we may be corrupting the
> > > > > cloned file if the NFS server crashes at the wrong moment
> > > > > after the client has been told the clone completed.
> > >
> > > Yup, that's exactly what server side calls to commit_metadata()
> > > are supposed to address.
> > >
> > > I suspect to be correct, this might require commit_metadata() to
> > > be called on both the source and destination inodes, as both of
> > > them may have modified metadata as a result of the clone
> > > operation. For XFS one of them will be a no-op, but for other
> > > filesystems that don't implement ->commit_metadata, we'll need to
> > > call sync_inode_metadata() on both inodes...
> > >
> >
> > That's interesting. I hadn't considered that a clone might cause
> > the source metadata to change as well. What kind of change
> > specifically are we talking about? Is it just delayed block
> > allocation, or is there more?
>
> In XFS' case, we added a per-inode flag to help us bypass the
> reference count lookup during a write if the file has never shared
> any blocks, so if you never share anything, you'll never pay any of
> the runtime costs of the COW mechanism.
>
> ocfs2's design has a reference count tree that is shared between
> groups of files that have been reflinked from each other. So if you
> start with unshared files A and B and clone A to A1 and A2; and B to
> B1 and B2, then A* will have their own refcount tree and B* will also
> have their own refcount tree.
>
> Either way, nfs has to assume that changes could have been made to
> the source file.

Interesting. Thanks for the explanation! I'll try to send off an
amended patch to Bruce (hopefully before he merges).
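
For reference, the pattern discussed above (clone, then fsync so the
remap reaches stable storage) can be sketched from userspace. This is
only an illustration, not the knfsd patch: the helper name
clone_range_durable() is hypothetical, it uses the FICLONERANGE ioctl
rather than the in-kernel vfs_clone_file_range()/vfs_fsync_range()
calls, and it assumes that syncing both files is the conservative
choice, since the source metadata may change too.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONERANGE, struct file_clone_range */

/*
 * Hypothetical helper (not from the thread): reflink a byte range from
 * src_fd into dst_fd, then fsync both files so that data and metadata
 * for the remap are persisted before success is reported.
 * Returns 0 on success or a negative errno value.
 */
static int clone_range_durable(int src_fd, int dst_fd,
                               unsigned long long src_off,
                               unsigned long long len,
                               unsigned long long dst_off)
{
    struct file_clone_range fcr = {
        .src_fd      = src_fd,
        .src_offset  = src_off,
        .src_length  = len,      /* 0 means "to the end of the source" */
        .dest_offset = dst_off,
    };

    /* The ioctl is issued on the destination file descriptor. */
    if (ioctl(dst_fd, FICLONERANGE, &fcr) < 0)
        return -errno;

    /* The clone alone does not force the log/journal to disk. */
    if (fsync(dst_fd) < 0)
        return -errno;

    /* The source may have modified metadata as well, so sync it too. */
    if (fsync(src_fd) < 0)
        return -errno;

    return 0;
}
```

On a filesystem without reflink support the ioctl is expected to fail,
and the helper simply passes the error back to the caller.
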
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx