On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > > Hi all
> > > >
> > > > A quick question about clone_range() and guarantees around
> > > > metadata stability.
> > > >
> > > > Are users required to call fsync/fsync_range() after calling
> > > > clone_range() in order to guarantee that the cloned range
> > > > metadata is persisted?
> > >
> > > Yes.
> > >
> > > > I'm assuming that it is required in order to guarantee that
> > > > data is persisted.
> > >
> > > Data and metadata.  XFS and ocfs2's reflink implementations will
> > > flush the page cache before starting the remap, but they both
> > > require fsync to force the log/journal to disk.
> >
> > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > post vfs_clone_file_range() completion on the server side, yes?
> >
>
> I chose to implement this using a full call to vfs_fsync_range(), since
> we really do want to ensure data stability as well. Consider, for
> instance, the case where client A is running an application, and client
> B runs vfs_clone_file_range() in order to create a point in time
> snapshot of the file for disaster recovery purposes...

Seems reasonable, since (alas) we didn't define the ->remap_range api to
guarantee that for you.

> > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > word for it.)
> > >
> > > > I'm asking because knfsd currently just does a call to
> > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation.
> > > > It does not call fsync()/fsync_range() on the destination file,
> > > > and since the NFSv4.2 protocol does not require you to perform
> > > > any other operation in order to persist data/metadata, I'm
> > > > worried that we may be corrupting the cloned file if the NFS
> > > > server crashes at the wrong moment after the client has been
> > > > told the clone completed.
> >
> > Yup, that's exactly what server side calls to commit_metadata() are
> > supposed to address.
> >
> > I suspect to be correct, this might require commit_metadata() to be
> > called on both the source and destination inodes, as both of them
> > may have modified metadata as a result of the clone operation. For
> > XFS one of them will be a no-op, but for other filesystems that
> > don't implement ->commit_metadata, we'll need to call
> > sync_inode_metadata() on both inodes...
> >
>
> That's interesting. I hadn't considered that a clone might cause the
> source metadata to change as well. What kind of change specifically are
> we talking about? Is it just delayed block allocation, or is there
> more?

In XFS' case, we added a per-inode flag to help us bypass the reference
count lookup during a write if the file has never shared any blocks, so
if you never share anything, you'll never pay any of the runtime costs
of the COW mechanism.

ocfs2's design has a reference count tree that is shared between groups
of files that have been reflinked from each other.  So if you start with
unshared files A and B and clone A to A1 and A2; and B to B1 and B2,
then A* will have their own refcount tree and B* will also have their
own refcount tree.

Either way, nfs has to assume that changes could have been made to the
source file.

--D

> Thanks
>   Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
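
For anyone who wants to see the same rule from user space, here is a rough
sketch of "clone, then fsync both files". It is not the knfsd change
discussed above, only an analogue of it. Assumptions: the helper name
clone_and_persist() is made up for this example, the FICLONE value is
hard-coded from _IOW(0x94, 9, int) and may differ on some architectures,
and the set of errno values treated as "no reflink support" is a guess
that may be too broad for real use. The source file is fsynced as well,
following the reasoning that a clone can modify source metadata.

```python
import errno
import fcntl
import os

# Assumption: FICLONE is _IOW(0x94, 9, int), i.e. 0x40049409 on x86 and
# arm64 Linux. Some architectures encode ioctl numbers differently.
FICLONE = 0x40049409

# errno values treated here as "this filesystem cannot reflink these files".
# EBADF is included because ioctl_ficlonerange(2) lists it for that case too.
_UNSUPPORTED = {
    errno.EOPNOTSUPP,
    errno.ENOTTY,
    errno.EINVAL,
    errno.EXDEV,
    errno.ENOSYS,
    errno.EBADF,
}


def clone_and_persist(src_path, dst_path):
    """Clone src_path into dst_path, then fsync both files.

    Returns True if the clone succeeded and both fsync calls returned,
    False if the clone ioctl reported that cloning is not supported.
    """
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            try:
                # Share all of src's blocks with dst (whole-file clone).
                fcntl.ioctl(dst_fd, FICLONE, src_fd)
            except OSError as exc:
                if exc.errno in _UNSUPPORTED:
                    return False
                raise
            # The clone alone does not force the log/journal to disk,
            # so persist the destination explicitly.
            os.fsync(dst_fd)
            # The source may have had metadata changed by the clone
            # (for example a per-inode flag or a shared refcount tree).
            os.fsync(src_fd)
            return True
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

On a filesystem without reflink support the helper returns False and
leaves an empty destination file behind, so a caller would have to fall
back to an ordinary copy (and still fsync the result).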