Re: Question about clone_range() metadata stability

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 3 Dec 2019 07:36:29 +0000

On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > Hi all
> > > 
> > > A quick question about clone_range() and guarantees around
> > > metadata
> > > stability.
> > > 
> > > Are users required to call fsync/fsync_range() after calling
> > > clone_range() in order to guarantee that the cloned range
> > > metadata is
> > > persisted?
> > 
> > Yes.
> > 
> > > I'm assuming that it is required in order to guarantee that
> > > data is persisted.
> > 
> > Data and metadata.  XFS and ocfs2's reflink implementations will
> > flush
> > the page cache before starting the remap, but they both require
> > fsync to
> > force the log/journal to disk.
> 
> So we need to call xfs_fs_nfs_commit_metadata() to get that done
> post vfs_clone_file_range() completion on the server side, yes?
> 

I chose to implement this using a full call to vfs_fsync_range(), since
we really do want to ensure data stability as well. Consider, for
instance, the case where client A is running an application, and client
B runs vfs_clone_file_range() in order to create a point in time
snapshot of the file for disaster recovery purposes...

> > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > word for
> > it.)
> > 
> > > I'm asking because knfsd currently just does a call to
> > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It
> > > does
> > > not call fsync()/fsync_range() on the destination file, and since
> > > the
> > > NFSv4.2 protocol does not require you to perform any other
> > > operation in
> > > order to persist data/metadata, I'm worried that we may be
> > > corrupting
> > > the cloned file if the NFS server crashes at the wrong moment
> > > after the
> > > client has been told the clone completed.
> 
> Yup, that's exactly what server side calls to commit_metadata() are
> supposed to address.
> 
> I suspect to be correct, this might require commit_metadata() to be
> called on both the source and destination inodes, as both of them
> may have modified metadata as a result of the clone operation. For
> XFS one of them will be a no-op, but for other filesystems that
> don't implement ->commit_metadata, we'll need to call
> sync_inode_metadata() on both inodes...
> 

That's interesting. I hadn't considered that a clone might cause the
source metadata to change as well. What kind of change specifically are
we talking about? Is it just delayed block allocation, or is there
more?

Thanks
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx