Re: Question about clone_range() metadata stability

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 2 Dec 2019 09:09:26 -0800

On Mon, Dec 02, 2019 at 08:05:19AM +1100, Dave Chinner wrote:
> On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > Hi all
> > > 
> > > A quick question about clone_range() and guarantees around metadata
> > > stability.
> > > 
> > > Are users required to call fsync/fsync_range() after calling
> > > clone_range() in order to guarantee that the cloned range metadata is
> > > persisted?
> > 
> > Yes.
> > 
> > > I'm assuming that it is required in order to guarantee that
> > > data is persisted.
> > 
> > Data and metadata.  XFS and ocfs2's reflink implementations will flush
> > the page cache before starting the remap, but they both require fsync to
> > force the log/journal to disk.
> 
> So we need to call xfs_fs_nfs_commit_metadata() to get that done
> post vfs_clone_file_range() completion on the server side, yes?

That sounds like a much better/less hastily researched answer! :)

> 
> > 
> > (AFAICT the same reasoning applies to btrfs, but don't trust my word for
> > it.)
> > 
> > > I'm asking because knfsd currently just does a call to
> > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It does
> > > not call fsync()/fsync_range() on the destination file, and since the
> > > NFSv4.2 protocol does not require you to perform any other operation in
> > > order to persist data/metadata, I'm worried that we may be corrupting
> > > the cloned file if the NFS server crashes at the wrong moment after the
> > > client has been told the clone completed.
> 
> Yup, that's exactly what server side calls to commit_metadata() are
> supposed to address.
> 
> I suspect to be correct, this might require commit_metadata() to be
> called on both the source and destination inodes, as both of them
> may have modified metadata as a result of the clone operation. For
> XFS one of them will be a no-op,

Hmm.  If xfs had to set its reflink flag on the source inode then we
want to ->commit_metadata the source inode to push the log forward far
enough to record the metadata change.  That said, we set the reflink
flag on both inodes before we remap anything, so chances are that
->commit_metadata on the dest inode will be enough to push the log
forward.

I suspect that from NFS' point of view it probably ought to
->commit_metadata both inodes to insulate itself from fs-specific
behaviors and avoid weird crash dataloss bugs.  Someday, someone will
design a filesystem with per-inode logs /and/ hook it up to NFS.

> but for other filesystems that
> don't implement ->commit_metadata, we'll need to call
> sync_inode_metadata() on both inodes...

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx