Re: Question about clone_range() metadata stability

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Tue, 3 Dec 2019 08:35:26 -0800

On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > > Hi all
> > > > 
> > > > A quick question about clone_range() and guarantees around metadata
> > > > stability.
> > > > 
> > > > Are users required to call fsync/fsync_range() after calling
> > > > clone_range() in order to guarantee that the cloned range metadata
> > > > is persisted?
> > > 
> > > Yes.
> > > 
> > > > I'm assuming that it is required in order to guarantee that
> > > > data is persisted.
> > > 
> > > Data and metadata.  XFS and ocfs2's reflink implementations will flush
> > > the page cache before starting the remap, but they both require fsync
> > > to force the log/journal to disk.
> > 
> > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > post vfs_clone_file_range() completion on the server side, yes?
> > 
> 
> I chose to implement this using a full call to vfs_fsync_range(), since
> we really do want to ensure data stability as well. Consider, for
> instance, the case where client A is running an application, and client
> B runs vfs_clone_file_range() in order to create a point in time
> snapshot of the file for disaster recovery purposes...

Seems reasonable, since (alas) we didn't define the ->remap_range API to
guarantee persistence for you.

> > > (AFAICT the same reasoning applies to btrfs, but don't trust my word
> > > for it.)
> > > 
> > > > I'm asking because knfsd currently just does a call to
> > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It
> > > > does not call fsync()/fsync_range() on the destination file, and
> > > > since the NFSv4.2 protocol does not require you to perform any
> > > > other operation in order to persist data/metadata, I'm worried that
> > > > we may be corrupting the cloned file if the NFS server crashes at
> > > > the wrong moment after the client has been told the clone completed.
> > 
> > Yup, that's exactly what server side calls to commit_metadata() are
> > supposed to address.
> > 
> > I suspect to be correct, this might require commit_metadata() to be
> > called on both the source and destination inodes, as both of them
> > may have modified metadata as a result of the clone operation. For
> > XFS one of them will be a no-op, but for other filesystems that
> > don't implement ->commit_metadata, we'll need to call
> > sync_inode_metadata() on both inodes...
> > 
> 
> That's interesting. I hadn't considered that a clone might cause the
> source metadata to change as well. What kind of change specifically are
> we talking about? Is it just delayed block allocation, or is there
> more?

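In XFS' case, we added a per-inode flag to help us bypass the reference
count lookup during a write if the file has never shared any blocks, so
if you never share anything, you'll never pay any of the runtime costs
of the COW mechanism.  The first clone has to set that flag on the source
inode as well as the destination, and that is a metadata update to the
source.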
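
To make the effect of that flag concrete, here is a toy userspace model
in C.  It is only a sketch: struct toy_inode, toy_write() and toy_clone()
are made-up names for illustration and do not correspond to the real XFS
structures or functions.  It assumes that a clone marks both files as
possibly sharing blocks, which is what dirties the source's metadata.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, not the real XFS data structures. */
struct toy_inode {
    bool may_share;             /* models the per-inode "has shared blocks" flag */
    bool metadata_dirty;        /* set when the inode needs to be made durable */
    unsigned refcount_lookups;  /* counts how often the write path paid for a lookup */
};

/* Write path: the refcount lookup is skipped unless the flag is set. */
void toy_write(struct toy_inode *ip)
{
    if (ip->may_share)
        ip->refcount_lookups++;  /* would need a COW check here */
}

/* Clone: both files may now share blocks, so both get flagged and dirtied. */
void toy_clone(struct toy_inode *src, struct toy_inode *dst)
{
    if (!src->may_share) {
        src->may_share = true;
        src->metadata_dirty = true;  /* the source changed, not just the destination */
    }
    dst->may_share = true;
    dst->metadata_dirty = true;
}
```

In this model a file that was never cloned never increments
refcount_lookups, and the first toy_clone() leaves metadata_dirty set on
the source as well as the destination.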

ocfs2's design has a reference count tree that is shared between groups
of files that have been reflinked from each other.  So if you start with
unshared files A and B and clone A to A1 and A2; and B to B1 and B2,
then A, A1 and A2 will share one refcount tree, and B, B1 and B2 will
share another.  Attaching a previously unshared file to a refcount tree
on its first clone is likewise a change to the source file's metadata.

Either way, NFS has to assume that changes could have been made to the
source file.
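
For comparison, the same ordering seen from userspace might look like
the sketch below: clone with the FICLONERANGE ioctl, then fsync both
files.  clone_and_persist() is a hypothetical helper name, not an
existing API, and this is not the knfsd change itself; it only
illustrates "remap first, then flush destination and source".

```c
#include <errno.h>
#include <linux/fs.h>     /* FICLONERANGE, struct file_clone_range */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: clone len bytes from src_fd at src_off into dst_fd
 * at dst_off, then fsync both files so the remap is durable.
 * Returns 0 on success, -1 with errno set on failure.
 */
int clone_and_persist(int src_fd, int dst_fd, uint64_t src_off,
                      uint64_t len, uint64_t dst_off)
{
    struct file_clone_range fcr = {
        .src_fd = src_fd,
        .src_offset = src_off,
        .src_length = len,
        .dest_offset = dst_off,
    };

    /* The ioctl is issued on the destination file descriptor. */
    if (ioctl(dst_fd, FICLONERANGE, &fcr) < 0)
        return -1;

    /* The clone alone does not promise persistence; flush the destination. */
    if (fsync(dst_fd) < 0)
        return -1;

    /* The source's metadata may have changed too, so flush it as well. */
    if (fsync(src_fd) < 0)
        return -1;

    return 0;
}
```

On a filesystem without reflink support the ioctl fails and the helper
returns -1 with errno left as set by the kernel, so a caller can fall
back to an ordinary copy.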

--D

> Thanks
>   Trond
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx