On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> Hi all,
>
> Darrick and I have been working on designing a new ioctl FICLONERANGE2. The
> following text attempts to explain our needs and reasoning behind this
> decision.
>
>
> Contents
> --------
> 1. Problem Statement
> 2. Proof of Concept
> 3. Proposed Solution
> 4. User Interface
> 5. Testing Plan
>
>
> 1. Problem Statement
> --------------------
>
> One of our VM cluster management products needs to snapshot KVM image files
> so that they can be restored in case of failure. Snapshotting is done by
> redirecting VM disk writes to a sidecar file and using reflink on the disk
> image, specifically the FICLONE ioctl as used by "cp --reflink". Reflink
> locks the source and destination files while it operates, which means that
> reads from the main vm disk image are blocked, causing the vm to stall. When
> an image file is heavily fragmented, the copy process could take several
> minutes. Some of the vm image files have 50-100 million extent records, and
> duplicating that much metadata locks the file for 30 minutes or more. Having
> activities suspended for such a long time in a cluster node could result in
> node eviction. A node eviction occurs when the cluster manager determines
> that the vm is unresponsive. One of the criteria for determining that a VM
> is unresponsive is the failure of filesystems in the guest to respond for an
> unacceptably long time. In order to solve this problem, we need to provide a
> variant of FICLONE that releases the file locks periodically to allow reads
> to occur as vmbackup runs. The purpose of this feature is to allow vmbackup
> to run without causing downtime.

Interesting problem to have - let me see if I understand it properly.

Writes are redirected away from the file being cloned, but reads go
directly to the source file being cloned?
But cloning can take a long time, so breaking the clone operation up
into multiple discrete ranges allows reads through to the file being
cloned with minimal latency. However, you don't want writes to the
source file, because that would violate the atomicity of the clone
operation and corrupt the snapshot. Hence the redirected writes ensure
that the file being cloned does not change from syscall to syscall.
This means the interrupted clone operation can restart from where it
left off and you still get a consistent image clone for the snapshot.

Did I get that right?

If so, I'm wondering about the general usefulness of this multi-syscall
construct - having to ensure that the file isn't written to between
syscalls is quite the constraint. I wonder if we can do better than
that and not need a new syscall; shared read + clone seems more like an
inode extent list access serialisation problem than anything else...

<thinks for a bit>

Ok. A clone does not change any data in the source file. Neither do
read IO operations. Hence, from a data integrity perspective, there's
no reason why read IO and FICLONE can't run concurrently on the source
file. Writes we still need to block so that the clone is an atomic
point-in-time image of the file, but reads could be allowed.

The XFS clone implementation takes the IOLOCK_EXCL high up, and then
lower down it iterates one extent at a time doing the sharing
operation. It holds the ILOCK_EXCL while it is modifying the extent in
both the source and destination files, then commits the transaction
and drops the ILOCKs.

OK, so we have fine-grained ILOCK serialisation during the clone for
access/modification to the extent list. Excellent, I think we can make
this work. So:

1. Take IOLOCK_EXCL like we already do on the source and destination
   files.

2. Once all the pre-work is done, set a "clone in progress" flag on
   the in-memory source inode.

3. Atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.

4. Read IO and the clone serialise access to the extent list via the
   ILOCK. We know this works fine, because that's how the extent list
   access serialisation for concurrent read and write direct IO works.

5. Buffered writes take the IOLOCK_EXCL, so they block until the clone
   completes. Same behaviour as right now, all good.

6. Direct IO writes need to be modified to check the "clone in
   progress" flag after taking the IOLOCK_SHARED. If it is set, we
   have to drop the IOLOCK_SHARED and take it IOLOCK_EXCL. This will
   block until the clone completes.

7. When the clone completes, we clear the "clone in progress" flag and
   drop all the IOLOCKs that are held.

AFAICT, this will give us shared clone vs read and exclusive clone vs
write IO semantics for all clone operations. And if I've understood
the problem statement correctly, this will avoid the read IO latency
problems that long-running clone operations cause, without needing a
new syscall.

Thoughts?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx