On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> Hi all,
>
> Darrick and I have been working on designing a new ioctl FICLONERANGE2. The
> following text attempts to explain our needs and reasoning behind this
> decision.
>
>
> Contents
> --------
> 1. Problem Statement
> 2. Proof of Concept
> 3. Proposed Solution
> 4. User Interface
> 5. Testing Plan
>
>
> 1. Problem Statement
> --------------------
>
> One of our VM cluster management products needs to snapshot KVM image files
> so that they can be restored in case of failure. Snapshotting is done by
> redirecting VM disk writes to a sidecar file and using reflink on the disk
> image, specifically the FICLONE ioctl as used by "cp --reflink". Reflink
> locks the source and destination files while it operates, which means that
> reads from the main vm disk image are blocked, causing the vm to stall. When
> an image file is heavily fragmented, the copy process could take several
> minutes. Some of the vm image files have 50-100 million extent records, and
> duplicating that much metadata locks the file for 30 minutes or more. Having
> activities suspended for such a long time in a cluster node could result in
> node eviction. A node eviction occurs when the cluster manager determines
> that the vm is unresponsive. One of the criteria for determining that a VM
> is unresponsive is the failure of filesystems in the guest to respond for an
> unacceptably long time. In order to solve this problem, we need to provide a
> variant of FICLONE that releases the file locks periodically to allow reads
> to occur as vmbackup runs. The purpose of this feature is to allow vmbackup
> to run without causing downtime.

Interesting problem to have - let me see if I understand it properly.

Writes are redirected away from the file being cloned, but reads go
directly to the source file being cloned?
But cloning can take a long time, so breaking the clone operation up
into multiple discrete ranges allows reads through to the file being
cloned with minimal latency. However, you don't want writes to the
source file, because that would violate the atomicity of the clone
operation and corrupt the snapshot. Hence the redirected writes ensure
that the file being cloned does not change from syscall to syscall.
This means the interrupted clone operation can restart from where it
left off and you still get a consistent image clone for the snapshot.

Did I get that right?

If so, I'm wondering about the general usefulness of this multi-syscall
construct - having to ensure that the file isn't written to between
syscalls is quite the constraint. I wonder if we can do better than
that and not need a new syscall; shared read + clone seems more like an
inode extent list access serialisation problem than anything else...

<thinks for a bit>

Ok. A clone does not change any data in the source file. Neither do
read IO operations. Hence, from a data integrity perspective, there's
no reason why read IO and FICLONE can't run concurrently on the source
file. Writes we still need to block so that the clone is an atomic
point-in-time image of the file, but reads could be allowed.

The XFS clone implementation takes the IOLOCK_EXCL high up, and then
lower down it iterates one extent at a time doing the sharing
operation. It holds the ILOCK_EXCL while it is modifying the extent in
both the source and destination files, then commits the transaction
and drops the ILOCKs.

OK, so we have fine-grained ILOCK serialisation during the clone for
access/modification to the extent list. Excellent, I think we can make
this work. So:

1. Take IOLOCK_EXCL like we already do on the source and destination
   files.

2. Once all the pre-work is done, set a "clone in progress" flag on
   the in-memory source inode.

3. Atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.

4. Read IO and the clone serialise access to the extent list via the
   ILOCK. We know this works fine, because that's how the extent list
   access serialisation for concurrent read and write direct IO works.

5. Buffered writes take the IOLOCK_EXCL, so they block until the clone
   completes. Same behaviour as right now, all good.

6. Direct IO writes need to be modified to check the "clone in
   progress" flag after taking the IOLOCK_SHARED. If it is set, we
   have to drop the IOLOCK_SHARED and take it IOLOCK_EXCL. This will
   block until the clone completes.

7. When the clone completes, we clear the "clone in progress" flag and
   drop all the IOLOCKs that are held.

AFAICT, this will give us shared clone vs read and exclusive clone vs
write IO semantics for all clone operations. And if I've understood
the problem statement correctly, this will avoid the read IO latency
problems that long-running clone operations cause, without needing a
new syscall.

Thoughts?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx