On Tue, Oct 17, 2023 at 11:08:14PM -0700, Christoph Hellwig wrote:
> On Tue, Oct 17, 2023 at 01:12:08PM -0700, Catherine Hoang wrote:
> > One of our VM cluster management products needs to snapshot KVM image
> > files so that they can be restored in case of failure. Snapshotting is
> > done by redirecting VM disk writes to a sidecar file and using reflink
> > on the disk image, specifically the FICLONE ioctl as used by
> > "cp --reflink". Reflink locks the source and destination files while it
> > operates, which means that reads from the main VM disk image are
> > blocked, causing the VM to stall. When an image file is heavily
> > fragmented, the copy process could take several minutes. Some of the
> > VM image files have 50-100 million extent records, and duplicating
> > that much metadata locks the file for 30 minutes or more. Having
> > activities suspended for such a long time in a cluster node could
> > result in node eviction.
> >
> > Clone operations and read IO do not change any data in the source
> > file, so they should be able to run concurrently. Demote the exclusive
> > locks taken by FICLONE to shared locks to allow reads while cloning.
> > While a clone is in progress, writes will take the IOLOCK_EXCL, so
> > they block until the clone completes.
>
> Sorry for being pesky, but do you have some rough numbers on how much
> this actually helps with the above workload?

Well... the stupid answer is that I augmented generic/176 to try to
race buffered and direct reads with cloning a million extents and to
print out when the racing reads completed.  On an unpatched kernel,
the reads don't complete until the reflink does:

--- /tmp/fstests/tests/generic/176.out	2023-07-11 12:18:21.617971250 -0700
+++ /var/tmp/fstests/generic/176.out.bad	2023-10-19 10:22:04.771017812 -0700
@@ -2,3 +2,8 @@
 Format and mount
 Create a many-block file
 Reflink the big file
+start reflink Thu Oct 19 10:19:19 PDT 2023
+end reflink Thu Oct 19 10:20:06 PDT 2023
+buffered read ioend Thu Oct 19 10:20:06 PDT 2023
+direct read ioend Thu Oct 19 10:20:06 PDT 2023
+finished waiting Thu Oct 19 10:20:06 PDT 2023

Yowza, a minute's worth of read latency!  On a patched kernel, the
reads complete while the clone is running:

--- /tmp/fstests/tests/generic/176.out	2023-07-11 12:18:21.617971250 -0700
+++ /var/tmp/fstests/generic/176.out.bad	2023-10-19 10:22:25.528685643 -0700
@@ -2,3 +2,552 @@
 Format and mount
 Create a many-block file
 Reflink the big file
+start reflink Thu Oct 19 10:19:24 PDT 2023
+buffered read ioend Thu Oct 19 10:19:24 PDT 2023
+direct read ioend Thu Oct 19 10:19:24 PDT 2023
+buffered read ioend Thu Oct 19 10:19:24 PDT 2023
+direct read ioend Thu Oct 19 10:19:24 PDT 2023
+buffered read ioend Thu Oct 19 10:19:24 PDT 2023
+buffered read ioend Thu Oct 19 10:19:24 PDT 2023
+buffered read ioend Thu Oct 19 10:19:25 PDT 2023
+buffered read ioend Thu Oct 19 10:19:25 PDT 2023
+direct read ioend Thu Oct 19 10:19:25 PDT 2023
...
+buffered read ioend Thu Oct 19 10:20:06 PDT 2023
+buffered read ioend Thu Oct 19 10:20:07 PDT 2023
+buffered read ioend Thu Oct 19 10:20:07 PDT 2023
+direct read ioend Thu Oct 19 10:20:07 PDT 2023
+buffered read ioend Thu Oct 19 10:20:07 PDT 2023
+buffered read ioend Thu Oct 19 10:20:07 PDT 2023
+buffered read ioend Thu Oct 19 10:20:07 PDT 2023
+end reflink Thu Oct 19 10:20:07 PDT 2023
+direct read ioend Thu Oct 19 10:20:07 PDT 2023
+finished waiting Thu Oct 19 10:20:07 PDT 2023

So as you can see, reads from the reflink source file no longer
experience a giant latency spike.
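For anyone reading along who wants to reproduce the shape of that race
without fstests, something like the standalone sketch below would do.
To be clear, this is only an illustration, not the augmented test (the
real thing is an fstest script); the file paths are made up, and both
files have to live on a reflink-capable filesystem such as XFS with
reflink=1:

/*
 * Sketch: race a reader against FICLONE on the clone source.  One
 * thread loops reading the source and reports each completion; the
 * main thread issues the clone.  Paths below are placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>		/* FICLONE */
#include <pthread.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

static volatile int clone_done;	/* plain flag; good enough for a sketch */

static void *read_loop(void *arg)
{
	int fd = *(int *)arg;
	char buf[4096];

	while (!clone_done) {
		/*
		 * On an unpatched kernel this pread stalls for the whole
		 * clone; with the patch it completes concurrently.
		 */
		if (pread(fd, buf, sizeof(buf), 0) < 0)
			perror("pread");
		printf("read ioend %ld\n", (long)time(NULL));
	}
	return NULL;
}

int main(void)
{
	int src = open("/mnt/bigfile", O_RDONLY);
	int dst = open("/mnt/bigfile.clone", O_CREAT | O_WRONLY, 0600);
	pthread_t t;

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	pthread_create(&t, NULL, read_loop, &src);

	printf("start reflink %ld\n", (long)time(NULL));
	if (ioctl(dst, FICLONE, src))
		perror("FICLONE");
	printf("end reflink %ld\n", (long)time(NULL));

	clone_done = 1;
	pthread_join(t, NULL);
	close(src);
	close(dst);
	return 0;
}

Unpatched, the "read ioend" lines all bunch up after "end reflink";
patched, they interleave with the clone, same as the generic/176
output above.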
I also wrote an fstest to check this behavior; I'll attach it as a
separate reply.

> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@xxxxxx>

Thanks!

--D
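P.S. If anyone wants to poke at the locking model without a kernel,
here's a rough userspace analogue, with a pthread rwlock standing in
for the inode I/O lock.  This is a sketch of the concept only, not the
patch: the real code uses XFS's inode locks, and the extra mutex here
is just a stand-in for however concurrent remaps get serialized
against each other.

/*
 * Reads and the clone take the "I/O lock" shared; writes take it
 * exclusive.  (Reader/writer fairness details vary by pthread
 * implementation, but the idea comes across.)
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t iolock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t remap_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *reader(void *arg)
{
	/* Readers take the I/O lock shared, so they run during a clone. */
	pthread_rwlock_rdlock(&iolock);
	puts("read completed while the clone was in progress");
	pthread_rwlock_unlock(&iolock);
	return NULL;
}

static void *writer(void *arg)
{
	/* Writers take the I/O lock exclusive; they wait out the clone. */
	pthread_rwlock_wrlock(&iolock);
	puts("write completed after the clone finished");
	pthread_rwlock_unlock(&iolock);
	return NULL;
}

static void *cloner(void *arg)
{
	/*
	 * Serialize against other clones, then take the I/O lock shared --
	 * the "demotion" from the old exclusive behavior.
	 */
	pthread_mutex_lock(&remap_mutex);
	pthread_rwlock_rdlock(&iolock);
	puts("clone started");
	sleep(2);			/* pretend to copy extent records */
	puts("clone finished");
	pthread_rwlock_unlock(&iolock);
	pthread_mutex_unlock(&remap_mutex);
	return NULL;
}

int main(void)
{
	pthread_t c, r, w;

	pthread_create(&c, NULL, cloner, NULL);
	sleep(1);			/* let the clone get the lock first */
	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(c, NULL);
	pthread_join(r, NULL);
	pthread_join(w, NULL);
	return 0;
}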