Re: XFS reflink overhead, ioctl(FICLONE)

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 15 Dec 2022 11:19:44 +1100

On Tue, Dec 13, 2022 at 08:47:03PM -0800, Suyash Mahar wrote:
> Hi Darrick,
> 
> Thank you for the response. I have replied inline.
> 
> -Suyash
> 
> On Tue, Dec 13, 2022 at 09:18, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> >
> > [ugh, your email never made it to the list.  I bet the email security
> > standards have been tightened again.  <insert rant about dkim and dmarc
> > silent failures here>] :(
> >
> > On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote:
> > > Hi all!
> > >
> > > While using XFS's ioctl(FICLONE), we found that XFS seems to have
> > > poor performance (ioctl takes milliseconds for sparse files) and the
> > > overhead increases with every call.
> > >
> > > For the demo, we are using an Optane DC-PMM configured as a
> > > block device (fsdax) and running XFS (Linux v5.18.13).
> >
> > How are you using fsdax and reflink on a 5.18 kernel?  That combination
> > of features wasn't supported until 6.0, and the data corruption problems
> > won't get fixed until a pull request that's about to happen for 6.2.
> 
> We did not enable the dax option. The optane DIMMs are configured to
> appear as a block device.
> 
> $ mount | grep xfs
> /dev/pmem0p4 on /mnt/pmem0p4 type xfs
> (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> 
> Regardless of the block device (the plot includes results for optane
> and RamFS), it seems like the ioctl(FICLONE) call is slow.

Please define "slow" - is it actually slower than it should be
(i.e. a bug) or does it simply not perform according to your
expectations?

Here are a few things you can quantify to answer these questions:

1. What is the actual rate it is cloning extents at, i.e. extent count
/ clone time?  Is this rate consistent/sustained, or does it drop
substantially over time and/or as the extent count increases?

2. How does clone speed of a given file compare to the actual data
copy speed of that file (please include fsync time in the data
copy results)? Is cloning faster or slower than copying
the data? What is the extent count of the file at the cross-over
point where cloning goes from being faster to slower than copying
the data?

3. How does it compare with btrfs running the same write/clone
workload? Does btrfs run faster? Does it perform better with
high extent counts than XFS? What about with high sharing counts
(e.g. after 500 or 1000 clones of the source file)?
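
The questions above reduce to a few numbers per run. A minimal sketch
of how they might be computed follows; the helper names
(extents_per_second, timed_copy_with_fsync, clone_is_faster) are
hypothetical and not part of any XFS tool, and the copy path is a
plain buffered copy chosen for illustration only.

```python
import os
import shutil
import time


def extents_per_second(extent_count, clone_seconds):
    """Clone rate as defined above: extent count / clone time."""
    if clone_seconds <= 0:
        raise ValueError("clone_seconds must be positive")
    return extent_count / clone_seconds


def timed_copy_with_fsync(src_path, dst_path):
    """Copy the data and fsync the copy; return elapsed seconds.

    The fsync is inside the timed region so the result includes the
    cost of getting the data to stable storage.
    """
    start = time.perf_counter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())
    return time.perf_counter() - start


def clone_is_faster(clone_seconds, copy_seconds):
    """True while cloning still beats copying; the cross-over point is
    the extent count at which this first becomes False."""
    return clone_seconds < copy_seconds
```

Recording extents_per_second() for every iteration shows whether the
rate is sustained or falling as the extent count grows.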

Basically, I'm trying to understand what "slow" means in the context
of the operations you are performing.  I haven't seen any recent
performance regressions in clone speed on XFS, so I'm trying to
understand what you are seeing and why you think it is slower than
it should be.

> > > We create a 1 GiB dense file, then repeatedly modify a tiny random
> > > fraction of it and make a clone via ioctl(FICLONE).
> >
> > Yay, random cow writes, that will slowly increase the number of space
> > mapping records in the file metadata.

Yup, the scripts I use do exactly this - 10,000 random 4kB writes to
an 8GB file between reflink clones. I then iterate a few thousand
times and measure the reflink time.
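
A rough sketch of that kind of loop, for anyone who wants to reproduce
it. This is not the actual script referred to above; the function
names, the clone file naming and the error handling are assumptions.
The FICLONE value is the _IOW(0x94, 9, int) definition from
<linux/fs.h>.

```python
import errno
import fcntl
import os
import random
import time

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>
BLOCK = 4096          # 4kB writes, as in the workload described above


def random_block_offsets(file_size, count, rng):
    """Pick `count` block-aligned offsets inside a file of `file_size` bytes."""
    nblocks = file_size // BLOCK  # file must be at least one block long
    return [rng.randrange(nblocks) * BLOCK for _ in range(count)]


def dirty_random_blocks(fd, file_size, count, rng):
    """Overwrite `count` random 4kB blocks, forcing CoW on shared extents."""
    payload = os.urandom(BLOCK)
    for offset in random_block_offsets(file_size, count, rng):
        os.pwrite(fd, payload, offset)


def timed_clone(src_fd, dst_path):
    """Reflink src_fd into dst_path; return seconds, or None if the
    filesystem does not support FICLONE."""
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        fcntl.ioctl(dst_fd, FICLONE, src_fd)
        return time.perf_counter() - start
    except OSError as exc:
        if exc.errno in (errno.EOPNOTSUPP, errno.EINVAL,
                         errno.ENOTTY, errno.EXDEV):
            return None
        raise
    finally:
        os.close(dst_fd)


def run(src_path, clone_dir, iterations, writes_per_iter, seed=0):
    """Alternate random writes and clones; return the clone times."""
    rng = random.Random(seed)
    times = []
    fd = os.open(src_path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        for i in range(iterations):
            dirty_random_blocks(fd, size, writes_per_iter, rng)
            elapsed = timed_clone(fd, os.path.join(clone_dir, f"clone.{i}"))
            if elapsed is None:  # no reflink support here, stop early
                break
            times.append(elapsed)
    finally:
        os.close(fd)
    return times
```

With an 8GB source file, writes_per_iter=10000 and a few thousand
iterations this approximates the workload described above; every clone
is kept, so the sharing count grows with each iteration.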

> > > The time required for the ioctl() calls increases from large to insane
> > > over the course of ~250 iterations: From roughly a millisecond for the
> > > first iteration or two (which seems high, given that this is on
> > > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20
> > > milliseconds (which seems crazy).
> >
> > Does the system call runtime increase with O(number_extents)?  You might
> > record the number of extents in the file you're cloning by running this
> > periodically:
> >
> > xfs_io -c stat $path | grep fsxattr.nextents
> 
> The extent count does increase linearly (just like the ioctl() call latency).

As expected. Changing the sharing state of a single extent has a
roughly constant overhead regardless of the number of extents in the
file. Hence clone time should scale linearly with the number of
extents that need to have their shared state modified.
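
One way to check this against measurements is to fit
clone_time = intercept + slope * extents to the recorded samples: the
slope is then the per-extent cost and the intercept the fixed per-call
cost. A minimal sketch using an ordinary least-squares fit; fit_line
is a hypothetical helper name, and the choice of least squares is an
assumption rather than anything prescribed here.

```python
def fit_line(extents, seconds):
    """Least-squares fit of seconds = intercept + slope * extents.

    Returns (slope, intercept). The slope estimates the per-extent
    cost of changing sharing state, the intercept the fixed overhead
    of each clone call.
    """
    n = len(extents)
    if n < 2 or n != len(seconds):
        raise ValueError("need at least two paired samples")
    mean_x = sum(extents) / n
    mean_y = sum(seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in extents)
    if sxx == 0:
        raise ValueError("extent counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(extents, seconds))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

If the samples sit close to the fitted line, the behaviour is the
expected linear scaling; a slope that keeps growing between early and
late iterations would be the thing worth reporting.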

> I used the xfs_bmap tool, let me know if this is not the right way. If
> it is not, I'll update the microbenchmark to run xfs_io.

xfs_bmap is the slow way - it has to iterate over every extent and
format them out to userspace. The above mechanism just does a single
syscall to query the count of extents from the inode. Using the
fsxattr extent count query is much faster, especially when you have
files with tens of millions of extents in them....
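
If the microbenchmark would rather make that query itself than shell
out to xfs_io, a sketch along these lines should work. It assumes the
count comes from the FS_IOC_FSGETXATTR ioctl and the struct fsxattr
layout in <linux/fs.h>; parse_fsxattr and extent_count are
hypothetical helper names. On filesystems other than XFS the nextents
field may simply read back as zero.

```python
import fcntl
import os
import struct

FS_IOC_FSGETXATTR = 0x801C581F  # _IOR('X', 31, struct fsxattr)

# struct fsxattr: fsx_xflags, fsx_extsize, fsx_nextents, fsx_projid,
# fsx_cowextsize (five __u32 fields) followed by 8 bytes of padding.
FSXATTR_FORMAT = "=5I8x"


def parse_fsxattr(raw):
    """Decode a raw struct fsxattr buffer into a dict of its fields."""
    xflags, extsize, nextents, projid, cowextsize = struct.unpack(
        FSXATTR_FORMAT, raw)
    return {
        "xflags": xflags,
        "extsize": extsize,
        "nextents": nextents,
        "projid": projid,
        "cowextsize": cowextsize,
    }


def extent_count(path):
    """Return the extent count reported for `path` with one ioctl call."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(struct.calcsize(FSXATTR_FORMAT))
        # The kernel fills the mutable buffer in place.
        fcntl.ioctl(fd, FS_IOC_FSGETXATTR, buf)
        return parse_fsxattr(bytes(buf))["nextents"]
    finally:
        os.close(fd)
```

Calling extent_count() on the source file before each clone gives the
x-axis value for the clone-time plot without walking the extent list.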

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx