[RFC] xfs: reduce sub-block DIO serialisation

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 12 Jan 2021 12:07:40 +1100

Hi folks,

This is the XFS implementation of the sub-block DIO optimisations
for written extents that I've mentioned on #xfs and a couple of
times now on the XFS mailing list.

It takes the approach of using the IOMAP_NOWAIT non-blocking
IO submission infrastructure to optimistically dispatch sub-block
DIO without exclusive locking. If the extent mapping callback
decides that it can't do the unaligned IO without extent
manipulation, sub-block zeroing, blocking or splitting the IO into
multiple parts, it aborts the IO with -EAGAIN. This allows the high
level filesystem code to then take exclusive locks and resubmit the
IO once it has guaranteed no other IO is in progress on the inode
(which is what the current implementation always does).
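
As a rough illustration of the control flow described above, here is a
userspace sketch, not the actual patch. All of the mock_* names are
hypothetical stand-ins: the real mapping callback looks at the extent
map under the ILOCK, whereas this mock just checks two booleans.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what the mapping callback would find. */
struct mock_extent {
	bool	written;	/* range is already allocated and written */
	bool	single_mapping;	/* IO is covered by one extent mapping */
};

enum mock_lock_mode {
	MOCK_LOCK_SHARED,
	MOCK_LOCK_EXCL,
};

/*
 * Mock mapping callback. In nowait mode it refuses any IO that would
 * need extent manipulation or splitting and returns -EAGAIN instead.
 */
static int mock_map_extent(const struct mock_extent *ext, bool nowait)
{
	if (nowait && (!ext->written || !ext->single_mapping))
		return -EAGAIN;
	return 0;
}

/*
 * Try the unaligned IO optimistically under a shared lock first. Only
 * if the mapping callback bails out with -EAGAIN do we fall back to
 * the exclusive, fully serialised path.
 */
static int mock_submit_unaligned(const struct mock_extent *ext,
				 enum mock_lock_mode *used)
{
	int	ret;

	*used = MOCK_LOCK_SHARED;
	ret = mock_map_extent(ext, true);
	if (ret != -EAGAIN)
		return ret;

	/* Slow path: exclusive lock, blocking semantics allowed. */
	*used = MOCK_LOCK_EXCL;
	return mock_map_extent(ext, false);
}
```

An IO into a written extent covered by a single mapping completes with
MOCK_LOCK_SHARED; anything else ends up retried with MOCK_LOCK_EXCL.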

This requires moving the IOMAP_NOWAIT setup decisions up into the
filesystem, adding yet another parameter to iomap_dio_rw(). So first
I convert iomap_dio_rw() to take an args structure so that we don't
have to modify the API every time we want to add another setup
parameter to the DIO submission code.
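
To show why an args structure helps here, a minimal userspace sketch
follows. The structure layout, the field names and MOCK_NOBLOCK_LIMIT
are all made up for illustration and are not the layout used in the
series. The point is only that adding a field does not touch existing
callers, because designated initialisers zero anything left unset.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit: larger IOs are treated as "would block". */
#define MOCK_NOBLOCK_LIMIT	4096

/* Hypothetical argument block for a DIO submission call. */
struct mock_dio_args {
	const void	*buf;
	size_t		len;
	long long	pos;
	bool		wait_for_completion;
	bool		nowait;	/* added later; defaults to false */
};

/* Mock submission function taking one pointer instead of N parameters. */
static long mock_dio_rw(const struct mock_dio_args *args)
{
	if (!args->buf || args->len == 0)
		return -EINVAL;

	/* Non-blocking callers get -EAGAIN rather than waiting. */
	if (args->nowait && args->len > MOCK_NOBLOCK_LIMIT)
		return -EAGAIN;

	return (long)args->len;
}
```

A caller written before the nowait field existed, e.g.
"struct mock_dio_args a = { .buf = b, .len = 8192 };", still compiles
and keeps its old blocking behaviour.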

I then include Christoph's IOCB_NOWAIT fixes and cleanups to the XFS
code, because they needed to be done regardless of the unaligned DIO
issues and they make the changes simpler. Then I split the unaligned
DIO path out from the aligned path, because all the extra complexity
to support better unaligned DIO submission concurrency is not
necessary for the block aligned path. Finally, I modify the
unaligned IO path to first submit the unaligned IO using
non-blocking semantics and provide a fallback to run the IO
exclusively if that fails.
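
The aligned/unaligned split comes down to a mask test on the file
offset and length. A minimal sketch, assuming a power-of-two block
size; the mock_* names are hypothetical and this is not the code from
the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum mock_dio_path {
	MOCK_DIO_ALIGNED,	/* simple path, shared locking */
	MOCK_DIO_UNALIGNED,	/* optimistic nowait + exclusive fallback */
};

/*
 * An IO is block aligned only if both its start offset and its length
 * are multiples of the block size. block_size must be a power of two.
 */
static bool mock_is_block_aligned(uint64_t pos, uint64_t len,
				  uint32_t block_size)
{
	return ((pos | len) & ((uint64_t)block_size - 1)) == 0;
}

/* Pick the submission path from the alignment of the IO. */
static enum mock_dio_path mock_choose_path(uint64_t pos, uint64_t len,
					   uint32_t block_size)
{
	if (mock_is_block_aligned(pos, len, block_size))
		return MOCK_DIO_ALIGNED;
	return MOCK_DIO_UNALIGNED;
}
```

With a 4096 byte block size, a 512 byte write at offset 512 takes the
unaligned path, while an 8192 byte write at offset 4096 takes the
aligned one.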

This means that we consider sub-block DIO into written extents a fast path
that should almost always succeed with minimal overhead and we put
all the overhead of failure into the slow path where exclusive
locking is required. Unlike Christoph's proposed patch, this means
we don't require an extra ILOCK cycle in the sub-block DIO setup
fast path, so it should perform almost identically to the block
aligned fast path.

Tested using fio with AIO+DIO randrw to a written file. Performance
increases from about 20k IOPS to 150k IOPS, which is the limit of
the setup I was using for testing. Also passed fstests auto group
on both v4 and v5 XFS filesystems.

Thoughts, comments?

-Dave.