Re: [PATCH 0/1] iomap regression for aio dio 4k writes

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 22 Jun 2023 09:59:29 +1000

On Wed, Jun 21, 2023 at 10:29:19AM -0700, Jeremy Bongio wrote:
> Hi Darrick and Allison,
> 
> There has been a standing performance regression involving AIO DIO
> 4k-aligned writes on ext4 backed by a fast local SSD since the switch
> to iomap. I think it was originally reported and investigated in this
> thread: https://lore.kernel.org/all/87lf7rkffv.fsf@xxxxxxxxxxxxx/
> 
> Short version:
> Pre-iomap, for ext4 async direct writes, after the bio is written to disk
> the completion function is called directly during the endio stage.
> 
> Post-iomap, for direct writes, after the bio is written to disk, the completion
> function is deferred to a work queue. This adds latency that impacts
> performance most noticeably in very fast SSDs.
> 
> Detailed version:
> A possible explanation is below, followed by a few questions to figure
> out the right way to fix it.
> 
> In 4.15, ext4 uses fs/direct-io.c. When an AIO DIO write has completed
> in the nvme driver, the interrupt handler for the write request ends
> in calling bio_endio() which ends up calling dio_bio_end_aio(). A
> different end_io function is used for async and sync io. If there are
> no pages mapped in memory for the write operation's inode, then the
> completion function for ext4 is called directly. If there are pages
> mapped, then they might be dirty and need to be updated and work
> is deferred to a work queue.
> 
> Here is the relevant 4.15 code:
> 
> fs/direct-io.c: dio_bio_end_aio()
> if (dio->result)
>         defer_completion = dio->defer_completion ||
>                            (dio_op == REQ_OP_WRITE &&
>                            dio->inode->i_mapping->nrpages);
> if (defer_completion) {
>         INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>         queue_work(dio->inode->i_sb->s_dio_done_wq,
>                    &dio->complete_work);
> } else {
>         dio_complete(dio, 0, DIO_COMPLETE_ASYNC);
> }
> 
> After ext4 switched to using iomap, the endio function became
> iomap_dio_bio_end_io() in fs/iomap/direct-io.c. In iomap the same end io
> function is used for both async and sync io.

Yes, because the IO completion processing is the same regardless of
whether the IO is submitted for sync or async completion.

> All write requests will
> defer io completion to a work queue even if there are no mapped pages
> for the inode.

Yup.

Consider O_DSYNC DIO writes: Where are the post-IO completion
integrity operations done?

Consider DIO write IO completion for different filesystems: how does
iomap_dio_complete() know whether dio->dops->end_io() needs to run
in task context or not.

e.g. DIO writes into unwritten extents: where are the written
conversion transactions run?

> With the attached patch, I see significantly better performance in 5.10 than 4.15. 5.10 is the latest kernel where I have driver support for an SSD that is fast enough to reproduce the regression. I verified that upstream iomap works the same.
> 
> Test results using the reproduction script from the original report
> and testing with 4k/8k/12k/16k blocksizes and write-only:
> https://people.collabora.com/~krisman/dio/week21/bench.sh
> 
> fio benchmark command:
> fio --ioengine libaio --size=2G --direct=1 --filename=${MNT}/file --iodepth=64 \
> --time_based=1 --thread=1 --overwrite=1 --bs=${BS} --rw=$RW \

Ah, you are testing pure overwrites, which means for ext4 the only
thing it needs to care about is cached mappings. What happens when
you add O_DSYNC here?

> --name "`uname -r`-${TYPE}-${RW}-${BS}-${FS}" \
> --runtime=100 --output-format=terse >> ${LOG}
> 
> For 4.15, with all write completions called in io handler:
> 4k:  bw=1056MiB/s
> 8k:  bw=2082MiB/s
> 12k: bw=2332MiB/s
> 16k: bw=2453MiB/s
> 
> For unmodified 5.10, with all write completions deferred:
> 4k:  bw=1004MiB/s
> 8k:  bw=2074MiB/s
> 12k: bw=2309MiB/s
> 16k: bw=2465MiB/s

I don't see a regression here - the differences are in the noise of
a typical fio overwrite test.

> For modified 5.10, with all write completions called in io handler:
> 4k:  bw=1193MiB/s
> 8k:  bw=2258MiB/s
> 12k: bw=2346MiB/s
> 16k: bw=2446MiB/s
>
> Questions:
> 
> Why did iomap from the beginning not make the async/sync io and
> mapped/unmapped distinction that fs/direct-io.c did?

Because the iomap code was designed from the ground up as an
extent-based concurrent async IO engine that supported concurrent
reads and writes to the same sparse file with full data integrity
handling guarantees.

The old dio code started off as "sync" DIO only, and it was for a
long time completely broken for async DIO. Correct support for that
was eventually tacked on via DIO_COMPLETE_ASYNC, but it was still
basically an awful, nasty, complex, "block at a time" synchronous
IO engine.

> Since no issues have been found for ext4 calling completion work
> directly in the io handler pre-iomap, it is unlikely that this is
> unsafe (sleeping within an io handler callback). However, this may not
> be true for all filesystems. Does XFS potentially sleep in its
> completion code?

Yes, and ext4 does too. e.g. O_DSYNC overwrites always need to be
deferred to task context to be able to take sleeping locks and
potentially block on journal and or device cache flushes.

i.e. Have you considered what context all of XFS, f2fs, btrfs,
zonefs and gfs2 need for pure DIO overwrite completion in all it's
different variants?

AFAIC, it's far simpler conceptually to defer all writes to
completion context than it is to try to work out what writes need to
be deferred and what doesn't, especially as the filesystem ->end_io
completion might need to sleep and the iomap code has no idea
whether that is possible.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx