Re: [PATCH 0/1] iomap regression for aio dio 4k writes

On Wed, 2023-06-21 at 10:29 -0700, Jeremy Bongio wrote:
> Hi Darrick and Allison,
> 
> There has been a standing performance regression involving AIO DIO
> 4k-aligned writes on ext4 backed by a fast local SSD since the switch
> to iomap. I think it was originally reported and investigated in this
> thread:
> https://lore.kernel.org/all/87lf7rkffv.fsf@xxxxxxxxxxxxx/
> 
> Short version:
> Pre-iomap, for ext4 async direct writes, after the bio is written to
> disk the completion function is called directly during the endio
> stage.
> 
> Post-iomap, for direct writes, after the bio is written to disk, the
> completion function is deferred to a work queue. This adds latency
> that impacts performance most noticeably on very fast SSDs.
> 
> Detailed version:
> A possible explanation is below, followed by a few questions to
> figure out the right way to fix it.
> 
> In 4.15, ext4 uses fs/direct-io.c. When an AIO DIO write has completed
> in the nvme driver, the interrupt handler for the write request calls
> bio_endio(), which ends up calling dio_bio_end_aio(). A different
> end_io function is used for async and sync io. If there are no pages
> mapped in memory for the write operation's inode, then the completion
> function for ext4 is called directly. If there are mapped pages, they
> might be dirty and need to be updated, so the work is deferred to a
> work queue.
> 
> Here is the relevant 4.15 code:
> 
> fs/direct-io.c: dio_bio_end_aio()
> if (dio->result)
>         defer_completion = dio->defer_completion ||
>                            (dio->op == REQ_OP_WRITE &&
>                             dio->inode->i_mapping->nrpages);
> if (defer_completion) {
>         INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>         queue_work(dio->inode->i_sb->s_dio_done_wq,
>                    &dio->complete_work);
> } else {
>         dio_complete(dio, 0, DIO_COMPLETE_ASYNC);
> }
> 
> After ext4 switched to using iomap, the endio function became
> iomap_dio_bio_end_io() in fs/iomap/direct-io.c. In iomap, the same
> end_io function is used for both async and sync io. All write requests
> will defer io completion to a work queue, even if there are no mapped
> pages for the inode.
> 
> Here is the relevant code in the latest kernel (post-iomap):
> 
> fs/iomap/direct-io.c: iomap_dio_bio_end_io()
> if (dio->wait_for_completion) {
>         struct task_struct *waiter = dio->submit.waiter;
>         WRITE_ONCE(dio->submit.waiter, NULL);
>         blk_wake_io_task(waiter);
> } else if (dio->flags & IOMAP_DIO_WRITE) {
>         struct inode *inode = file_inode(dio->iocb->ki_filp);
> 
>         WRITE_ONCE(dio->iocb->private, NULL);
>         INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>         queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> } else {
>         WRITE_ONCE(dio->iocb->private, NULL);
>         iomap_dio_complete_work(&dio->aio.work);
> }
> 
> With the attached patch, I see significantly better performance in
> 5.10 than in 4.15. 5.10 is the latest kernel for which I have driver
> support for an SSD fast enough to reproduce the regression. I
> verified that upstream iomap works the same way.
> 
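> The patch itself is sent separately in this series; roughly, the idea
> is to reuse the nrpages check from fs/direct-io.c so that write
> completion is only deferred when the inode actually has mapped pages.
> A sketch of that shape in iomap_dio_bio_end_io() (the actual patch may
> differ in detail):
> 
> } else if (dio->flags & IOMAP_DIO_WRITE) {
>         struct inode *inode = file_inode(dio->iocb->ki_filp);
> 
>         WRITE_ONCE(dio->iocb->private, NULL);
>         if (inode->i_mapping->nrpages) {
>                 /*
>                  * Pages are mapped for this inode: they may need to
>                  * be invalidated, so keep deferring completion to the
>                  * dio workqueue.
>                  */
>                 INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>                 queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
>         } else {
>                 /* No mapped pages: complete inline, as 4.15 did. */
>                 iomap_dio_complete_work(&dio->aio.work);
>         }
> }
> 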
> Test results use the reproduction script from the original report,
> testing write-only with 4k/8k/12k/16k block sizes:
> https://people.collabora.com/~krisman/dio/week21/bench.sh
> 
> 
> fio benchmark command:
> fio --ioengine libaio --size=2G --direct=1 --filename=${MNT}/file \
>     --iodepth=64 --time_based=1 --thread=1 --overwrite=1 --bs=${BS} \
>     --rw=$RW --name "`uname -r`-${TYPE}-${RW}-${BS}-${FS}" \
>     --runtime=100 --output-format=terse >> ${LOG}
> 
> For 4.15, with all write completions called in io handler:
> 4k:  bw=1056MiB/s
> 8k:  bw=2082MiB/s
> 12k: bw=2332MiB/s
> 16k: bw=2453MiB/s
> 
> For unmodified 5.10, with all write completions deferred:
> 4k:  bw=1004MiB/s
> 8k:  bw=2074MiB/s
> 12k: bw=2309MiB/s
> 16k: bw=2465MiB/s
> 
> For modified 5.10, with all write completions called in io handler:
> 4k:  bw=1193MiB/s
> 8k:  bw=2258MiB/s
> 12k: bw=2346MiB/s
> 16k: bw=2446MiB/s
> 
> Questions:
> 
> Why did iomap, from the beginning, not make the async/sync io and
> mapped/unmapped distinction that fs/direct-io.c did?
> 
> Since no issues have been found with ext4 calling the completion work
> directly in the io handler pre-iomap, it is unlikely that doing so is
> unsafe (i.e. that it sleeps within an io handler callback). However,
> this may not be true for all filesystems. Does XFS potentially sleep
> in its completion code?
> 
> Allison, Ted mentioned you saw a similar problem when doing
> performance testing for the latest version of Unbreakable Linux. Can
> you test and verify that this patch addresses your performance
> regression as well?
Hi Jeremy, 

Sure, I can link this patch to the bug report to see if the tester sees
any improvement. If anything, I think it's a good data point to have,
even if we are still considering other solutions. Thanks!

Allison

> 
> Jeremy Bongio (1):
>   For DIO writes with no mapped pages for inode, skip deferring
>     completion.
> 
>  fs/iomap/direct-io.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 




