On Wed, 2023-06-21 at 10:29 -0700, Jeremy Bongio wrote:
> Hi Darrick and Allison,
> 
> There has been a standing performance regression involving AIO DIO
> 4k-aligned writes on ext4 backed by a fast local SSD since the switch
> to iomap. I think it was originally reported and investigated in this
> thread:
> https://lore.kernel.org/all/87lf7rkffv.fsf@xxxxxxxxxxxxx/
> 
> Short version:
> Pre-iomap, for ext4 async direct writes, after the bio is written to
> disk the completion function is called directly during the endio
> stage.
> 
> Post-iomap, for direct writes, after the bio is written to disk, the
> completion function is deferred to a work queue. This adds latency
> that impacts performance most noticeably on very fast SSDs.
> 
> Detailed version:
> A possible explanation is below, followed by a few questions to
> figure out the right way to fix it.
> 
> In 4.15, ext4 uses fs/direct-io.c. When an AIO DIO write has
> completed in the nvme driver, the interrupt handler for the write
> request ends up calling bio_endio(), which in turn calls
> dio_bio_end_aio(). A different end_io function is used for async and
> sync io. If there are no pages mapped in memory for the write
> operation's inode, then the completion function for ext4 is called
> directly. If there are pages mapped, they might be dirty and need to
> be updated, so the work is deferred to a work queue.
> 
> Here is the relevant 4.15 code:
> 
> fs/direct-io.c: dio_bio_end_aio()
> 
>         if (dio->result)
>                 defer_completion = dio->defer_completion ||
>                                    (dio_op == REQ_OP_WRITE &&
>                                     dio->inode->i_mapping->nrpages);
>         if (defer_completion) {
>                 INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>                 queue_work(dio->inode->i_sb->s_dio_done_wq,
>                            &dio->complete_work);
>         } else {
>                 dio_complete(dio, 0, DIO_COMPLETE_ASYNC);
>         }
> 
> After ext4 switched to using iomap, the endio function became
> iomap_dio_bio_end_io() in fs/iomap/direct-io.c. In iomap the same
> end_io function is used for both async and sync io. All write
> requests will defer io completion to a work queue, even if there are
> no mapped pages for the inode.
> 
> Here is the relevant code in the latest kernel (post-iomap):
> 
> fs/iomap/direct-io.c: iomap_dio_bio_end_io()
> 
>         if (dio->wait_for_completion) {
>                 struct task_struct *waiter = dio->submit.waiter;
> 
>                 WRITE_ONCE(dio->submit.waiter, NULL);
>                 blk_wake_io_task(waiter);
>         } else if (dio->flags & IOMAP_DIO_WRITE) {
>                 struct inode *inode = file_inode(dio->iocb->ki_filp);
> 
>                 WRITE_ONCE(dio->iocb->private, NULL);
>                 INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>                 queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
>         } else {
>                 WRITE_ONCE(dio->iocb->private, NULL);
>                 iomap_dio_complete_work(&dio->aio.work);
>         }
> 
> With the attached patch, I see significantly better performance on
> 5.10 than on 4.15. 5.10 is the latest kernel where I have driver
> support for an SSD fast enough to reproduce the regression. I
> verified that upstream iomap works the same.
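> 
> Conceptually, the change restores the old fs/direct-io.c distinction
> in the iomap write path: only defer completion to the workqueue when
> the inode actually has cached pages. A rough sketch of that idea (not
> the attached patch itself; names are those already used in
> iomap_dio_bio_end_io() above):
> 
>         } else if ((dio->flags & IOMAP_DIO_WRITE) &&
>                    file_inode(dio->iocb->ki_filp)->i_mapping->nrpages) {
>                 struct inode *inode = file_inode(dio->iocb->ki_filp);
> 
>                 /* pages are cached and may need updating: defer */
>                 WRITE_ONCE(dio->iocb->private, NULL);
>                 INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>                 queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
>         } else {
>                 /* reads, and writes with no cached pages: complete inline */
>                 WRITE_ONCE(dio->iocb->private, NULL);
>                 iomap_dio_complete_work(&dio->aio.work);
>         }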
> 
> Test results using the reproduction script from the original report,
> testing with 4k/8k/12k/16k blocksizes and write-only:
> https://people.collabora.com/~krisman/dio/week21/bench.sh
> 
> fio benchmark command:
> 
> fio --ioengine libaio --size=2G --direct=1 --filename=${MNT}/file --iodepth=64 \
>     --time_based=1 --thread=1 --overwrite=1 --bs=${BS} --rw=$RW \
>     --name "`uname -r`-${TYPE}-${RW}-${BS}-${FS}" \
>     --runtime=100 --output-format=terse >> ${LOG}
> 
> For 4.15, with all write completions called in the io handler:
> 4k:  bw=1056MiB/s
> 8k:  bw=2082MiB/s
> 12k: bw=2332MiB/s
> 16k: bw=2453MiB/s
> 
> For unmodified 5.10, with all write completions deferred:
> 4k:  bw=1004MiB/s
> 8k:  bw=2074MiB/s
> 12k: bw=2309MiB/s
> 16k: bw=2465MiB/s
> 
> For modified 5.10, with all write completions called in the io handler:
> 4k:  bw=1193MiB/s
> 8k:  bw=2258MiB/s
> 12k: bw=2346MiB/s
> 16k: bw=2446MiB/s
> 
> Questions:
> 
> Why didn't iomap, from the beginning, make the async/sync io and
> mapped/unmapped distinction that fs/direct-io.c made?
> 
> Since no issues have been found with ext4 calling the completion work
> directly in the io handler pre-iomap, it is unlikely that doing so is
> unsafe (i.e. that the completion sleeps within an io handler
> callback). However, this may not be true for all filesystems. Does
> XFS potentially sleep in its completion code?
> 
> Allison, Ted mentioned you saw a similar problem when doing
> performance testing for the latest version of Unbreakable Linux. Can
> you test/verify that this patch addresses your performance regression
> as well?

Hi Jeremy,

Sure, I can link this patch to the bug report to see if the tester sees
any improvements. If anything, I think it's a good data point to have,
even if we are still considering other solutions.

Thanks!

Allison

> 
> Jeremy Bongio (1):
>   For DIO writes with no mapped pages for inode, skip deferring
>   completion.
> 
>  fs/iomap/direct-io.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 