Merging raw block device writes

About 18 months ago, I raised an issue with the merging of direct
writes to raw block devices [1]. At the time, nobody offered any
input on the issue. To recap: when multiple direct write requests
are in flight to a raw block device with I/O scheduler "none", and
requests must wait for budget (i.e., the device is SCSI), the write
requests don't go onto a blk-mq software queue, and no merging is
done. Direct read requests that must wait for budget *do* go onto a
software queue, and merges happen.
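
To make the setup concrete, below is a minimal userspace sketch (my
own illustration, not part of the original report) that creates the
conditions described above: a batch of adjacent O_DIRECT writes kept
in flight against a raw block device, submitted through libaio.
The device name /dev/sdX is a placeholder for a scratch disk whose
contents can be destroyed; build with "gcc repro.c -laio". Whether
the writes actually merged shows up in the "writes merged" field of
/sys/block/sdX/stat, or in the wrqm/s column of "iostat -x".

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QDEPTH	32		/* writes kept in flight at once */
#define BLKSZ	4096		/* must be a multiple of the device's
				   logical block size for O_DIRECT */

int main(void)
{
	struct iocb iocbs[QDEPTH], *iocbps[QDEPTH];
	struct io_event events[QDEPTH];
	io_context_t ctx = 0;
	void *buf;
	int fd, i;

	/* /dev/sdX is a placeholder: use a scratch disk whose
	   contents you can safely destroy */
	fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT requires an aligned buffer */
	if (posix_memalign(&buf, BLKSZ, BLKSZ)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	memset(buf, 0xab, BLKSZ);

	if (io_setup(QDEPTH, &ctx) < 0) {
		fprintf(stderr, "io_setup failed\n");
		return 1;
	}

	/* Submit QDEPTH adjacent 4K writes in a single batch;
	   adjacent requests are merge candidates if they sit on a
	   software queue while waiting for device budget */
	for (i = 0; i < QDEPTH; i++) {
		io_prep_pwrite(&iocbs[i], fd, buf, BLKSZ,
			       (long long)i * BLKSZ);
		iocbps[i] = &iocbs[i];
	}
	if (io_submit(ctx, QDEPTH, iocbps) != QDEPTH) {
		fprintf(stderr, "io_submit failed\n");
		return 1;
	}
	if (io_getevents(ctx, QDEPTH, QDEPTH, events, NULL) < 0) {
		fprintf(stderr, "io_getevents failed\n");
		return 1;
	}

	io_destroy(ctx);
	close(fd);
	free(buf);
	return 0;
}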

Recently, I noticed that the problem has been fixed in the latest
upstream kernel, and I had time to investigate the issue further.
Bisecting shows the problem first occurred in 5.16-rc1 with
commit dc5fc361d891 from Jens Axboe. That commit actually prevents
merging of both reads and writes. But reads were indirectly fixed by
commit 54a88eb838d3 from Pavel Begunkov, also in 5.16-rc1, so the
read problem never appeared in a release. There's no mention of
merging in either commit message, so I suspect the effect on merging
was unintentional in both cases. In 5.16, blkdev_read_iter() does
not create a plug list, while blkdev_write_iter() does. But the
lower-level __blkdev_direct_IO() creates a plug list for both reads
and writes, which is why commit dc5fc361d891 broke both. Commit
54a88eb838d3 then bypassed __blkdev_direct_IO() in most cases, and
the new path does not create a plug list. So reads typically proceed
without a plug list, and merging can happen. Writes still don't
merge because of the plug list in the higher-level
blkdev_write_iter().
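
To make the plumbing concrete, here is roughly the 5.16-era shape of
the write path as I read block/fops.c. This is a simplified sketch,
not verbatim kernel source; error handling, size clamping, and sync
handling are omitted:

static ssize_t blkdev_write_iter(struct kiocb *iocb,
				 struct iov_iter *from)
{
	struct blk_plug plug;
	ssize_t ret;

	/*
	 * While the plug is active, requests are held on the plug
	 * list rather than going onto a blk-mq software queue, so
	 * per the behavior described above, writes waiting for
	 * budget miss the merging the software queue would provide.
	 */
	blk_start_plug(&plug);
	ret = __generic_file_write_iter(iocb, from);
	blk_finish_plug(&plug);
	return ret;
}

blkdev_read_iter() has no plug of its own, so once commit
54a88eb838d3 routed most direct reads around __blkdev_direct_IO()
(which plugs internally), reads were submitted unplugged and could
merge.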

The situation stayed that way until 6.5-rc1 when commit
712c7364655f from Christoph removed the plug list from
blkdev_write_iter().  Again, there's no mention of merging in the
commit message, so fixing the merge problem may be happenstance.

Hyper-V guests and the Azure cloud have a particular interest here
because Hyper-V guests use SCSI as the standard interface to virtual
disks. Azure cloud disks can be throttled to a limited number of
IOPS, so the number of in-flight I/Os can be relatively high, and
merging can help stay within the throttle limits, since adjacent
requests merged into one larger request count as a single I/O
against the limit. On the flip side, this problem hasn't generated
complaints over the last 18 months that I'm aware of, though that
may be more because commercial distros haven't been running 5.16 or
later kernels until relatively recently.

In any case, the 6.5 kernel fixes the problem, at least in the
common cases where there's no plug list.  But I still wonder if
there's a latent problem with the original commit dc5fc361d891
that should be looked at by someone with more blk-mq expertise
than I have.

Michael

[1] https://lore.kernel.org/linux-block/PH0PR21MB3025A7D1326A92A4B8BDB5FED7B59@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/




