On Wed, May 29, 2024 at 02:39:20PM -0700, Bart Van Assche wrote:
> There is an algorithm in the block layer for maintaining fairness
> across queues that share a tag set. The sbitmap implementation has
> improved so much that we don't need the block layer fairness algorithm
> anymore and we can rely on the sbitmap implementation to guarantee
> fairness.
>
> On my test setup (x86 VM with 72 CPU cores) this patch results in 2.9% more
> IOPS. IOPS have been measured as follows:
>
> $ modprobe null_blk nr_devices=1 completion_nsec=0
> $ fio --bs=4096 --disable_clat=1 --disable_slat=1 --group_reporting=1 \
>       --gtod_reduce=1 --invalidate=1 --ioengine=psync --ioscheduler=none \
>       --norandommap --runtime=60 --rw=randread --thread --time_based=1 \
>       --buffered=0 --numjobs=64 --name=/dev/nullb0 --filename=/dev/nullb0
>
> Additionally, it has been verified as follows that all request queues that
> share a tag set process I/O even if the completion times are different (see
> also test block/035):
> - Create a first request queue with completion time 1 ms and queue
>   depth 64.
> - Create a second request queue with completion time 100 ms and that
>   shares the tag set of the first request queue.
> - Submit I/O to both request queues with fio.
>
> Tests have shown that the IOPS for this test case are 29859 and 318, a
> ratio of about 94. This ratio is close to the completion time ratio.
> While this is unfair, both request queues make progress at a consistent
> pace.

I suggested running a more lopsided workload on a high-contention tag set.
Here's an example fio profile that exaggerates the effect:

---
[global]
rw=randread
direct=1
ioengine=io_uring
time_based
runtime=60
ramp_time=10

[zero]
bs=131072
filename=/dev/nvme0n1
iodepth=256
iodepth_batch_submit=64
iodepth_batch_complete=64

[one]
bs=512
filename=/dev/nvme0n2
iodepth=1
--

My test nvme device has 2 namespaces, 1 IO queue, and only 63 tags.

Without your patch:

  zero: (groupid=0, jobs=1): err= 0: pid=465: Thu May 30 13:29:43 2024
    read: IOPS=14.0k, BW=1749MiB/s (1834MB/s)(103GiB/60002msec)
     lat (usec): min=2937, max=40980, avg=16990.33, stdev=1732.37
  ...
  one: (groupid=0, jobs=1): err= 0: pid=466: Thu May 30 13:29:43 2024
    read: IOPS=2726, BW=1363KiB/s (1396kB/s)(79.9MiB/60001msec)
     lat (usec): min=45, max=4859, avg=327.52, stdev=335.25

With your patch:

  zero: (groupid=0, jobs=1): err= 0: pid=341: Thu May 30 13:36:26 2024
    read: IOPS=14.8k, BW=1852MiB/s (1942MB/s)(109GiB/60004msec)
     lat (usec): min=3103, max=26191, avg=16322.77, stdev=1138.04
  ...
  one: (groupid=0, jobs=1): err= 0: pid=342: Thu May 30 13:36:26 2024
    read: IOPS=1841, BW=921KiB/s (943kB/s)(54.0MiB/60001msec)
     lat (usec): min=51, max=5935, avg=503.81, stdev=608.41

So there's definitely a difference here: the less-used device is harmed for
a modest gain on the more heavily used one. Does it matter? I really don't
know if I can answer that; I'm just pointing out that the behavior is
different.
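
As a side note, the queue and tag configuration mentioned above is the sort
of thing that can be read back from the blk-mq sysfs entries. A minimal
sketch, assuming the standard /sys/block/<dev>/mq layout and an illustrative
device name; the values shown simply mirror the "1 IO queue, 63 tags" setup
described above:

  # one directory per hardware queue
  $ ls /sys/block/nvme0n1/mq/
  0
  # tag depth of that hardware queue
  $ cat /sys/block/nvme0n1/mq/0/nr_tags
  63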