Re: [PATCHv3] io_uring: set plug tags for same file

Jens Axboe <axboe@xxxxxxxxx> · Fri, 11 Aug 2023 13:24:17 -0600

On 7/31/23 2:39 PM, Keith Busch wrote:
> From: Keith Busch <kbusch@xxxxxxxxxx>
> 
> io_uring tries to optimize allocating tags by hinting to the plug how
> many it expects to need for a batch instead of allocating each tag
> individually. But io_uring submission queues may have a mix of many
> devices for io, so the number of io's counted may be overestimated. This
> can lead to allocating too many tags, which adds overhead to finding
> that many contiguous tags, freeing up the ones we didn't use, and may
> starve out other users that can actually use them.
> 
> When starting a new batch of uring commands, count only commands that
> match the file descriptor of the first seen for this optimization. This
> avoids having to call the unlikely "blk_mq_free_plug_rqs()" at the end of
> a submission when multiple devices are used in a batch.
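
For reference, a rough model of the counting described above. This is
Python rather than kernel C, and count_same_file() plus the plain list
of file descriptors are made up for illustration; the actual patch
works on the submission queue entries and is not reproduced here.

```python
def count_same_file(batch_fds):
    """Count the entries in a batch that target the same file as the first one.

    batch_fds is a hypothetical stand-in for the files referenced by the
    commands in one submission batch.
    """
    if not batch_fds:
        return 0
    first = batch_fds[0]
    return sum(1 for fd in batch_fds if fd == first)


# One device per submitter: every command matches the first file,
# so the hint equals the batch size.
single_device = [3] * 32

# Two devices interleaved in one batch: only half of the commands
# match the first file, so the hint is halved.
two_devices = [3, 4] * 16

print(count_same_file(single_device))
print(count_same_file(two_devices))
```

In this model the hint passed to the plug would be 32 for the
single-device batch and 16 for the interleaved one, where a count of
all commands would give 32 in both cases.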

Wanted to run this through both the peak IOPS and networking testing,
started with the former. Here's a peak run with -git + pending 6.5
changes + pending 6.6 changes:

IOPS=125.88M, BW=61.46GiB/s, IOS/call=32/31
IOPS=125.39M, BW=61.23GiB/s, IOS/call=31/31
IOPS=124.97M, BW=61.02GiB/s, IOS/call=32/32
IOPS=124.60M, BW=60.84GiB/s, IOS/call=32/32
IOPS=124.27M, BW=60.68GiB/s, IOS/call=31/31
IOPS=124.00M, BW=60.54GiB/s, IOS/call=32/31

and here's one with the patch:

IOPS=121.69M, BW=59.42GiB/s, IOS/call=31/32
IOPS=121.26M, BW=59.21GiB/s, IOS/call=32/32
IOPS=120.87M, BW=59.02GiB/s, IOS/call=31/31
IOPS=120.87M, BW=59.02GiB/s, IOS/call=32/32
IOPS=121.02M, BW=59.09GiB/s, IOS/call=32/32
IOPS=121.63M, BW=59.39GiB/s, IOS/call=31/32
IOPS=121.48M, BW=59.32GiB/s, IOS/call=31/31
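
To put a number on the gap between the two runs above, here is a small
sketch that averages the IOPS samples and reports the relative change.
The helper name relative_change() is just for illustration, and a plain
mean over these few samples is an assumption about how to summarize
them, not a rigorous comparison.

```python
from statistics import mean

# IOPS samples in millions, copied from the two runs above.
stock = [125.88, 125.39, 124.97, 124.60, 124.27, 124.00]
patched = [121.69, 121.26, 120.87, 120.87, 121.02, 121.63, 121.48]


def relative_change(before, after):
    """Percent change of mean(after) relative to mean(before)."""
    b = mean(before)
    a = mean(after)
    return (a - b) / b * 100.0


print(f"stock mean:   {mean(stock):.2f}M IOPS")
print(f"patched mean: {mean(patched):.2f}M IOPS")
print(f"change:       {relative_change(stock, patched):+.2f}%")
```

That works out to a drop of a bit under 3% with the patch for this
one-device-per-thread run.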

Running a quick profile, here's the top diff:

# Baseline  Delta Abs  Shared Object     Symbol                                     
# ........  .........  ................  ...........................................
#
     6.69%     -3.03%  [kernel.vmlinux]  [k] io_issue_sqe
     9.65%     +2.30%  [nvme]            [k] nvme_poll_cq
     4.86%     -1.55%  [kernel.vmlinux]  [k] io_submit_sqes
     4.61%     +1.40%  [kernel.vmlinux]  [k] blk_mq_submit_bio
     4.79%     -0.98%  [kernel.vmlinux]  [k] io_read
     0.56%     +0.97%  [kernel.vmlinux]  [k] blkdev_dio_unaligned.isra.0
     3.61%     +0.52%  [kernel.vmlinux]  [k] dma_unmap_page_attrs
     2.04%     -0.45%  [kernel.vmlinux]  [k] blk_add_rq_to_plug

Note that perf.data.old here is the kernel with your patch, and
perf.data is the "stock" kernel mentioned above. With the patch, the
main thing looks like spending more time in io_issue_sqe() and
io_submit_sqes(), and conversely we're spending less time polling.
Usually when profiling a polled workload, having more time in the
polling function is good, as it shows us we're spending less time
everywhere else.
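
Since the direction of the diff is easy to get backwards, here is a
short sketch that turns the rows above into per-kernel shares. It
assumes the usual perf diff convention that the baseline column is
perf.data.old and the delta is the new profile minus the baseline; the
helper name other_share() is made up for illustration.

```python
# (symbol, baseline %, delta %) copied from the perf diff output above.
# Baseline is perf.data.old (patched kernel); baseline + delta is
# perf.data (stock kernel).
rows = [
    ("io_issue_sqe", 6.69, -3.03),
    ("nvme_poll_cq", 9.65, +2.30),
    ("io_submit_sqes", 4.86, -1.55),
    ("blk_mq_submit_bio", 4.61, +1.40),
]


def other_share(baseline, delta):
    """Share of samples in the non-baseline profile, in percent."""
    return round(baseline + delta, 2)


for symbol, baseline, delta in rows:
    stock = other_share(baseline, delta)
    print(f"{symbol:20s} patched={baseline:5.2f}%  stock={stock:5.2f}%")
```

For example, io_issue_sqe goes from 3.66% on the stock kernel to 6.69%
with the patch, while nvme_poll_cq goes from 11.95% to 9.65%.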

This is what I'm using:

sudo t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R0 -X1 -n24 -P1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme12n1 /dev/nvme13n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme19n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1

which is submitting 32 requests at a time. Obviously we don't expect a
win in this case, as each thread is handling just a single NVMe device.
The stock kernel will not over-allocate in this case.

If we change that to -n12 instead, meaning each will drive two devices,
here's what the stock kernel gets:

IOPS=60.95M, BW=29.76GiB/s, IOS/call=31/32
IOPS=60.99M, BW=29.78GiB/s, IOS/call=31/32
IOPS=60.96M, BW=29.77GiB/s, IOS/call=31/31
IOPS=60.95M, BW=29.76GiB/s, IOS/call=31/31
IOPS=60.91M, BW=29.74GiB/s, IOS/call=32/32

and with the patch:

IOPS=59.64M, BW=29.12GiB/s, IOS/call=32/31
IOPS=59.63M, BW=29.12GiB/s, IOS/call=31/32
IOPS=59.57M, BW=29.09GiB/s, IOS/call=31/31
IOPS=59.57M, BW=29.09GiB/s, IOS/call=32/32
IOPS=59.65M, BW=29.12GiB/s, IOS/call=31/31

Now these are both obviously lower, but I haven't done anything to
ensure that the two-drives-per-poller workload is optimized. For all I
know, the NUMA layout is now messed up too. That's just a caveat,
though; the two runs are comparable to each other.

Perf diff again looks similar, note that this time it's perf.data.old
that's the stock kernel and perf.data that's the one with your patch:

     3.51%     +2.84%  [kernel.vmlinux]  [k] io_issue_sqe
     3.24%     +1.35%  [kernel.vmlinux]  [k] io_submit_sqes

With the kernel without your patch, I was looking for tag flush overhead
but didn't find much:

     0.02%  io_uring  [kernel.vmlinux]  [k] blk_mq_free_plug_rqs

Outside of the peak worry with the patch, do you have a workload that we
should test this on?

-- 
Jens Axboe