On 7/31/23 2:39 PM, Keith Busch wrote:
> From: Keith Busch <kbusch@xxxxxxxxxx>
>
> io_uring tries to optimize allocating tags by hinting to the plug how
> many it expects to need for a batch instead of allocating each tag
> individually. But io_uring submission queues may have a mix of many
> devices for io, so the number of io's counted may be overestimated. This
> can lead to allocating too many tags, which adds overhead to finding
> that many contiguous tags, freeing up the ones we didn't use, and may
> starve out other users that can actually use them.
>
> When starting a new batch of uring commands, count only commands that
> match the file descriptor of the first seen for this optimization. This
> avoids having to call the unlikely "blk_mq_free_plug_rqs()" at the end of
> a submission when multiple devices are used in a batch.

I wanted to run this through both the peak IOPS and networking testing,
starting with the former. Here's a peak run with -git + pending 6.5
changes + pending 6.6 changes:

IOPS=125.88M, BW=61.46GiB/s, IOS/call=32/31
IOPS=125.39M, BW=61.23GiB/s, IOS/call=31/31
IOPS=124.97M, BW=61.02GiB/s, IOS/call=32/32
IOPS=124.60M, BW=60.84GiB/s, IOS/call=32/32
IOPS=124.27M, BW=60.68GiB/s, IOS/call=31/31
IOPS=124.00M, BW=60.54GiB/s, IOS/call=32/31

and here's one with the patch:

IOPS=121.69M, BW=59.42GiB/s, IOS/call=31/32
IOPS=121.26M, BW=59.21GiB/s, IOS/call=32/32
IOPS=120.87M, BW=59.02GiB/s, IOS/call=31/31
IOPS=120.87M, BW=59.02GiB/s, IOS/call=32/32
IOPS=121.02M, BW=59.09GiB/s, IOS/call=32/32
IOPS=121.63M, BW=59.39GiB/s, IOS/call=31/32
IOPS=121.48M, BW=59.32GiB/s, IOS/call=31/31

Running a quick profile, here's the top diff:

# Baseline  Delta Abs  Shared Object     Symbol
# ........  .........  ................  ...........................................
#
     6.69%     -3.03%  [kernel.vmlinux]  [k] io_issue_sqe
     9.65%     +2.30%  [nvme]            [k] nvme_poll_cq
     4.86%     -1.55%  [kernel.vmlinux]  [k] io_submit_sqes
     4.61%     +1.40%  [kernel.vmlinux]  [k] blk_mq_submit_bio
     4.79%     -0.98%  [kernel.vmlinux]  [k] io_read
     0.56%     +0.97%  [kernel.vmlinux]  [k] blkdev_dio_unaligned.isra.0
     3.61%     +0.52%  [kernel.vmlinux]  [k] dma_unmap_page_attrs
     2.04%     -0.45%  [kernel.vmlinux]  [k] blk_add_rq_to_plug

Note that this is perf.data.old being the kernel with your patch, and
perf.data being the "stock" kernel mentioned above. The main difference
is that we're spending more time in io_issue_sqe() and io_submit_sqes(),
and conversely less time polling. Usually when profiling a polled
workload, having more time in the polling function is good, as it shows
we're spending less time everywhere else.

This is what I'm using:

sudo t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R0 -X1 -n24 -P1 \
	/dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
	/dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 \
	/dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 \
	/dev/nvme12n1 /dev/nvme13n1 /dev/nvme14n1 /dev/nvme15n1 \
	/dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme19n1 \
	/dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1

which submits 32 requests at a time. Obviously we don't expect a win in
this case, as each thread is handling just a single NVMe device, so the
stock kernel will not over-allocate here.
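
Before moving to the two-devices-per-thread runs, a rough userspace
sketch of what the counting described in the patch amounts to, just to
make the mixed-fd angle concrete. The struct and helper names here are
made up for illustration; this is not the actual io_uring/block code,
only a model of the description quoted above:

#include <stdio.h>

/* Stand-in for an SQE; only the fd matters for this illustration. */
struct sqe_stub {
	int fd;
};

/*
 * Hypothetical helper: size the plug's tag-allocation hint by counting
 * only the SQEs in the batch that target the same file as the first
 * one, instead of using the whole batch size.
 */
static unsigned int plug_nr_ios_hint(const struct sqe_stub *sqes,
				     unsigned int nr)
{
	unsigned int i, count = 0;

	for (i = 0; i < nr; i++)
		if (sqes[i].fd == sqes[0].fd)
			count++;
	return count;
}

int main(void)
{
	/* 8 SQEs spread across two devices: old hint 8, new hint 5 */
	struct sqe_stub batch[] = {
		{ 3 }, { 3 }, { 4 }, { 3 }, { 4 }, { 3 }, { 4 }, { 3 },
	};

	printf("plug hint = %u (batch size = %zu)\n",
	       plug_nr_ios_hint(batch, sizeof(batch) / sizeof(batch[0])),
	       sizeof(batch) / sizeof(batch[0]));
	return 0;
}

With a single device per thread (the -n24 case above), every SQE in a
batch shares one fd, so the hint is the same either way; the cases below
are where the counts start to diverge.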
If we change that to -n12 instead, meaning each thread will drive two
devices, here's what the stock kernel gets:

IOPS=60.95M, BW=29.76GiB/s, IOS/call=31/32
IOPS=60.99M, BW=29.78GiB/s, IOS/call=31/32
IOPS=60.96M, BW=29.77GiB/s, IOS/call=31/31
IOPS=60.95M, BW=29.76GiB/s, IOS/call=31/31
IOPS=60.91M, BW=29.74GiB/s, IOS/call=32/32

and with the patch:

IOPS=59.64M, BW=29.12GiB/s, IOS/call=32/31
IOPS=59.63M, BW=29.12GiB/s, IOS/call=31/32
IOPS=59.57M, BW=29.09GiB/s, IOS/call=31/31
IOPS=59.57M, BW=29.09GiB/s, IOS/call=32/32
IOPS=59.65M, BW=29.12GiB/s, IOS/call=31/31

Both are obviously lower, but I haven't done anything to ensure that the
two-drives-per-poller workload is optimized; for all I know, the NUMA
layout is now messed up too. Just a caveat, but the two runs are still
comparable to each other.

The perf diff again looks similar; note that this time it's
perf.data.old that's the stock kernel and perf.data that's the kernel
with your patch:

     3.51%     +2.84%  [kernel.vmlinux]  [k] io_issue_sqe
     3.24%     +1.35%  [kernel.vmlinux]  [k] io_submit_sqes

On the kernel without your patch, I was looking for tag flush overhead
but didn't find much:

     0.02%  io_uring  [kernel.vmlinux]  [k] blk_mq_free_plug_rqs

Outside of the peak IOPS concern with the patch, do you have a workload
that we should test this on?

-- 
Jens Axboe