On 5/30/23 2:45 PM, Keith Busch wrote:
> On Thu, May 04, 2023 at 09:24:27AM -0700, Keith Busch wrote:
>> From: Keith Busch <kbusch@xxxxxxxxxx>
>>
>> io_uring tries to optimize allocating tags by hinting to the plug how
>> many it expects to need for a batch instead of allocating each tag
>> individually. But io_uring submission queues may have a mix of many
>> devices for io, so the number of io's counted may be overestimated. This
>> can lead to allocating too many tags, which adds overhead to finding
>> that many contiguous tags, freeing up the ones we didn't use, and may
>> starve out other users that can actually use them.
>
> When running batched IO to multiple nvme drives, like with t/io_uring,
> this shows a tiny improvement in CPU utilization from avoiding the
> unlikely cleanup condition in __blk_flush_plug() shown below:
>
>	if (unlikely(!rq_list_empty(plug->cached_rq)))
>		blk_mq_free_plug_rqs(plug);

Conceptually the patch certainly makes sense, and is probably as close
to ideal on tag reservations as we can get without making things a lot
more complicated (or relying on extra app input). But it does mean we'll
always be doing a second loop over the submission list, which is
obviously not ideal in terms of overhead.

I can see a few potential ways forward (rough sketches of both below):

1) Maybe just submit directly if plugging isn't required? That'd leave
the pending list just the items that do plug, and avoid the extra loop
for those that don't.

2) At least get something out of the extra loop - alloc and prep
requests upfront, then submit. If we have to do N things to submit a
request, it's always going to be better to bundle each sub-step. This
patch doesn't do that, it just kind of takes the easy way out.

What do you think? I think these are important questions to answer,
particularly as the overhead of flushing extraneous tags seems pretty
minuscule.

--
Jens Axboe
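For illustration, a minimal sketch of what (1) could look like: entries
whose target doesn't benefit from plugging are issued on the spot, and
only the rest go on the deferred list, so the tag hint only ever has to
count the deferred entries. Every name below (struct sub_entry,
wants_plug, issue_now(), defer_to_plug(), flush_plug()) is a made-up
stand-in, not the real io_uring or block-layer interface:

#include <stdbool.h>
#include <stddef.h>

/* Stand-in for one queued submission; not a real kernel structure. */
struct sub_entry {
	bool wants_plug;	/* does the target benefit from plugging? */
};

/* Illustrative stand-ins for the real issue/defer/flush paths. */
static void issue_now(struct sub_entry *e) { (void)e; }
static void defer_to_plug(struct sub_entry *e) { (void)e; }
static void flush_plug(void) { }

/*
 * Option (1): submit directly when plugging isn't required, so the
 * deferred list only ever holds entries that actually plug, and the
 * extra loop is skipped for the ones that don't.
 */
static void submit_batch_direct(struct sub_entry *entries, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (entries[i].wants_plug)
			defer_to_plug(&entries[i]);
		else
			issue_now(&entries[i]);
	}
	flush_plug();	/* dispatch whatever was deferred */
}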
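And a similarly hedged sketch of (2): keep the extra loop, but make it
pay for itself by splitting submission into an alloc/prep pass and a
submit pass, so the number of tags taken is exact by construction and
__blk_flush_plug() never has anything extra to hand back. prep_one()
and submit_one() are again hypothetical stand-ins, not real interfaces:

#include <stdbool.h>
#include <stddef.h>

/* Stand-in for one queued submission; not a real kernel structure. */
struct sub_entry {
	void *rq;	/* request (and tag) attached during the prep pass */
};

/* Illustrative stand-ins for per-entry allocation and dispatch. */
static bool prep_one(struct sub_entry *e)
{
	e->rq = e;	/* pretend a request and tag were allocated */
	return true;
}

static void submit_one(struct sub_entry *e) { (void)e; }

/*
 * Option (2): the extra loop remains, but it now does useful work:
 * pass 1 allocates and preps each request, pass 2 submits the batch.
 */
static void submit_batch_two_pass(struct sub_entry *entries, size_t nr)
{
	size_t i, prepped = 0;

	/* Pass 1: allocate and prep a request (hence a tag) per entry. */
	for (i = 0; i < nr; i++) {
		if (!prep_one(&entries[i]))
			break;	/* out of resources: submit what we have */
		prepped++;
	}

	/* Pass 2: dispatch the prepped entries as one batch. */
	for (i = 0; i < prepped; i++)
		submit_one(&entries[i]);
}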