Hi,

Even if the caller knows that N requests will be submitted, we still have
to go all the way to tag allocation for each one. That's somewhat
inefficient, when we could just grab as many tags as we need initially
instead.

This small series allows request caching in the blk_plug. We add a new
helper for passing in how many requests we expect to submit,
blk_start_plug_nr_ios(), and then we can use that information to populate
the cache on the first request allocation. Subsequent queue attempts can
then just grab a pre-allocated request out of the cache rather than go
into the tag allocation again.

This brings single core performance on my setup from:

taskset -c 0,16 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1
Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme2n1 (submitter 1)
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=256
submitter=0, tid=1206
submitter=1, tid=1207
IOPS=5806400, BW=2835MiB/s, IOS/call=32/32, inflight=(77 32)
IOPS=5860352, BW=2861MiB/s, IOS/call=32/31, inflight=(102 128)
IOPS=5844800, BW=2853MiB/s, IOS/call=32/32, inflight=(102 32)

to:

taskset -c 0,16 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1
Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme2n1 (submitter 1)
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=256
submitter=0, tid=1220
submitter=1, tid=1221
IOPS=6061248, BW=2959MiB/s, IOS/call=32/31, inflight=(114 82)
IOPS=6100672, BW=2978MiB/s, IOS/call=32/31, inflight=(128 128)
IOPS=6082496, BW=2969MiB/s, IOS/call=32/32, inflight=(77 38)

which is about a 4-5% improvement in single core performance. Looking at
profiles, we're spending _less_ time in sbitmap even though we're doing
more work.

V2:
- Remove hunk from another patch in patch 2
- Make the plugging limits block layer private

-- 
Jens Axboe
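
[Editor's note: a minimal sketch of the caller pattern the series enables.
Only blk_start_plug_nr_ios(), blk_start_plug(), blk_finish_plug() and
submit_bio() are existing/new kernel API; the batch helper below and its
parameters are hypothetical, purely for illustration.]

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Hypothetical caller: submit a known batch of already-built bios under
 * one plug, so the block layer can pre-populate its request cache on the
 * first allocation instead of hitting tag allocation once per request.
 */
static void submit_bio_batch(struct bio **bios, unsigned short nr_ios)
{
	struct blk_plug plug;
	unsigned short i;

	/* Tell the plug how many submissions to expect; plain
	 * blk_start_plug() behaves as if nr_ios were 1. */
	blk_start_plug_nr_ios(&plug, nr_ios);

	for (i = 0; i < nr_ios; i++)
		submit_bio(bios[i]);

	/* Flush pending requests and release the plug */
	blk_finish_plug(&plug);
}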