On 8/10/21 10:37 AM, Jens Axboe wrote:
> Hi,
>
> This is v3 of this patchset. We're back to passing the cache pointer
> in the kiocb, I do think that's the cleanest and it's also the most
> efficient approach. A patch has been added to remove a member from
> the io_uring req_rw structure, so that the kiocb size bump doesn't
> result in the per-command part of io_kiocb bumping into the next
> cacheline.
>
> Another benefit of this approach is that we get per-ring caching.
> That means if an application splits polled IO into two threads, one
> doing submit and one doing reaps, then we still get the full benefit
> of the bio caching.
>
> The tl;dr here is that we get about a 10% bump in polled performance with
> this patchset, as we can recycle bio structures essentially for free.
> Outside of that, explanations are in each patch. I've also got an iomap patch,
> but I'm trying to keep this to a single user until there's agreement on the
> direction.
>
> Against for-5.15/io_uring, and can also be found in my
> io_uring-bio-cache.3 branch.

As a reference, before the patch:

axboe@amd ~/g/fio (master)> sudo taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 /dev/nvme3n1
i 8, argc 9
Added file /dev/nvme3n1 (submitter 0)
sq_ring ptr = 0x0x7f63a7c42000
sqes ptr    = 0x0x7f63a7c40000
cq_ring ptr = 0x0x7f63a7c3e000
polled=1, fixedbufs=1, register_files=1, buffered=0
QD=128, sq_ring=128, cq_ring=256
submitter=1600
IOPS=3111520, IOS/call=32/31, inflight=128 (128)

or around 3.1M IOPS (single thread, single core), and after:

axboe@amd ~/g/fio (master)> sudo taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 /dev/nvme3n1
i 8, argc 9
Added file /dev/nvme3n1 (submitter 0)
sq_ring ptr = 0x0x7f62726bc000
sqes ptr    = 0x0x7f62726ba000
cq_ring ptr = 0x0x7f62726b8000
polled=1, fixedbufs=1, register_files=1, buffered=0
QD=128, sq_ring=128, cq_ring=256
submitter=1791
IOPS=3417120, IOS/call=32/31, inflight=128 (128)

which is about a ~10% increase in per-core IOPS for this kind of workload.

-- 
Jens Axboe
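
[Editor's illustration, not part of the series: a minimal userspace sketch of the recycling idea described above. The per-ring cache amounts to a free list that the completion path feeds and the submission path pops from, so steady-state polled IO avoids the general allocator. The obj_cache, cache_get, and cache_put names below are invented for this sketch and do not correspond to the kernel APIs in the patchset.]

#include <stdlib.h>
#include <stdio.h>

struct obj {
	struct obj *next;   /* free-list linkage while the object sits in the cache */
	char payload[128];  /* stand-in for the real structure being recycled */
};

struct obj_cache {
	struct obj *free;   /* singly linked list of recycled objects */
	unsigned long hits, misses;
};

/* Allocation path: pop from the cache if possible, else fall back to malloc(). */
static struct obj *cache_get(struct obj_cache *c)
{
	struct obj *o = c->free;

	if (o) {
		c->free = o->next;
		c->hits++;
		return o;
	}
	c->misses++;
	return malloc(sizeof(*o));
}

/* Completion path: push the object back onto the free list instead of freeing it. */
static void cache_put(struct obj_cache *c, struct obj *o)
{
	o->next = c->free;
	c->free = o;
}

int main(void)
{
	struct obj_cache cache = { 0 };

	/* Simulate submit/complete cycles; after the first miss, every get is a hit. */
	for (int i = 0; i < 1000; i++) {
		struct obj *o = cache_get(&cache);
		cache_put(&cache, o);
	}
	printf("hits=%lu misses=%lu\n", cache.hits, cache.misses);

	/* Drain the free list on teardown. */
	while (cache.free) {
		struct obj *o = cache.free;

		cache.free = o->next;
		free(o);
	}
	return 0;
}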