On 10/2/21 6:15 AM, Pavel Begunkov wrote:
> Currently, we allocate one ctx reference per request at submission time
> and put them at free. It's batched and not so expensive, but it still
> bloats the kernel, adds 2 function calls for rcu, and adds some overhead
> for request counting in io_free_batch_list().
>
> Always keep one reference with a request, even when it's freed and in
> io_uring request caches. There is extra work at ring exit / quiesce
> paths, which now need to put all cached requests. io_ring_exit_work() is
> already looping, so it's not a problem. Add hybrid-busy waiting to
> io_ctx_quiesce() as well for now.
>
> Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
> ---
>
> I want to get rid of the extra request ctx referencing, but across
> different kernel versions I have been getting "interesting" results,
> losing performance for the nops test. Thus, it's only an RFC to see
> whether I'm the only one seeing weird effects.

I ran this through the usual peak per-core testing:

Setup 1: 3970X, this one ends up being core limited
Setup 2: 5950X, this one ends up being device limited

Peak-1-thread is:

taskset -c 16 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 /dev/nvme1n1

Peak-2-threads is:

taskset -c 0,16 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n2 /dev/nvme2n1 /dev/nvme1n1

where 0/16 are thread siblings.

NOPS is:

taskset -c 16 t/io_uring -b512 -d128 -s32 -c32 -N1

Results are in IOPS, and peak-2-threads is only run on the faster box.

Setup/Test   | Peak-1-thread  Peak-2-threads  NOPS   Diff
------------------------------------------------------------------
Setup 1 pre  | 3.81M          N/A             47.0M
Setup 1 post | 3.84M          N/A             47.6M  +0.8-1.2%
Setup 2 pre  | 5.11M          5.70M           70.3M
Setup 2 post | 5.17M          5.75M           73.1M  +1.2-4.0%

Looks like a nice win to me, on both setups.

-- 
Jens Axboe