On 10/2/21 6:15 AM, Pavel Begunkov wrote:
> Currently, we allocate one ctx reference per request at submission time
> and put them at free. It's batched and not so expensive, but it still
> bloats the kernel, adds 2 function calls for rcu, and adds some overhead
> for request counting in io_free_batch_list().
>
> Always keep one reference with a request, even when it's freed and in
> io_uring request caches. There is extra work at ring exit / quiesce
> paths, which now need to put all cached requests. io_ring_exit_work() is
> already looping, so it's not a problem. Add hybrid-busy waiting to
> io_ctx_quiesce() as well for now.
>
> Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
> ---
>
> I want to get rid of the extra request ctx referencing, but across
> different kernel versions I have been getting "interesting" results,
> losing performance for the nops test. Thus, it's only an RFC to see
> whether I'm the only one seeing weird effects.

I ran this through the usual peak per-core testing:

Setup 1: 3970X, this one ends up being core limited
Setup 2: 5950X, this one ends up being device limited

Peak-1-thread is:

taskset -c 16 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 /dev/nvme1n1

Peak-2-threads is:

taskset -c 0,16 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n2 /dev/nvme2n1 /dev/nvme1n1

where 0/16 are thread siblings.

NOPS is:

taskset -c 16 t/io_uring -b512 -d128 -s32 -c32 -N1

Results are in IOPS, and peak-2-threads is only run on the faster box.

Setup/Test   | Peak-1-thread  Peak-2-threads  NOPS   Diff
------------------------------------------------------------------
Setup 1 pre  | 3.81M          N/A             47.0M
Setup 1 post | 3.84M          N/A             47.6M  +0.8-1.2%
Setup 2 pre  | 5.11M          5.70M           70.3M
Setup 2 post | 5.17M          5.75M           73.1M  +1.2-4.0%

Looks like a nice win to me, on both setups.

-- 
Jens Axboe