As Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> reported here:

https://lore.kernel.org/io-uring/34ecb5c9-5822-827f-6e7b-973bea543569@xxxxxxxxx/T/#me32d6897f976e8268284ff5cbdb3696010c2b7ba

we can do a bit better when dealing with inline completions from the
submission path. This patchset cleans up the standard completion logic,
then builds on top of that to allow collecting completions done at
submission time. This allows io_uring to amortize the cost of grabbing
the completion lock and updating the CQ ring.

On a silly t/io_uring NOP test on my laptop, this brings about a 20%
increase in performance. Xuan Zhuo reports that it changes his
SQPOLL-based UDP processing (running at 800K PPS) profile from:

 17.97% [kernel] [k] copy_user_generic_unrolled
 13.92% [kernel] [k] io_commit_cqring
 11.04% [kernel] [k] __io_cqring_fill_event
 10.33% [kernel] [k] udp_recvmsg
  5.94% [kernel] [k] skb_release_data
  4.31% [kernel] [k] udp_rmem_release
  2.68% [kernel] [k] __check_object_size
  2.24% [kernel] [k] __slab_free
  2.22% [kernel] [k] _raw_spin_lock_bh
  2.21% [kernel] [k] kmem_cache_free
  2.13% [kernel] [k] free_pcppages_bulk
  1.83% [kernel] [k] io_submit_sqes
  1.38% [kernel] [k] page_frag_free
  1.31% [kernel] [k] inet_recvmsg

to

 19.99% [kernel] [k] copy_user_generic_unrolled
 11.63% [kernel] [k] skb_release_data
  9.36% [kernel] [k] udp_rmem_release
  8.64% [kernel] [k] udp_recvmsg
  6.21% [kernel] [k] __slab_free
  4.39% [kernel] [k] __check_object_size
  3.64% [kernel] [k] free_pcppages_bulk
  2.41% [kernel] [k] kmem_cache_free
  2.00% [kernel] [k] io_submit_sqes
  1.95% [kernel] [k] page_frag_free
  1.54% [kernel] [k] io_put_req
 [...]
  0.07% [kernel] [k] io_commit_cqring
  0.44% [kernel] [k] __io_cqring_fill_event

which looks much nicer.

Patches are against my for-5.9/io_uring branch.

-- 
Jens Axboe
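
For illustration, the batching idea the cover letter describes can be
sketched in userspace as follows. This is not the actual io_uring
implementation from the series; all names (batch_add, batch_flush,
comp_batch, BATCH_MAX) are hypothetical. The point it demonstrates is
that inline completions are collected in a small per-submission batch
and posted to the CQ ring under a single lock acquisition, rather than
taking the completion lock once per completed request:

```c
/* Userspace sketch, with a pthread mutex standing in for the kernel's
 * completion lock. All identifiers are illustrative, not kernel APIs. */
#include <assert.h>
#include <pthread.h>

#define BATCH_MAX 32
#define RING_MASK 1023

struct cqe {
	unsigned long long user_data;
	int res;
};

struct cq_ring {
	pthread_mutex_t lock;        /* stands in for the completion lock */
	struct cqe entries[RING_MASK + 1];
	unsigned tail;               /* advanced only under the lock */
};

struct comp_batch {
	struct cqe cqes[BATCH_MAX];
	unsigned nr;
};

/* Post all collected completions under one lock acquisition. */
static void batch_flush(struct cq_ring *cq, struct comp_batch *b)
{
	pthread_mutex_lock(&cq->lock);
	for (unsigned i = 0; i < b->nr; i++)
		cq->entries[cq->tail++ & RING_MASK] = b->cqes[i];
	pthread_mutex_unlock(&cq->lock);
	b->nr = 0;
}

/* Called for each request that completes inline at submission time:
 * no lock is taken here unless the batch is full. */
static void batch_add(struct cq_ring *cq, struct comp_batch *b,
		      unsigned long long user_data, int res)
{
	if (b->nr == BATCH_MAX)
		batch_flush(cq, b);
	b->cqes[b->nr].user_data = user_data;
	b->cqes[b->nr].res = res;
	b->nr++;
}
```

Under this scheme a submission loop that completes N requests inline
pays for roughly N/BATCH_MAX lock round-trips instead of N, which is
the amortization the profile above reflects in io_commit_cqring and
__io_cqring_fill_event dropping out of the hot path.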