As Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> reported here:

https://lore.kernel.org/io-uring/34ecb5c9-5822-827f-6e7b-973bea543569@xxxxxxxxx/T/#me32d6897f976e8268284ff5cbdb3696010c2b7ba

we can do a bit better when dealing with inline completions from the
submission path. This patchset cleans up the standard completion logic,
then builds on top of that to allow collecting completions done at
submission time. This allows io_uring to amortize the cost of grabbing
the completion lock and updating the CQ ring.

On a silly t/io_uring NOP test on my laptop, this brings about a 20%
increase in performance. Xuan Zhuo reports that it changes his
SQPOLL-based UDP processing (running at 800K PPS) profile from:

 17.97% [kernel] [k] copy_user_generic_unrolled
 13.92% [kernel] [k] io_commit_cqring
 11.04% [kernel] [k] __io_cqring_fill_event
 10.33% [kernel] [k] udp_recvmsg
  5.94% [kernel] [k] skb_release_data
  4.31% [kernel] [k] udp_rmem_release
  2.68% [kernel] [k] __check_object_size
  2.24% [kernel] [k] __slab_free
  2.22% [kernel] [k] _raw_spin_lock_bh
  2.21% [kernel] [k] kmem_cache_free
  2.13% [kernel] [k] free_pcppages_bulk
  1.83% [kernel] [k] io_submit_sqes
  1.38% [kernel] [k] page_frag_free
  1.31% [kernel] [k] inet_recvmsg

to

 19.99% [kernel] [k] copy_user_generic_unrolled
 11.63% [kernel] [k] skb_release_data
  9.36% [kernel] [k] udp_rmem_release
  8.64% [kernel] [k] udp_recvmsg
  6.21% [kernel] [k] __slab_free
  4.39% [kernel] [k] __check_object_size
  3.64% [kernel] [k] free_pcppages_bulk
  2.41% [kernel] [k] kmem_cache_free
  2.00% [kernel] [k] io_submit_sqes
  1.95% [kernel] [k] page_frag_free
  1.54% [kernel] [k] io_put_req
 [...]
  0.07% [kernel] [k] io_commit_cqring
  0.44% [kernel] [k] __io_cqring_fill_event

which looks much nicer.

Patches are against my for-5.9/io_uring branch.

-- 
Jens Axboe
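
For illustration, the batching idea the cover letter describes can be
sketched in userspace as follows. This is not the actual io_uring
implementation from the series; all names (batch_add, batch_flush,
comp_batch, BATCH_MAX) are hypothetical. The point it demonstrates is
that inline completions are collected in a small per-submission batch
and posted to the CQ ring under a single lock acquisition, rather than
taking the completion lock once per completed request:

```c
/* Userspace sketch, with a pthread mutex standing in for the kernel's
 * completion lock. All identifiers are illustrative, not kernel APIs. */
#include <assert.h>
#include <pthread.h>

#define BATCH_MAX 32
#define RING_MASK 1023

struct cqe {
	unsigned long long user_data;
	int res;
};

struct cq_ring {
	pthread_mutex_t lock;        /* stands in for the completion lock */
	struct cqe entries[RING_MASK + 1];
	unsigned tail;               /* advanced only under the lock */
};

struct comp_batch {
	struct cqe cqes[BATCH_MAX];
	unsigned nr;
};

/* Post all collected completions under one lock acquisition. */
static void batch_flush(struct cq_ring *cq, struct comp_batch *b)
{
	pthread_mutex_lock(&cq->lock);
	for (unsigned i = 0; i < b->nr; i++)
		cq->entries[cq->tail++ & RING_MASK] = b->cqes[i];
	pthread_mutex_unlock(&cq->lock);
	b->nr = 0;
}

/* Called for each request that completes inline at submission time:
 * no lock is taken here unless the batch is full. */
static void batch_add(struct cq_ring *cq, struct comp_batch *b,
		      unsigned long long user_data, int res)
{
	if (b->nr == BATCH_MAX)
		batch_flush(cq, b);
	b->cqes[b->nr].user_data = user_data;
	b->cqes[b->nr].res = res;
	b->nr++;
}
```

Under this scheme a submission loop that completes N requests inline
pays for roughly N/BATCH_MAX lock round-trips instead of N, which is
the amortization the profile above reflects in io_commit_cqring and
__io_cqring_fill_event dropping out of the hot path.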