> >>> 1) we pass a bvec, so no page table walks.
> >>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
> >>> still doing page get/put (see 4/12) slashed 4-5%.
> >>> 3) avoiding get_page/put_page in 5/12
> >>> 4) completion events are posted into io_uring's CQ, so no
> >>> extra recvmsg for getting events
> >>> 5) no poll(2) in the code because of io_uring
> >>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
> >>> io_uring caches the structures reducing it to nearly zero-overhead.
> >>
> >> Nice set of complementary optimizations.
> >>
> >> We have looked at adding some of those as independent additions to
> >> msg_zerocopy before, such as long-term pinned regions. One issue with
> >> that is that the pages must remain until the request completes,
> >> regardless of whether the calling process is alive. So it cannot rely
> >> on a pinned range held by a process only.
> >>
> >> If feasible, it would be preferable if the optimizations can be added
> >> to msg_zerocopy directly, rather than adding a dependency on io_uring
> >> to make use of them. But not sure how feasible that is. For some, like
> >> 4 and 5, the answer is clearly it isn't. 6, it probably is?
> >
> > Forgot about 6), io_uring uses the fact that submissions are
> > done under an per ring mutex, and completions are under a per
> > ring spinlock, so there are two lists for them and no extra
> > locking. Lists are spliced in a batched manner, so it's
> > 1 spinlock per N (e.g. 32) cached ubuf_info's allocations.
> >
> > Any similar guarantees for sockets?

For datagrams it might matter, though I'm not sure it would show up in
a profile. The current notification mechanism is quite a bit more
heavyweight than any form of fixed ubuf pool.

For TCP this matters less, as multiple sends are not needed and
completions are coalesced, since they complete in order.
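
To make the two-list scheme from 6) concrete, here is a rough userspace
sketch of the idea as described above. It is not the io_uring code: all
names (struct ubuf, struct ubuf_cache, ubuf_get, ubuf_put,
ubuf_cache_demo) are made up for illustration, a C11 atomic_flag stands
in for the per-ring spinlock, and the get side simply assumes its caller
is already serialised, the way submissions are under the per-ring mutex.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct ubuf_info. */
struct ubuf {
	struct ubuf *next;
};

struct ubuf_cache {
	struct ubuf *submit_list;   /* submitter-only, needs no lock of its own */
	struct ubuf *complete_list; /* guarded by lock */
	atomic_flag lock;           /* stands in for the completion spinlock */
	unsigned long splices;      /* lock acquisitions on the get path */
	unsigned long heap_allocs;  /* cache misses that fell back to malloc */
};

static void cache_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		; /* spin */
}

static void cache_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void ubuf_cache_init(struct ubuf_cache *c, unsigned prefill)
{
	c->submit_list = NULL;
	c->complete_list = NULL;
	c->splices = 0;
	c->heap_allocs = 0;
	atomic_flag_clear(&c->lock);
	while (prefill--) {
		struct ubuf *u = malloc(sizeof(*u));
		assert(u);
		u->next = c->submit_list;
		c->submit_list = u;
	}
}

/* Submission side: the caller is assumed to be serialised already. */
static struct ubuf *ubuf_get(struct ubuf_cache *c)
{
	struct ubuf *u;

	if (!c->submit_list) {
		/* Take everything completed so far in one lock round trip. */
		cache_lock(&c->lock);
		c->submit_list = c->complete_list;
		c->complete_list = NULL;
		cache_unlock(&c->lock);
		c->splices++;
	}
	u = c->submit_list;
	if (!u) {
		c->heap_allocs++;
		return malloc(sizeof(*u));
	}
	c->submit_list = u->next;
	return u;
}

/* Completion side. The sketch takes the lock itself; in the scheme
 * described above the completion lock would already be held. */
static void ubuf_put(struct ubuf_cache *c, struct ubuf *u)
{
	cache_lock(&c->lock);
	u->next = c->complete_list;
	c->complete_list = u;
	cache_unlock(&c->lock);
}

static void ubuf_list_free(struct ubuf *u)
{
	while (u) {
		struct ubuf *next = u->next;
		free(u);
		u = next;
	}
}

/* Single-threaded walk-through: `rounds` rounds of `batch` gets followed
 * by `batch` puts. Returns the number of splices on the get path. */
static unsigned long ubuf_cache_demo(unsigned batch, unsigned rounds,
				     unsigned long *heap_allocs)
{
	struct ubuf_cache c;
	struct ubuf **inflight = malloc(batch * sizeof(*inflight));
	unsigned long splices;
	unsigned r, i;

	assert(inflight);
	ubuf_cache_init(&c, batch);
	for (r = 0; r < rounds; r++) {
		for (i = 0; i < batch; i++)
			inflight[i] = ubuf_get(&c);
		for (i = 0; i < batch; i++)
			ubuf_put(&c, inflight[i]);
	}
	splices = c.splices;
	*heap_allocs = c.heap_allocs;
	ubuf_list_free(c.submit_list);
	ubuf_list_free(c.complete_list);
	free(inflight);
	return splices;
}
```

Calling ubuf_cache_demo(32, 10, &heap) performs 320 gets but takes the
lock on the get path only 9 times, once per round after the prefilled
first one, with no fallback to malloc. That is the "1 spinlock per N
allocations" behaviour. The sketch only works because ubuf_get() has a
single serialised caller; whether a socket's send path can offer that
without adding a lock is the open question.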