On 12/1/21 19:59, Pavel Begunkov wrote:
On 12/1/21 18:10, Willem de Bruijn wrote:
# performance:
The worst case for io_uring is (4), still 1.88 times faster than
msg_zerocopy (2), and there are a couple of "easy" optimisations left
out from the patchset. For 4096 bytes payload zc is only slightly
outperforms non-zc version, the larger payload the wider gap.
I'll get more numbers next time.
Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
much. Notification posting is not a big problem for now, but need
to compare the performance for when io_uring_tx_zerocopy_callback()
is called from IRQ context, and possible rework it to use task_work.
It supports both, regular buffers and fixed ones, but there is a bunch of
optimisations exclusively for io_uring's fixed buffers. For comparison,
normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s
1) we pass a bvec, so no page table walks.
2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
still doing page get/put (see 4/12) slashed 4-5%.
3) avoiding get_page/put_page in 5/12
4) completion events are posted into io_uring's CQ, so no
extra recvmsg for getting events
5) no poll(2) in the code because of io_uring
6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
io_uring caches the structures reducing it to nearly zero-overhead.
Nice set of complementary optimizations.
We have looked at adding some of those as independent additions to
msg_zerocopy before, such as long-term pinned regions. One issue with
that is that the pages must remain until the request completes,
regardless of whether the calling process is alive. So it cannot rely
on a pinned range held by a process only.
If feasible, it would be preferable if the optimizations can be added
to msg_zerocopy directly, rather than adding a dependency on io_uring
to make use of them. But not sure how feasible that is. For some, like
4 and 5, the answer is clearly it isn't. 6, it probably is?
Forgot about 6), io_uring uses the fact that submissions are
done under an per ring mutex, and completions are under a per
ring spinlock, so there are two lists for them and no extra
locking. Lists are spliced in a batched manner, so it's
1 spinlock per N (e.g. 32) cached ubuf_info's allocations.
Any similar guarantees for sockets?
[...]
--
Pavel Begunkov