On 7/11/22 5:56 AM, Pavel Begunkov wrote:
> On 7/8/22 15:26, Pavel Begunkov wrote:
>> On 7/8/22 05:10, David Ahern wrote:
>>> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>>>> NOTE: Not be picked directly. After getting necessary acks, I'll be
>>>> working out merging with Jakub and Jens.
>>>>
>>>> The patchset implements io_uring zerocopy send. It works with both
>>>> registered and normal buffers, mixing is allowed but not recommended.
>>>> Apart from usual request completions, just as with MSG_ZEROCOPY,
>>>> io_uring separately notifies the userspace when buffers are freed
>>>> and can be reused (see API design below), which is delivered into
>>>> io_uring's Completion Queue. Those "buffer-free" notifications are
>>>> not necessarily per request, but the userspace has control over it
>>>> and should explicitly attaching a number of requests to a single
>>>> notification. The series also adds some internal optimisations when
>>>> used with registered buffers like removing page referencing.
>>>>
>>>> From the kernel networking perspective there are two main changes.
>>>> The first one is passing ubuf_info into the network layer from
>>>> io_uring (inside of an in kernel struct msghdr). This allows extra
>>>> optimisations, e.g. ubuf_info caching on the io_uring side, but also
>>>> helps to avoid cross-referencing and synchronisation problems. The
>>>> second part is an optional optimisation removing page referencing
>>>> for requests with registered buffers.
>>>>
>>>> Benchmarking with an optimised version of the selftest (see [1]),
>>>> which sends a bunch of requests, waits for completions and repeats.
>>>> "+ flush" column posts one additional "buffer-free" notification per
>>>> request, and just "zc" doesn't post buffer notifications at all.
>>>>
>>>> NIC (requests / second):
>>>> IO size | non-zc  | zc              | zc + flush
>>>> 4000    | 495134  | 606420 (+22%)   | 558971 (+12%)
>>>> 1500    | 551808  | 577116 (+4.5%)  | 565803 (+2.5%)
>>>> 1000    | 584677  | 592088 (+1.2%)  | 560885 (-4%)
>>>> 600     | 596292  | 598550 (+0.4%)  | 555366 (-6.7%)
>>>>
>>>> dummy (requests / second):
>>>> IO size | non-zc  | zc              | zc + flush
>>>> 8000    | 1299916 | 2396600 (+84%)  | 2224219 (+71%)
>>>> 4000    | 1869230 | 2344146 (+25%)  | 2170069 (+16%)
>>>> 1200    | 2071617 | 2361960 (+14%)  | 2203052 (+6%)
>>>> 600     | 2106794 | 2381527 (+13%)  | 2195295 (+4%)
>>>>
>>>> Previously it also brought a massive performance speedup compared to
>>>> the msg_zerocopy tool (see [3]), which is probably not super
>>>> interesting.
>>>>
>>>
>>> can you add a comment that the above results are for UDP.
>>
>> Oh, right, forgot to add it
>>
>>
>>> You dropped comments about TCP testing; any progress there? If not, can
>>> you relay any issues you are hitting?
>>
>> Not really a problem, but for me it's bottle necked at NIC bandwidth
>> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
>> Was actually benchmarked by my colleague quite a while ago, but can't
>> find numbers. Probably need to at least add localhost numbers or grab
>> a better server.
>
> Testing localhost TCP with a hack (see below), it doesn't include
> refcounting optimisations I was testing UDP with and that will be
> sent afterwards. Numbers are in MB/s
>
> IO size | non-zc | zc
> 1200    | 4174   | 4148
> 4096    | 7597   | 11228

I am surprised by the low numbers; you should be able to saturate a 100G
link with TCP and ZC TX API.

>
> Because it's localhost, we also spend cycles here for the recv side.
> Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
> omitted optimisations will somewhat help. I don't consider it to be a
> blocker. but would be interesting to poke into later.
> One thing helping non-zc is that it squeezes a number of requests into
> a single page whenever zerocopy adds a new frag for every request.
>
> Can't say anything new for larger payloads, I'm still NIC-bound but
> looking at CPU utilisation zc doesn't drain as much cycles as non-zc.
> Also, I don't remember if mentioned before, but another catch is that
> with TCP it expects users to not be flushing notifications too much,
> because it forces it to allocate a new skb and lose a good chunk of
> benefits from using TCP.

I had issues with TCP sockets and io_uring at the end of 2020:
https://www.spinics.net/lists/io-uring/msg05125.html

I have not tried anything recent (from 2022).
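
For scale against the 100G remark, here is a rough conversion of the
localhost TCP numbers quoted above into Gbit/s. It is only a sketch: it
assumes "MB/s" means 10^6 bytes per second (the thread does not say), and
the helper names are made up for illustration. Localhost runs also pay for
the recv side on the same machine, so this is not a like-for-like
comparison with a real 100G NIC.

```python
# Localhost TCP throughput quoted in the thread, in MB/s, keyed by IO size.
# Assumption: 1 MB = 10**6 bytes; the thread does not state the unit base.
RESULTS_MB_S = {
    1200: {"non-zc": 4174, "zc": 4148},
    4096: {"non-zc": 7597, "zc": 11228},
}
LINK_GBIT_S = 100  # nominal line rate of the link mentioned in the reply


def mb_s_to_gbit_s(mb_per_s):
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_per_s * 8 / 1000


def link_fraction(mb_per_s, link_gbit_s=LINK_GBIT_S):
    """Fraction of the nominal link rate that a given throughput represents."""
    return mb_s_to_gbit_s(mb_per_s) / link_gbit_s


if __name__ == "__main__":
    for size, row in RESULTS_MB_S.items():
        for mode, mb_s in row.items():
            print(f"{size:>5} {mode:>6}: {mb_s_to_gbit_s(mb_s):6.1f} Gbit/s "
                  f"({link_fraction(mb_s):.0%} of {LINK_GBIT_S}G)")
```

Under that unit assumption, the 4096-byte zc result works out to a bit
under 90 Gbit/s, while the 1200-byte results sit around a third of the
nominal line rate.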