On 7/14/22 12:55 PM, Pavel Begunkov wrote:
>>>>> You dropped comments about TCP testing; any progress there? If not,
>>>>> can you relay any issues you are hitting?
>>>>
>>>> Not really a problem, but for me it's bottlenecked at NIC bandwidth
>>>> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
>>>> Was actually benchmarked by my colleague quite a while ago, but can't
>>>> find numbers. Probably need to at least add localhost numbers or grab
>>>> a better server.
>>>
>>> Testing localhost TCP with a hack (see below), it doesn't include
>>> refcounting optimisations I was testing UDP with and that will be
>>> sent afterwards. Numbers are in MB/s
>>>
>>> IO size | non-zc | zc
>>> 1200    | 4174   | 4148
>>> 4096    | 7597   | 11228
>>
>> I am surprised by the low numbers; you should be able to saturate a 100G
>> link with TCP and ZC TX API.
>
> It was a quick test with my laptop, not a super fast CPU, preemptible
> kernel, etc., and considering the fact that it processes receives
> in the same send syscall, which roughly doubles the overhead, 87Gb/s
> looks ok. It's not like MSG_ZEROCOPY would look much different, even
> more to that all sends here will be executed sequentially in io_uring,
> so no extra parallelism or so. As for 1200, I think 4GB/s is reasonable,
> it's just the kernel overhead per byte is too high, should be same with
> just send(2).

? It's a stream socket so those sends are coalesced into MTU sized packets.

>
>>> Because it's localhost, we also spend cycles here for the recv side.
>>> Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
>>> omitted optimisations will somewhat help. I don't consider it to be a
>>> blocker, but would be interesting to poke into later. One thing helping
>>> non-zc is that it squeezes a number of requests into a single page
>>> whenever zerocopy adds a new frag for every request.
>>>
>>> Can't say anything new for larger payloads, I'm still NIC-bound but
>>> looking at CPU utilisation zc doesn't drain as much cycles as non-zc.
>>> Also, I don't remember if mentioned before, but another catch is that
>>> with TCP it expects users to not be flushing notifications too much,
>>> because it forces it to allocate a new skb and lose a good chunk of
>>> benefits from using TCP.
>>
>> I had issues with TCP sockets and io_uring at the end of 2020:
>> https://www.spinics.net/lists/io-uring/msg05125.html
>>
>> have not tried anything recent (from 2022).
>
> Haven't seen it back then. In general io_uring doesn't stop submitting
> requests if one request fails, at least because we're trying to execute
> requests asynchronously. And in general, requests can get executed
> out of order, so most probably submitting a bunch of requests to a single
> TCP sock without any ordering on io_uring side is likely a bug.

TCP socket buffer fills resulting in a partial send (i.e., for a given
sqe submission only part of the write/send succeeded). io_uring was not
handling that case.

I'll try to find some time to resurrect the iperf3 patch and try top of
tree kernel.

>
> You can link io_uring requests, i.e. IOSQE_IO_LINK, guaranteeing
> execution ordering. And if you meant links in the message, I agree
> that it was not the best decision to consider len < sqe->len not
> an error and not breaking links, but it was later added that
> MSG_WAITALL would also change the success condition to
> len==sqe->len. But all that is relevant if you were using linking.
>
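
For anyone following along, here is a minimal sketch of the partial-send case described above, using plain non-blocking sockets in Python rather than io_uring. It only illustrates the general stream-socket behaviour: once the send buffer fills, a send returns fewer bytes than requested and the caller has to resubmit the tail. The helper names (make_stream_pair, send_resubmitting), the AF_UNIX socketpair standing in for a TCP connection, and the tiny SO_SNDBUF value are all assumptions made for the illustration, not anything taken from the patches or from the iperf3 work.

```python
import socket


def make_stream_pair(sndbuf=4096):
    """Return a connected (tx, rx) stream pair with a deliberately small send buffer.

    Assumption: a local socketpair stands in for a TCP connection here.
    """
    tx, rx = socket.socketpair()
    tx.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    tx.setblocking(False)  # a full buffer now gives a short count or BlockingIOError
    return tx, rx


def send_resubmitting(tx, rx, payload):
    """Send all of payload, resubmitting the unsent tail after every short send.

    Returns (counts, received): the byte count of each successful send() call
    and everything the peer read.
    """
    view = memoryview(payload)
    offset = 0
    counts = []
    received = bytearray()
    while offset < len(view):
        try:
            n = tx.send(view[offset:])
        except BlockingIOError:
            # Nothing fit at all: drain the peer so buffer space frees up.
            received += rx.recv(65536)
            continue
        counts.append(n)
        offset += n  # n may be smaller than what was requested
    tx.shutdown(socket.SHUT_WR)
    # Read until EOF so the caller can compare what arrived with what was sent.
    while chunk := rx.recv(65536):
        received += chunk
    return counts, bytes(received)
```

With a payload well above the buffer size, the first entry in counts should come back smaller than the payload length. In io_uring terms the analogous check would be comparing cqe->res against the length requested in the sqe.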