On 26.06.2023 19:15, Stefano Garzarella wrote:
> On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>> DESCRIPTION
>>
>> This is MSG_ZEROCOPY feature support for virtio/vsock. I tried to
>> follow the current implementation for TCP as much as possible:
>>
>> 1) The sender must enable the SO_ZEROCOPY socket option to use this
>>    feature. Without this option, data is sent in the "classic" copy
>>    manner and the MSG_ZEROCOPY flag is ignored (e.g. no completion is
>>    generated).
>>
>> 2) The kernel delivers completions through the socket's error queue:
>>    a single completion per tx syscall (or several completions merged
>>    into one). I reused the logic already implemented for MSG_ZEROCOPY
>>    support: 'msg_zerocopy_realloc()' etc. (A minimal userspace sketch
>>    of this flow is shown after the limitations list below.)
>>
>> The difference from the copy path is not significant. During packet
>> allocation, a non-linear skb is created and filled with pinned user
>> pages. There are also some updates for the vhost and guest parts of
>> the transport - in both cases I've added handling of non-linear skbs
>> for the virtio part. vhost copies data from such an skb to the guest's
>> rx virtio buffers. In the guest, the virtio transport fills the tx
>> virtio queue with pages from the skb.
>>
>> Head of this patchset is:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>>
>> This version has several limits/problems:
>>
>> 1) As this feature totally depends on the transport, there is no way
>>    (or it is difficult) to check whether the transport is able to
>>    handle it when SO_ZEROCOPY is set. It seems I would need to call an
>>    AF_VSOCK-specific setsockopt callback from the SOL_SOCKET setsockopt
>>    callback, but this leads to a locking problem, because neither the
>>    AF_VSOCK nor the SOL_SOCKET callback is designed to be called from
>>    the other. So in the current version SO_ZEROCOPY is set successfully
>>    on any type (i.e. transport) of AF_VSOCK socket, but if the
>>    transport does not support MSG_ZEROCOPY, the tx routine will fail
>>    with EOPNOTSUPP.
>>
>>    ^^^
>>    This is still not resolved :(
>>
>> 2) When MSG_ZEROCOPY is used, we need to enqueue one completion for
>>    each tx system call. Each completion carries a flag which shows how
>>    the tx was performed: zerocopy or copy. This means the whole message
>>    must be sent either in zerocopy or in copy mode - we can't send part
>>    of a message by copying and the rest in zerocopy mode (or vice
>>    versa). We also have to respect the vsock credit logic, i.e. we
>>    can't send all the data at once - only the allowed number of bytes
>>    may be sent at any moment. In the copy case there is no problem, as
>>    in the worst case we can send single bytes, but zerocopy is more
>>    complex because the smallest transmission unit is a single page. So
>>    if there is not enough space at the peer's side to send an integer
>>    number of pages (at least one), we will wait, thus stalling the tx
>>    side. To overcome this problem I've added a simple rule - zerocopy
>>    is possible only when there is enough space at the other side for
>>    the whole message (to check that the current 'msghdr' was already
>>    used in previous tx iterations I use the 'iov_offset' field of its
>>    iov iter).
>>
>>    ^^^
>>    Discussed as ok during v2. Link:
>>    https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>>
>> 3) The loopback transport is not supported, because it requires
>>    implementing non-linear skb handling in the dequeue logic (as we
>>    "send" a fragged skb and "receive" it from the same queue). I'm
>>    going to implement it in the next versions.
>>
>>    ^^^ fixed in v2
>>
>> 4) The current implementation sets the max packet length to 64KB. IIUC
>>    this is due to the 'kmalloc()'-allocated data buffers. I think in
>>    the MSG_ZEROCOPY case this value could be increased, because
>>    'kmalloc()' is not touched for data - user space pages are used as
>>    buffers. This limit also trims every message which is > 64KB, so
>>    such messages will be sent in copy mode due to the 'iov_offset'
>>    check in 2).
>>
>>    ^^^ fixed in v2
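>>
>> For reference, here is a minimal userspace sketch of the flow from
>> points 1) and 2) of the DESCRIPTION (a hypothetical helper, not code
>> from this patchset; error handling is trimmed, and only 'ee_origin' is
>> checked because the cmsg level used for AF_VSOCK completions may
>> differ from the SOL_IP/IP_RECVERR pair used by TCP):
>>
>>     #include <errno.h>
>>     #include <stdio.h>
>>     #include <sys/socket.h>
>>     #include <linux/errqueue.h>
>>
>>     #ifndef SO_ZEROCOPY
>>     #define SO_ZEROCOPY 60
>>     #endif
>>     #ifndef MSG_ZEROCOPY
>>     #define MSG_ZEROCOPY 0x4000000
>>     #endif
>>
>>     static int send_one_zerocopy(int fd, const void *buf, size_t len)
>>     {
>>             struct sock_extended_err *serr;
>>             struct msghdr msg = { 0 };
>>             struct cmsghdr *cm;
>>             char control[128];
>>             int one = 1;
>>
>>             /* 1) Opt in once per socket. */
>>             if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
>>                     return -errno;
>>
>>             /* 2) Pages of 'buf' stay pinned until the completion. */
>>             if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
>>                     return -errno;
>>
>>             /* 3) Reap the completion from the error queue (real code
>>              * should poll() for POLLERR instead of retrying EAGAIN).
>>              */
>>             msg.msg_control = control;
>>             msg.msg_controllen = sizeof(control);
>>             while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
>>                     if (errno != EAGAIN)
>>                             return -errno;
>>             }
>>
>>             for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
>>                     serr = (struct sock_extended_err *)CMSG_DATA(cm);
>>                     if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
>>                             continue;
>>                     /* ee_info..ee_data is the range of completed sends;
>>                      * the "copied" bit reports the copy fallback.
>>                      */
>>                     printf("ids %u..%u, copied: %d\n", serr->ee_info,
>>                            serr->ee_data,
>>                            !!(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED));
>>             }
>>             return 0;
>>     }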
>>
>> PATCHSET STRUCTURE
>>
>> The patchset has the following structure:
>> 1) Handle non-linear skbuffs on receive in virtio/vhost.
>> 2) Handle non-linear skbuffs on send in virtio/vhost.
>> 3) Updates for AF_VSOCK.
>> 4) Enable MSG_ZEROCOPY support on the transports.
>> 5) Tests/tools/docs updates.
>>
>> PERFORMANCE
>>
>> It is a little bit tricky to compare performance between copy and
>> zerocopy transmissions. In the zerocopy case we need to wait until the
>> user buffers are released by the kernel, so it is like a synchronous
>> path (wait until the device driver has processed them), while in the
>> copy case we can feed data to the kernel as fast as we want, without
>> caring about the device driver. So I compared only the time we spend
>> in the 'send()' syscall; combined with the total number of transmitted
>> bytes, this gives a Gbit/s figure. Also, to avoid tx stalls due to
>> insufficient credit, the receiver allocates the same amount of space
>> as the sender needs.
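>>
>> A simplified sketch of that measurement (just the idea, assuming a
>> connected socket 'fd' and a buffer 'buf'; this is not the actual
>> vsock_perf code):
>>
>>     #include <stdint.h>
>>     #include <time.h>
>>     #include <sys/socket.h>
>>
>>     /* Accumulate only the time spent inside send(), then derive
>>      * Gbit/s. 'flags' is 0 for copy mode or MSG_ZEROCOPY (with the
>>      * completions reaped separately, not shown here).
>>      */
>>     static double tx_gbps(int fd, const void *buf, size_t buf_size,
>>                           size_t total_bytes, int flags)
>>     {
>>             uint64_t in_send_ns = 0, sent = 0;
>>             struct timespec a, b;
>>
>>             while (sent < total_bytes) {
>>                     clock_gettime(CLOCK_MONOTONIC, &a);
>>                     ssize_t n = send(fd, buf, buf_size, flags);
>>                     clock_gettime(CLOCK_MONOTONIC, &b);
>>
>>                     if (n <= 0)
>>                             break;
>>                     in_send_ns += (b.tv_sec - a.tv_sec) * 1000000000ull +
>>                                   (b.tv_nsec - a.tv_nsec);
>>                     sent += n;
>>             }
>>             /* bits per nanosecond == Gbit/s */
>>             return (sent * 8.0) / in_send_ns;
>>     }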
>>
>> Sender:
>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>
>> Receiver:
>> ./vsock_perf --vsk-size 256M
>>
>> I ran the tests on two setups: a desktop with a Core i7 - I use this
>> PC for development, and in this case the guest is a nested guest and
>> the host is a normal guest. The other hardware is an embedded board
>> with an Atom - here I don't have nested virtualization: the host runs
>> on hw and the guest is a normal guest.
>>
>> G2H transmission (values are Gbit/s):
>>
>> Core i7 with nested guest:
>>
>> *-----------------------------*
>> | buf size | copy  | zerocopy |
>> *-----------------------------*
>> | 4KB      | 3     | 10       |
>> | 32KB     | 20    | 61       |
>> | 256KB    | 33    | 244      |
>> | 1M       | 30    | 373      |
>> | 8M       | 22    | 475      |
>> *-----------------------------*
>>
>> Atom with normal guest:
>>
>> *-----------------------------*
>> | buf size | copy  | zerocopy |
>> *-----------------------------*
>> | 4KB      | 0.8   | 1.9      |
>> | 32KB     | 6.8   | 20.2     |
>> | 256KB    | 7.8   | 55       |
>> | 1M       | 7     | 95       |
>> | 8M       | 7     | 114      |
>> *-----------------------------*
>>
>> H2G:
>>
>> Core i7 with nested guest:
>>
>> *-----------------------------*
>> | buf size | copy  | zerocopy |
>> *-----------------------------*
>> | 4KB      | 20    | 10       |
>> | 32KB     | 37    | 75       |
>> | 256KB    | 44    | 299      |
>> | 1M       | 28    | 335      |
>> | 8M       | 27    | 417      |
>> *-----------------------------*
>>
>> Atom with normal guest:
>>
>> *-----------------------------*
>> | buf size | copy  | zerocopy |
>> *-----------------------------*
>> | 4KB      | 4.37  | 3        |
>> | 32KB     | 11    | 18       |
>> | 256KB    | 11    | 62       |
>> | 1M       | 9     | 77       |
>> | 8M       | 9.35  | 115      |
>> *-----------------------------*
>>
>> * Let's look at the first line of both tables - where copy is better
>>   than zerocopy. I analyzed this case more deeply and found that the
>>   bottleneck is the function 'vhost_work_queue()'. With a 4K buffer
>>   size, the caller spends too much time in it in zerocopy mode
>>   (compared to copy mode). This happens only with the 4K buffer size.
>>   This function just calls 'wake_up_process()' and its internal logic
>>   does not depend on the skb, so I think the potential reason (maybe)
>>   is the interval between two calls of this function (i.e. how often
>>   it is called). Note that 'vhost_work_queue()' differs from its
>>   counterpart on the guest's side of the transport:
>>   'virtio_transport_send_pkt()' uses 'queue_work()', which I think is
>>   more optimized for worker purposes than a direct call to
>>   'wake_up_process()'. But again - this is just my assumption.
>
> Thanks for the analysis. However, for small payloads it makes sense
> that the cost might be too high for the optimization to bring benefits.
>
>> Loopback:
>>
>> Core i7 with nested guest:
>>
>> *-----------------------------*
>> | buf size | copy  | zerocopy |
>> *-----------------------------*
>> | 4KB      | 8     | 7        |
>> | 32KB     | 38    | 44       |
>> | 256KB    | 55    | 168      |
>> | 1M       | 53    | 250      |
>> | 8M       | 40    | 344      |
>> *-----------------------------*
>>
>> Atom with normal guest:
>>
>> *-----------------------------*
>> | buf size | copy  | zerocopy |
>> *-----------------------------*
>> | 4KB      | 1.8   | 1.3      |
>> | 32KB     | 10    | 10       |
>> | 256KB    | 15    | 36       |
>> | 1M       | 12    | 45       |
>> | 8M       | 11    | 74       |
>> *-----------------------------*
>>
>> I analyzed the performance difference more deeply for the following
>> setup:
>> server: ./vsock_perf --vsk-size 16M
>> client: ./vsock_perf --sender 2 --bytes 16M --buf-size 16K/4K [--zc]
>>
>> In other words, I send 16M of data from guest to host in copy/zerocopy
>> modes and with two different buffer sizes - 4K and 64K. Let's look at
>> the tx path for both modes - it consists of two steps:
>>
>> copy:
>> 1) Allocate an skb of the buffer's length.
>> 2) Copy data from the buffer to the skb.
>>
>> zerocopy:
>> 1) Allocate an skb with header space only.
>> 2) Pin the pages of the buffer and insert them into the skb.
>>
>> I measured the average number of ns (returned by 'ktime_get()') for
>> each of the steps above:
>> 1) Skb allocation (for both copy and zerocopy modes).
>> 2) For copy mode in 'memcpy_to_msg()' - copying.
>> 3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.
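>>
>> Roughly, the instrumentation looked like this (a hypothetical sketch
>> of the pattern only, not code from the patches; the probed call is
>> shown just as a comment):
>>
>>     #include <linux/ktime.h>
>>     #include <linux/math64.h>
>>     #include <linux/printk.h>
>>
>>     static u64 probe_total_ns;
>>     static u64 probe_samples;
>>
>>     /* Account one sample and periodically print the running average. */
>>     static inline void probe_account(ktime_t start)
>>     {
>>             probe_total_ns += ktime_to_ns(ktime_sub(ktime_get(), start));
>>             if (!(++probe_samples % 1024))
>>                     pr_info("avg step cost: %llu ns\n",
>>                             div64_u64(probe_total_ns, probe_samples));
>>     }
>>
>>     /*
>>      * Usage around the step under test, e.g.:
>>      *
>>      *      ktime_t t = ktime_get();
>>      *      ...the skb allocation / memcpy / pinning call...
>>      *      probe_account(t);
>>      */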
>>
>> Here are the results for copy mode:
>>
>> *-------------------------------------*
>> | buf | skb alloc | 'memcpy_to_msg()' |
>> *-------------------------------------*
>> | 64K | 5000ns    | 25000ns           |
>> | 4K  | 800ns     | 2200ns            |
>> *-------------------------------------*
>>
>> Here are the results for zerocopy mode:
>>
>> *-----------------------------------------------*
>> | buf | skb alloc | '__zerocopy_sg_from_iter()' |
>> *-----------------------------------------------*
>> | 64K | 250ns     | 3500ns                      |
>> | 4K  | 250ns     | 3000ns                      |
>> *-----------------------------------------------*
>>
>> I guess the reason for the zerocopy performance is the low overhead of
>> page pinning: there is a big difference between 64K and 4K in the
>> copying case (25000 vs 2200), but in the pinning case it is just
>> 3500 vs 3000.
>>
>> So, zerocopy is faster than the classic copy mode, but of course it
>> requires a specific application architecture because of the user page
>> pinning, buffer size and alignment constraints.
>
> Makes sense!
>
>> NOTES
>>
>> If the host fails to send data with "Cannot allocate memory", check
>> the value of /proc/sys/net/core/optmem_max - it is accounted during
>> completion skb allocation. Try increasing it, for example to 1M, and
>> send again: "echo 1048576 > /proc/sys/net/core/optmem_max" (as root).
>>
>> TESTING
>>
>> This patchset includes a set of tests for the MSG_ZEROCOPY feature. I
>> tried to cover the new code as much as possible, so there are
>> different cases for MSG_ZEROCOPY transmissions: with SO_ZEROCOPY
>> disabled and with several io vector types (different sizes,
>> alignments, unmapped pages). I also ran tests with the loopback
>> transport and under vsockmon. In v3 I've added an io_uring test as a
>> separate application.
>>
>> LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>>
>> In v3 Stefano Garzarella <sgarzare@xxxxxxxxxx> asked to split this
>> patchset into several parts, because it looks too big to review. I
>> think in this version (v4) we can do it in the following way:
>>
>> [0001 - 0005] - preparation for the virtio/vhost part.
>> [0006 - 0009] - preparation for the AF_VSOCK part.
>> [0010 - 0013] - these patches allow triggering the logic from the
>>                 previous two parts.
>> [0014 - rest] - updates for docs, tests, utils. This part doesn't
>>                 touch kernel code and is not critical.
>
> Yeah, I like this split, but I'd include 14 in the (10, 13) group.
>
> I have reviewed most of them and I think we are well on our way :-)
> I've already seen that Bobby suggested changes for v5, so I'll review
> that version better.
>
> Great work so far!

Hello Stefano! Thanks for the review! I left some questions, but most of
the comments are clear to me. So I guess the idea of the split is that I
still keep all patches in one big patchset, but preserve the structure
described above, and we do the review step by step according to that
split? Or should I split this patchset into 3 separate sets? I guess that
would be more complex to review...

Thanks, Arseniy

> Thanks,
> Stefano
>