On 11.11.2022 17:06, Stefano Garzarella wrote:
> On Fri, Nov 11, 2022 at 2:47 PM Stefano Garzarella <sgarzare@xxxxxxxxxx> wrote:
>>
>> Hi Arseniy,
>> maybe we should start rebasing this series on the new support for
>> skbuff:
>> https://lore.kernel.org/lkml/20221110171723.24263-1-bobby.eshleman@xxxxxxxxxxxxx/
>>
>> CCing Bobby to see if it's easy to integrate since you're both changing
>> the packet allocation.

Hello Stefano,

Sure! I was waiting for the next version of the skbuff support (the
previous one was in August, as I remember):
1) It should be implemented on top of skbuff anyway, since skbuff is the
   general, well-known structure for networking :)
2) It will simplify MSG_ZEROCOPY support, because that uses an API based
   on skbuff.

>>
>>
>> On Sun, Nov 06, 2022 at 07:33:41PM +0000, Arseniy Krasnov wrote:
>>>
>>>
>>>                             INTRODUCTION
>>>
>>> Hello,
>>>
>>> This is an experimental patchset for virtio vsock zerocopy receive.
>>> It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>> the same idea: call 'mmap()' on the socket's descriptor, then call
>>> 'getsockopt()' to fill the provided vma area with pages of virtio
>>> receive buffers. After the received data has been processed by the
>>> user, the pages must be freed by an 'madvise()' call with the
>>> MADV_DONTNEED flag set (but if the user does not call 'madvise()', the
>>> next 'getsockopt()' will fail).
>>>
>>>                               DETAILS
>>>
>>> Here is how the mapping with mapped pages looks exactly: the first
>>> page contains information about the mapped data buffers. At zero
>>> offset the mapping contains a special data structure:
>>>
>>>     struct virtio_vsock_usr_hdr_pref {
>>>         u32 poll_value;
>>>         u32 hdr_num;
>>>     };
>>>
>>> This structure contains two fields:
>>> 'poll_value' - shows whether the current socket has data to read. When
>>> the socket's input queue is empty, 'poll_value' is set to 0 by the
>>> kernel. When the input queue has some data, 'poll_value' is set to 1
>>> by the kernel. When the socket is closed for data receive, 'poll_value'
>>> is ~0. This tells the user that "there will be no more data, continue
>>> to call 'getsockopt()' until you find 'hdr_num' == 0". The user spins
>>> on it in userspace, without calling the 'poll()' system call (of
>>> course, 'poll()' still works).
>>> 'hdr_num' - shows the number of mapped pages with data, which start
>>> from the second page of this mapping.
>>>
>>> NOTE:
>>> This version has two limitations:
>>>
>>> 1) One mapping per socket is supported. It is implemented by adding a
>>>    'struct page *' pointer to the 'struct virtio_vsock' structure (the
>>>    first page of the mapping, which contains
>>>    'virtio_vsock_usr_hdr_pref'). But I think support for multiple
>>>    mappings could be implemented by using something like a hash table
>>>    of such pages, or, more simply, by just using the first page of the
>>>    mapping as the headers page by default. Also, I think the number of
>>>    such pages could be controlled by 'setsockopt()'.
>>>
>>> 2) After the 'mmap()' call, it is impossible to call 'mmap()' again,
>>>    even after calling 'madvise()'/'munmap()' on the whole mapping.
>>>    This is because the socket can't handle 'munmap()' calls (as there
>>>    is no such callback in 'proto_ops'), thus the polling page exists
>>>    until the socket is closed.
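
To make the flow above more concrete, here is a rough userspace sketch of
the receive path. Only 'virtio_vsock_usr_hdr_pref' comes from the
description above; the 'getsockopt()' level/option names and values,
MAP_PAGES and the mapping flags are placeholders I made up for
illustration, not the constants used by the patches:

    /* Minimal sketch of the zerocopy receive flow described above.
     * SOL_VSOCK_ZC / VSOCK_ZC_RX are placeholder names and values.
     */
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    #define SOL_VSOCK_ZC    287     /* placeholder level   */
    #define VSOCK_ZC_RX     1       /* placeholder optname */
    #define MAP_PAGES       64      /* pages in the RX mapping */

    struct virtio_vsock_usr_hdr_pref {
            uint32_t poll_value;
            uint32_t hdr_num;
    };

    static int zc_rx_loop(int fd, long page_size)
    {
            size_t map_len = (size_t)MAP_PAGES * page_size;
            void *map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
            volatile struct virtio_vsock_usr_hdr_pref *pref = map;

            if (map == MAP_FAILED)
                    return -1;

            for (;;) {
                    /* Spin on the shared page instead of calling poll(). */
                    while (pref->poll_value == 0)
                            ;

                    /* Ask the kernel to insert RX buffer pages into the
                     * mapping; only the mapping address is passed.
                     */
                    socklen_t len = sizeof(map);
                    if (getsockopt(fd, SOL_VSOCK_ZC, VSOCK_ZC_RX, &map, &len))
                            break;

                    if (pref->hdr_num == 0) {
                            /* ~0 means the peer is done sending. */
                            if (pref->poll_value == (uint32_t)~0)
                                    break;
                            continue;
                    }

                    /* ... consume pages 1..hdr_num of the mapping ... */

                    /* Give the data pages back before the next call. */
                    madvise((char *)map + page_size, map_len - page_size,
                            MADV_DONTNEED);
            }

            munmap(map, map_len);
            return 0;
    }
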
>>>
>>> After the 'virtio_vsock_usr_hdr_pref' object, the first page contains
>>> an array of trimmed virtio vsock packet headers (each contains only
>>> the length of the data on the corresponding page and a 'flags' field):
>>>
>>>     struct virtio_vsock_usr_hdr {
>>>         uint32_t length;
>>>         uint32_t flags;
>>>     };
>>>
>>> The 'length' field lets the user know the exact size of the payload
>>> within each sequence of pages, and the 'flags' field allows processing
>>> of SOCK_SEQPACKET flags (such as message bounds or record bounds). All
>>> other pages are data pages from the virtio queue.
>>>
>>>  Page 0               Page 1      Page N
>>>
>>>  [ pref hdr0 .. hdrN ][ data ] .. [ data ]
>>>          |       |      ^           ^
>>>          |       |      |           |
>>>          |       *------|-----------*
>>>          |              |
>>>          *--------------*
>>>
>>> Of course, a single header could represent an array of pages (when the
>>> packet's buffer is bigger than one page). So here is an example of a
>>> detailed mapping layout for some set of packets. Let's consider that
>>> we have the following sequence of packets: 56 bytes, 4096 bytes and
>>> 8200 bytes. All pages: 0, 1, 2, 3, 4 and 5 will be inserted into the
>>> user's vma.
>>>
>>>     Page 0: [[ pref ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>>     Page 1: [ 56 ]
>>>     Page 2: [ 4096 ]
>>>     Page 3: [ 4096 ]
>>>     Page 4: [ 4096 ]
>>>     Page 5: [ 8 ]
>>>
>>> Page 0 contains only the array of headers:
>>> 'pref' is 'struct virtio_vsock_usr_hdr_pref'.
>>> 'hdr0' has 56 in its length field.
>>> 'hdr1' has 4096 in its length field.
>>> 'hdr2' has 8200 in its length field.
>>> 'hdr3' has 0 in its length field (this is the end-of-data marker).
>>>
>>> Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>> Page 2 corresponds to 'hdr1' and is filled with data.
>>> Page 3 corresponds to 'hdr2' and is filled with data.
>>> Page 4 corresponds to 'hdr2' and is filled with data.
>>> Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>
>>> 'pref' will be the following: poll_value = 1, hdr_num = 5.
>>>
>>> This patchset also changes the way packets are allocated: the current
>>> code uses only 'kmalloc()' to create the data buffer. A problem
>>> happens when we try to map such buffers into the user's vma - the
>>> kernel refuses to map slab pages into the user's vma (pages of "not
>>> large" 'kmalloc()' allocations have the PageSlab flag set, and "not
>>> large" could be bigger than one page). To avoid this, data buffers are
>>> now allocated using 'alloc_pages()'.
>>>
>>>                         DIFFERENCE WITH TCP
>>>
>>> As this feature uses the same approach as the TCP one, here are some
>>> differences between the two versions (from the user's POV):
>>>
>>> 1) For 'getsockopt()':
>>>    - This version passes only the address of the mapping.
>>>    - TCP passes a special structure to 'getsockopt()'. In addition to
>>>      the address of the mapping it contains 'length' and
>>>      'recv_skip_hint'. The first is the size of the data inside the
>>>      mapping (an out parameter, set by the kernel). The second is the
>>>      number of bytes that must still be dequeued with the 'read()'
>>>      syscall (also an out parameter).
>>> 2) Mapping structure:
>>>    - This version uses the first page of the mapping for metadata and
>>>      the rest of the pages for data.
>>>    - The TCP version uses the whole mapping for data only.
>>> 3) Data layout:
>>>    - This version inserts virtio buffers into the mapping, so each
>>>      buffer may be filled partially. To get the size of the payload in
>>>      every buffer, the first page of the mapping must be used (see 2).
>>>    - The TCP version inserts the pages of a single skb.
>>>
>>> *Please, correct me if I made some mistake in the TCP zerocopy
>>> description.
>>
>>
>> Thank you for the description. Do you think it would be possible to try
>> to do the same as TCP?
>> Especially now that we should support skbuff.
Yes, I'll rework my patchset to use skbuff.

>>
>>>
>>>                                TESTS
>>>
>>> This patchset updates the 'vsock_test' utility: two tests for the new
>>> feature were added. The first test covers invalid cases, the second
>>> checks a valid transmission case.
>>
>> Thank you, I really appreciate you adding new tests each time! Great
>> job!
>>
>>>
>>>                             BENCHMARKING
>>>
>>> For benchmarking I've created a small test utility, 'vsock_rx_perf'.
>>> It works in client/server mode. When the client connects to the
>>> server, the server starts sending the specified amount of data to the
>>> client (the size is set as an input argument). The client reads the
>>> data and waits for the next portion of it. On the client side, the
>>> dequeue logic works in three modes: copy, zerocopy and zerocopy with
>>> user polling.
>>
>> Cool, thanks for adding it in this series.
>>
>>>
>>> 1) In copy mode the client uses the 'read()' system call.
>>> 2) In zerocopy mode the client uses 'mmap()'/'getsockopt()' to dequeue
>>>    data and 'poll()' to wait for data.
>>> 3) In zerocopy mode + user polling the client uses
>>>    'mmap()'/'getsockopt()', but to wait for data it polls the shared
>>>    page (e.g. a busyloop).
>>>
>>> Here is the usage:
>>>   -c <cid>       Peer CID to connect to (if run in client mode).
>>>   -m <megabytes> Number of megabytes to send.
>>>   -b <bytes>     Size of RX/TX buffer (or mapping) in pages.
>>>   -r <bytes>     SO_RCVLOWAT value in bytes (if run in client mode).
>>>   -v <bytes>     Peer credit.
>>>   -s             Run as server.
>>>   -z [n|y|u]     Dequeue mode.
>>>                  n - copy mode, 1) above.
>>>                  y - zerocopy mode, 2) above.
>>>                  u - zerocopy mode + user poll, 3) above.
>>>
>>> The utility produces the following output:
>>> 1) In server mode it prints the number of seconds spent in the whole
>>>    tx loop.
>>> 2) In client mode it prints several values:
>>>    * Number of seconds spent in the whole rx loop (including 'poll()').
>>>    * Number of seconds spent in the dequeue system calls:
>>>      In case of '-z n' it will be the time in 'read()'.
>>>      In case of '-z y|u' it will be the time in 'getsockopt()' +
>>>      'madvise()'.
>>>    * Number of wake ups with the POLLIN flag set (except in '-z u'
>>>      mode).
>>>    * Average time (ns) of a single dequeue iteration (e.g. the second
>>>      value divided by the third).
>>>
>>> The idea of the test is to compare the zerocopy approach and the
>>> classic copy. It is clear that to dequeue some "small" amount of data,
>>> copy must be better, because a syscall with a 'memcpy()' of 1 byte
>>> (for example) is just nothing compared to two system calls, where the
>>> first must map at least one page, while the second unmaps it.
>>>
>>> The test was performed with the following settings:
>>> 1) The total size of the data to send is 2G (-m argument).
>>>
>>> 2) The peer's buffer size is changed to 2G (-v argument) - this is
>>>    needed to avoid stalls of the sender waiting for enough credit.
>>>
>>> 3) Both the buffer size (-b) and SO_RCVLOWAT (-r) are used to control
>>>    the number of bytes to dequeue in a single loop iteration. The
>>>    buffer size limits the max number of bytes to read, while
>>>    SO_RCVLOWAT won't let the user get too small a number of bytes.
>>>
>>> 4) For the sender, the tx buffer (which is passed to 'write()') size
>>>    is 16Mb. Of course we could set it to the peer's buffer size and,
>>>    as we are in STREAM mode, 'write()' would then be called only once.
>>>
>>> Designations here and below:
>>> H2G - host to guest transmission. Server is host, client is guest.
>>> G2H - guest to host transmission. Server is guest, client is host.
>>> C   - copy mode.
>>> ZC  - zerocopy mode.
>>> ZU  - zerocopy with user poll mode. This mode is removed from the test
>>>       at this moment, because I need to support SO_RCVLOWAT logic in
>>>       it.
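
By the way, to show what the zerocopy dequeue step ('-z y'/'-z u') does
with the mapping layout described earlier, here is a simplified sketch of
how the client could walk the header page and account the payload. Both
structures are the ones from the cover letter; the function itself is
illustrative only, not the exact vsock_rx_perf code:

    /* Simplified sketch of one zerocopy dequeue step: walk the header
     * array on page 0 and account the payload sitting in the data pages.
     */
    #include <stddef.h>
    #include <stdint.h>

    struct virtio_vsock_usr_hdr_pref {
            uint32_t poll_value;
            uint32_t hdr_num;
    };

    struct virtio_vsock_usr_hdr {
            uint32_t length;
            uint32_t flags;
    };

    static size_t account_mapping(void *map, size_t page_size)
    {
            struct virtio_vsock_usr_hdr_pref *pref = map;
            struct virtio_vsock_usr_hdr *hdr = (void *)(pref + 1);
            unsigned char *data = (unsigned char *)map + page_size;
            size_t total = 0;

            /* Headers start right after 'pref'; a zero length ends the
             * array. Each header may cover several consecutive data
             * pages (e.g. 8200 bytes -> 3 pages).
             */
            for (size_t i = 0; hdr[i].length != 0; i++) {
                    size_t pages = (hdr[i].length + page_size - 1) / page_size;

                    /* 'data' points at hdr[i].length bytes of payload;
                     * for SOCK_SEQPACKET, hdr[i].flags carries message
                     * bounds. Consume the payload here.
                     */
                    total += hdr[i].length;
                    data += pages * page_size;
            }

            (void)data; /* payload consumption elided in this sketch */
            return total;
    }

The end-of-data marker (a header with length 0) is what lets the loop stop
without knowing the number of headers in advance.
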
>>>
>>> So, the rows correspond to the dequeue mode, while the columns show
>>> the number of
>>
>> Maybe it would be better to label the rows, I guess the first one is C
>> and the second one ZC?

Oops, it seems something went wrong while sending the patch, here is the
valid table:

G2H:
           #0        #1        #2        #3        #4        #5
  *----*---------*---------*---------*---------*---------*---------*
  |    |         |         |         |         |         |         |
  |    |   4Kb   |   16Kb  |   64Kb  |  128Kb  |  256Kb  |  512Kb  |
  |    |         |         |         |         |         |         |
  *----*---------*---------*---------*---------*---------*---------*
  |    | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
#0| C  |   1.44  |   1.81  |   1.14  |   1.47  |   1.32  |   1.78  |
  |    |   7039  |  15074  |  34975  |  89938  |  162384 |  438278 |
  *----*---------*---------*---------*---------*---------*---------*
  |    |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
#1| ZC |   1.7   |   1.76  |   0.96  |   0.7   |   0.58  |   0.61  |
  |    |  13598  |  15821  |  29574  |  43265  |  71771  |  150927 |
  *----*---------*---------*---------*---------*---------*---------*

H2G:
           #0        #1        #2        #3        #4        #5
  *----*---------*---------*---------*---------*---------*---------*
  |    |         |         |         |         |         |         |
  |    |   4Kb   |   16Kb  |   64Kb  |  128Kb  |  256Kb  |  512Kb  |
  |    |         |         |         |         |         |         |
  *----*---------*---------*---------*---------*---------*---------*
  |    | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
#0| C  |   3.3   |   4.3   |   2.37  |   2.33  |   2.42  |   2.75  |
  |    |  17145  |  24172  |  72650  |  143496 |  295960 |  674146 |
  *----*---------*---------*---------*---------*---------*---------*
  |    |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
#1| ZC |   3.5   |   3.38  |   2.32  |   1.75  |   1.25  |   0.98  |
  |    |  41855  |  46339  |  71988  |  106000 |  153064 |  242036 |
  *----*---------*---------*---------*---------*---------*---------*

The lines with "#0" and "#1" are missing there! I don't know why, sorry :)

>>
>> Maybe it would be better to report Gbps so if we change the amount of
>> data exchanged, we always have a way to compare.
>>

Ok

>>> bytes
>>> to dequeue in each mode. Each cell contains several values in the
>>> following format:
>>>
>>>     *------------*
>>>     |   A / B    |
>>>     |     C      |
>>>     |     D      |
>>>     *------------*
>>>
>>> A - number of seconds which the server spent in the tx loop.
>>> B - number of seconds which the client spent in the rx loop.
>>> C - number of seconds which the client spent in the rx loop, but
>>>     excluding the 'poll()' system call (e.g. only in the dequeue
>>>     system calls).
>>> D - average number of ns for each POLLIN wake up (in other words, it
>>>     is the average value for C).
>>
>> I see only 3 values in the cells, I missed which one is C and which one
>> is D.
Yep, the correct version of the table is above.

>>
>>>
>>> G2H:
>>>
>>>          #0        #1        #2        #3        #4        #5
>>> *----*---------*---------*---------*---------*---------*---------*
>>> |    |         |         |         |         |         |         |
>>> |    |   4Kb   |   16Kb  |   64Kb  |  128Kb  |  256Kb  |  512Kb  |
>>> |    |         |         |         |         |         |         |
>>> *----*---------*---------*---------*---------*---------*---------*
>>> |    | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
>>> |    |   7039  |  15074  |  34975  |  89938  |  162384 |  438278 |
>>> *----*---------*---------*---------*---------*---------*---------*
>>> |    |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
>>> |    |  13598  |  15821  |  29574  |  43265  |  71771  |  150927 |
>>> *----*---------*---------*---------*---------*---------*---------*
>>>
>>> H2G:
>>>
>>>          #0        #1        #2        #3        #4        #5
>>> *----*---------*---------*---------*---------*---------*---------*
>>> |    |         |         |         |         |         |         |
>>> |    |   4Kb   |   16Kb  |   64Kb  |  128Kb  |  256Kb  |  512Kb  |
>>> |    |         |         |         |         |         |         |
>>> *----*---------*---------*---------*---------*---------*---------*
>>> |    | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
>>> |    |  17145  |  24172  |  72650  |  143496 |  295960 |  674146 |
>>> *----*---------*---------*---------*---------*---------*---------*
>>> |    |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
>>> |    |  41855  |  46339  |  71988  |  106000 |  153064 |  242036 |
>>> *----*---------*---------*---------*---------*---------*---------*
>>>
>>> Here are my thoughts about these numbers (the most obvious ones):
>>>
>>> 1) Let's check the C and D values. We see that zerocopy dequeue is
>>>    faster on big buffers (in G2H it starts from 64Kb, in H2G - from
>>>    128Kb). I think this is the main result of this test (at this
>>>    moment), as it shows the performance difference between copy and
>>>    zerocopy.
>>
>> Yes, I think this is expected.
>>
>>>
>>> 2) In G2H mode both server and client spend almost the same time in
>>>    the rx/tx loops (see A / B in the G2H table) - it looks good. In
>>>    H2G mode, there is a significant difference between server and
>>>    client. I think there are some side effects which produce this
>>>    (I continue to analyze it).
>>
>> Perhaps a different cost to notify the receiver? I think it's better to
>> talk about transmitter and receiver, instead of server and client, I
>> think it's confusing.

Ok

>>
>>>
>>> 3) Let's check the C value. We can see that G2H is always faster than
>>>    H2G, in both copy and zerocopy mode.
>>
>> This is expected because the guest queues buffers up to 64K entirely,
>> while the host has to split packets into the guest's preallocated
>> buffers, which are 4K.
>>
>>>
>>> 4) Another interesting thing can be seen, for example, in the H2G
>>>    table, row #0, col #4 (the 256Kb case). The number of seconds in
>>>    zerocopy mode is smaller than in copy mode (1.25 vs 2.42), but the
>>>    whole rx loop was
>>
>> I see 1.65 vs 1.10, are these the same data, or am I looking at it
>> wrong?

Aha, please see the valid version of the table above, my fault.

>>
>>>    faster in copy mode (5 seconds vs 5.35 seconds). E.g. if we account
>>>    for the time spent in 'poll()', copy mode looks faster (even though
>>>    it spends more time in 'read()' than the zerocopy loop spends in
>>>    'getsockopt()' + 'madvise()'). I think this is also a non-obvious
>>>    effect.
>>>
>>> So, according to 1), it is better to use zerocopy if you need to
>>> process big buffers with few rx waits (for example, it may be a video
>>> stream). In other cases it is better to use the classic copy way, as
>>> it is more lightweight.
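
Regarding reporting Gbps: until vsock_rx_perf prints it directly, the
numbers above can be converted by hand. Taking the 2G of payload as
roughly 16 Gbit, throughput is about 16 / B Gbit/s, where B is the
receiver loop time from the tables. For example, the G2H 512Kb column
gives roughly 16 / 3.35 ~= 4.8 Gbit/s for copy and 16 / 2.46 ~= 6.5 Gbit/s
for zerocopy.
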
>>>
>>> All tests were performed on an x86 board with a 4-core Celeron N2930
>>> CPU (of course it is not a mainframe, but it is better than testing
>>> with a nested guest) and 8Gb of RAM.
>>>
>>> Anyway, this is not the final version, and I will continue to improve
>>> both the kernel logic and the performance tests.
>>
>> Great work so far!
>>
>> Maybe to avoid having to rebase everything later, it's already
>> worthwhile to start using Bobby's patch with skbuff.
>>
>>>
>>>                             SUGGESTIONS
>>>
>>> 1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. Maybe I
>>>    can merge both patchsets into a single one?
>>
>> This is already very big, so I don't know if it's worth breaking into a
>> preparation series and then a series that adds both.

Hm, ok, I think I can send both series separately (MSG_ZEROCOPY is
smaller). It will be simpler to test and review. Whichever of the two is
ready to be merged first - the second will be rebased and retested on top
of it.

>
> For example, some test patches not related to zerocopy could go
> separately. Maybe even vsock_rx_perf without the zerocopy part that is
> not definitive for now.

Ok, I think I can send the patch which updates the current tests as a
single one - it will be easy to review and merge. Also vsock_rx_perf: I
can prepare it as a dedicated patch. Of course, without zerocopy it has
the same functionality as iperf, but I think it will be good to have an
independent small tool which implements both rx and tx zerocopy support
(vsock_rx_perf will be renamed to vsock_perf, for example). It is a small
tool (compared to iperf), targeted at vsock only and located in the
kernel source tree.

>
> Too big a set is always scary, even if this one is divided well, but
> some patches as mentioned could go separately.
>
> I left some comments, but as said I prefer to review it after the
> rebase with skbuff, because I think it changes enough. I'm sorry about
> that, but having the skbuffs I think is very important.

Sure, no problem. Of course skbuff is the correct way! Thanks for the
review!

>
> Thanks,
> Stefano
>