On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
> Date: Fri, 3 Jun 2022 05:27:56 +0000
> From: Arseniy Krasnov <AVKrasnov@xxxxxxxxxxxxxx>
> Subject: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
>
>                               INTRODUCTION
>
>       Hello, this is an experimental implementation of virtio vsock zerocopy
> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API
> uses the same idea: call 'mmap()' on the socket's descriptor, then every
> 'getsockopt()' will fill the provided vma area with pages of virtio RX
> buffers. After the received data has been processed by the user, the pages
> must be freed by a 'madvise()' call with the MADV_DONTNEED flag set (if the
> user doesn't call 'madvise()', the next 'getsockopt()' will fail).
>
>                                  DETAILS
>
>       Here is how the mapping with mapped pages looks exactly: the first page
> of the mapping contains an array of trimmed virtio vsock packet headers (each
> contains only the length of data on the corresponding page and a 'flags'
> field):
>
>       struct virtio_vsock_usr_hdr {
>               uint32_t length;
>               uint32_t flags;
>               uint32_t copy_len;
>       };
>
> The 'length' field lets the user know the exact size of the payload within
> each sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags
> (such as message bounds or record bounds). The 'copy_len' field is described
> below in the 'v1 -> v2' part. All other pages are data pages from the RX
> queue.
>
>              Page 0         Page 1        Page N
>
>       [ hdr1 .. hdrN ][    data    ] .. [    data    ]
>          |       |          ^                  ^
>          |       |          |                  |
>          |       *----------|------------------*
>          |                  |
>          *------------------*
>
> Of course, a single header could represent an array of pages (when the
> packet's buffer is bigger than one page). So here is an example of the
> detailed mapping layout for some set of packets. Let's consider that we have
> the following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. Pages
> 0, 1, 2, 3, 4 and 5 will all be inserted into the user's vma (the vma is
> large enough).
>
> Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]

Hi Arseniy, what about adding a `header` for `virtio_vsock_usr_hdr` in Page 0?
Page 0 could look like this:

Page 0: [[ header for hdrs ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]

We could store the number of headers/pages in this first general header:

struct virtio_vsock_general_hdr {
	uint32_t usr_hdr_num;
};

This usr_hdr_num represents how many pages are used here. At most 256 pages
will be used here, and this kind of statistical information is useful.

> Page 1: [ 56 ]
> Page 2: [ 4096 ]
> Page 3: [ 4096 ]
> Page 4: [ 4096 ]
> Page 5: [ 8 ]
>
> Page 0 contains only the array of headers:
> 'hdr0' has 56 in its length field.
> 'hdr1' has 4096 in its length field.
> 'hdr2' has 8200 in its length field.
> 'hdr3' has 0 in its length field (this is the end of data marker).
>
> Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
> Page 2 corresponds to 'hdr1' and is filled with data.
> Page 3 corresponds to 'hdr2' and is filled with data.
> Page 4 corresponds to 'hdr2' and is filled with data.
> Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
>       This patchset also changes the way packets are allocated: today's
> implementation uses only 'kmalloc()' to create the data buffer. A problem
> shows up when we try to map such buffers into the user's vma - the kernel
> forbids mapping slab pages into a user's vma (pages of "not large" 'kmalloc()'
> allocations are marked with the PageSlab flag, and "not large" could be bigger
> than one page). To avoid this, data buffers are now allocated with the
> 'alloc_pages()' call.
>
>                                   TESTS
>
>       This patchset updates the 'vsock_test' utility: two tests for the new
> feature were added. The first test covers invalid cases. The second checks the
> valid transmission case.
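
By the way, to make sure I understand the usage flow end to end, here is a
minimal userspace sketch of the receive path as I read the description above.
The 'SO_VM_SOCKETS_ZEROCOPY' / 'SO_VM_SOCKETS_MAP_RX' names are taken from the
v1 -> v2 changelog below; their numeric values, the option level and the exact
'getsockopt()' contract (how the mapped address is passed) are my assumptions,
not the final ABI:

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

struct virtio_vsock_usr_hdr {		/* mirrors the struct quoted above */
	uint32_t length;
	uint32_t flags;
	uint32_t copy_len;
};

#define RX_PAGE_SIZE	4096UL
#define RX_MAP_PAGES	32		/* arbitrary mapping size for the sketch */
#define RX_MAP_LEN	(RX_MAP_PAGES * RX_PAGE_SIZE)

#ifndef AF_VSOCK
#define AF_VSOCK	40
#endif
/* Placeholder values, only so that the sketch compiles; the real ones would
 * come from the patched uapi/linux/vm_sockets.h. */
#ifndef SO_VM_SOCKETS_ZEROCOPY
#define SO_VM_SOCKETS_ZEROCOPY	12
#endif
#ifndef SO_VM_SOCKETS_MAP_RX
#define SO_VM_SOCKETS_MAP_RX	13
#endif

static int rx_zerocopy_once(int fd)
{
	struct virtio_vsock_usr_hdr *hdr;
	socklen_t len = RX_MAP_LEN;
	char *data;
	void *map;
	int one = 1;

	/* 1) Enable zerocopy receive mode (disabled by default in v2). */
	if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY, &one, sizeof(one)))
		return -1;

	/* 2) Reserve a vm area that will be populated with RX buffer pages. */
	map = mmap(NULL, RX_MAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;

	/* 3) Ask the kernel to insert the RX pages into the mapping
	 *    (assumption: the mapped address is passed as optval). */
	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX, map, &len))
		goto out;

	/* 4) Page 0 holds the header array; data pages follow it. */
	hdr = map;
	data = (char *)map + RX_PAGE_SIZE;
	for (; hdr->length; hdr++) {
		uint32_t pages = (hdr->length + RX_PAGE_SIZE - 1) / RX_PAGE_SIZE;

		printf("packet: %u bytes, flags 0x%x, at %p\n",
		       hdr->length, hdr->flags, (void *)data);
		/* ... consume hdr->length bytes starting at 'data' ...
		 * ('copy_len' fallback is not handled here, see below) */
		data += pages * RX_PAGE_SIZE;
	}

out:
	/* 5) Return the pages, otherwise the next getsockopt() will fail. */
	madvise(map, RX_MAP_LEN, MADV_DONTNEED);
	munmap(map, RX_MAP_LEN);
	return 0;
}

If I read the cover letter correctly, the same mapping is reused for every
'getsockopt()' call, and the 'madvise(MADV_DONTNEED)' step is what makes the
next call possible.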
>
>                               BENCHMARKING
>
>       For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
> client/server mode. When the client connects to the server, the server starts
> sending an exact amount of data to the client (the amount is set as an input
> argument). The client reads the data and waits for the next portion of it. The
> client works in two modes: copy and zero-copy. In copy mode the client uses
> the 'read()' call, while in zerocopy mode the sequence of 'mmap()' /
> 'getsockopt()' / 'madvise()' is used. A smaller transmission time is better.
> For the server we can set the size of the tx buffer, and for the client we can
> set the size of the rx buffer or the rx mapping size (in zerocopy mode). Usage
> of this utility is quite simple:
>
> For client mode:
>
> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>
> For server mode:
>
> ./rx_zerocopy --mode server [--mb] [--tx]
>
> [--mb] sets the number of megabytes to transfer.
> [--rx] sets the size of the receive buffer/mapping in pages.
> [--tx] sets the size of the transmit buffer in pages.
>
> I checked the transmission of 4000 MB of data. Here are some results:
>
>                         size of rx/tx buffers in pages
>                *--------*---------*---------*----------*----------*
>                |    8   |   32    |    64   |   256    |   512    |
> *--------------*--------*---------*---------*----------*----------*
> |   zerocopy   |   24   |  10.6   |  12.2   |   23.6   |    21    | secs to
> *--------------*--------*---------*---------*----------*----------* process
> | non-zerocopy |   13   |  16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
> *--------------*--------*---------*---------*----------*----------*
>
> The result in the first column (where non-zerocopy works better than zerocopy)
> happens because the time spent in the 'read()' system call is smaller than the
> time spent in 'getsockopt()' + 'madvise()'. I've checked that.
>
> I think the results are not so impressive, but at least zerocopy is not worse
> than copy mode and there is no need to allocate memory for processing the
> data.
>
>                                  PROBLEMS
>
>       The updated packet allocation logic creates a problem: when the host
> gets data from the guest (in vhost-vsock), it allocates at least one page for
> each packet (even if the packet has a 1 byte payload). I think this could be
> resolved in several ways:
>    1) Make zerocopy rx mode disabled by default, so if the user didn't enable
>       it, the current 'kmalloc()' way will be used. <<<<<<< (IMPLEMENTED IN V2)
>    2) Use 'kmalloc()' for "small" packets, else call the page allocator. But
>       in this case we have a mix of packets allocated in two different ways,
>       so during zerocopying to the user (e.g. mapping pages into the vma) such
>       small packets will be handled in a clumsy way: we need to allocate one
>       page for the user, copy the data to it and then insert the page into the
>       user's vma.
>
> v1 -> v2:
>    1) Zerocopy receive mode can be enabled/disabled (disabled by default). I
>       didn't use the generic SO_ZEROCOPY flag, because in the virtio-vsock
>       case this feature depends on transport support. Instead of SO_ZEROCOPY,
>       an AF_VSOCK layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while the
>       previous meaning of SO_VM_SOCKETS_ZEROCOPY (insert receive buffers into
>       the user's vm area) is now renamed to SO_VM_SOCKETS_MAP_RX.
>    2) The packet header which is exported to the user now gets a new field:
>       'copy_len'. This field handles a special case: the user reads data from
>       the socket in the non-zerocopy way (with zerocopy disabled) and then
>       enables the zerocopy feature. In this case the vhost part will switch
>       the data buffer allocation logic from 'kmalloc()' to direct calls to
>       the buddy allocator.
>       But there could be some pending 'kmalloc()'-allocated packets in the
>       socket's rx list, and when the user tries to read such packets in the
>       zerocopy way, the dequeue will fail, because SLAB pages can not be
>       inserted into the user's vm area. So when such a packet is found during
>       zerocopy dequeue, the dequeue loop will break and 'copy_len' will show
>       the size of this "bad" packet. After the user detects this case, it
>       must use 'read()/recv()' calls to dequeue that packet.
>    3) Also, maybe move this feature under a config option?
>
> Arseniy Krasnov (8):
>   virtio/vsock: rework packet allocation logic
>   vhost/vsock: rework packet allocation logic
>   af_vsock: add zerocopy receive logic
>   virtio/vsock: add transport zerocopy callback
>   vhost/vsock: enable zerocopy callback
>   virtio/vsock: enable zerocopy callback
>   test/vsock: add receive zerocopy tests
>   test/vsock: vsock rx zerocopy utility
>
>  drivers/vhost/vsock.c                   | 121 +++++++++--
>  include/linux/virtio_vsock.h            |   5 +
>  include/net/af_vsock.h                  |   7 +
>  include/uapi/linux/virtio_vsock.h       |   6 +
>  include/uapi/linux/vm_sockets.h         |   3 +
>  net/vmw_vsock/af_vsock.c                | 100 +++++++++
>  net/vmw_vsock/virtio_transport.c        |  51 ++++-
>  net/vmw_vsock/virtio_transport_common.c | 211 ++++++++++++++++++-
>  tools/include/uapi/linux/virtio_vsock.h |  11 +
>  tools/include/uapi/linux/vm_sockets.h   |   8 +
>  tools/testing/vsock/Makefile            |   1 +
>  tools/testing/vsock/control.c           |  34 +++
>  tools/testing/vsock/control.h           |   2 +
>  tools/testing/vsock/rx_zerocopy.c       | 356 ++++++++++++++++++++++++++++++++
>  tools/testing/vsock/vsock_test.c        | 295 ++++++++++++++++++++++++++
>  15 files changed, 1196 insertions(+), 15 deletions(-)
>
> --
> 2.25.1
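
One more question, about the 'copy_len' fallback described in the changelog: is
the expected userspace handling roughly the sketch below? This is only my
reading of the description; the convention that a header with 'length == 0' and
a non-zero 'copy_len' terminates the mapped sequence is an assumption on my
part:

#include <stdint.h>
#include <unistd.h>

struct virtio_vsock_usr_hdr {		/* same layout as in the cover letter */
	uint32_t length;
	uint32_t flags;
	uint32_t copy_len;
};

/* Walk the header array in page 0 of the mapping; a non-zero 'copy_len' after
 * the last mapped packet marks a pending kmalloc()'ed packet that stopped the
 * zerocopy dequeue, so it has to be drained with a plain read(). */
static int drain_unmappable(int fd, const struct virtio_vsock_usr_hdr *hdr)
{
	char buf[4096];
	size_t left;

	/* Skip the headers of successfully mapped packets. */
	while (hdr->length)
		hdr++;

	for (left = hdr->copy_len; left; ) {
		ssize_t n = read(fd, buf,
				 left < sizeof(buf) ? left : sizeof(buf));

		if (n <= 0)
			return -1;
		left -= (size_t)n;
	}
	return 0;
}

If that matches the intended flow, a short note in the cover letter about how
'length' and 'copy_len' interact in the header array would be helpful.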