On 24.05.2022 10:32, Stefano Garzarella wrote:
> On Fri, May 20, 2022 at 11:09:11AM +0000, Arseniy Krasnov wrote:
>> Hello Stefano,
>>
>> On 19.05.2022 10:42, Stefano Garzarella wrote:
>>> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>>>> Hello Stefano,
>>>>
>>>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>>>> Hi Arseniy,
>>>>>
>>>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>>>                              INTRODUCTION
>>>>>>
>>>>>>     Hello, this is an experimental implementation of virtio vsock zerocopy
>>>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>>>> the same idea: call 'mmap()' on the socket's descriptor, then every 'getsockopt()'
>>>>>> will fill the provided vma area with pages of virtio RX buffers. After the
>>>>>> received data has been processed by the user, the pages must be freed by an
>>>>>> 'madvise()' call with the MADV_DONTNEED flag set (if the user doesn't call
>>>>>> 'madvise()', the next 'getsockopt()' will fail).
>>>>>
>>>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>>>
>>>> Yes, that would be great.
>>>>
>>>>> Could we do something similar for the sending path as well?
>>>>
>>>> Here are my thoughts about zerocopy transmission:
>>>>
>>>> I tried to implement this feature in the following way: the user creates a
>>>> page-aligned buffer, and during tx packet allocation, instead of creating the
>>>> data buffer with 'kmalloc()', I tried to add the user's buffer to the virtio
>>>> queue. But I found a problem: the kernel virtio API uses virtual addresses to
>>>> add new buffers, and deep in the virtio subsystem 'virt_to_phys()' is called
>>>> to get the physical address of the buffer, so the user's virtual address
>>>> won't be translated correctly to a physical address. (In theory, I could
>>>> perform a page walk for such a user va, get the physical address and pass
>>>> some "fake" virtual address to the virtio API in order to make
>>>> 'virt_to_phys()' return a valid physical address, but I think this is ugly.)
>>>
>>> And maybe we should also pin the pages to prevent them from being replaced.
>>>
>>> I think we should do something similar to what we do in vhost-vdpa.
>>> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c
>>
>> Hm, ok. I'll read about vdpa...
>>
>>>> If we are talking about the 'mmap()' way, I think we can do the following:
>>>> the user calls 'mmap()' on the socket and the kernel fills the newly created
>>>> mapping with allocated pages (all pages have rw permissions). Now the user
>>>> can use the pages of this mapping (e.g. fill them with data). Finally, to
>>>> start transmission, the user calls 'getsockopt()' or some 'ioctl()' and the
>>>> kernel processes the data of this mapping. Also, as this call will return
>>>> immediately (i.e. it is asynchronous), some completion logic must be
>>>> implemented. For example, use the same way as MSG_ZEROCOPY does - poll the
>>>> socket's error queue to get a message that the pages can be reused - or
>>>> don't allow the user to work with these pages: unmap them, perform the
>>>> transmission and finally free the pages. To start a new transmission the
>>>> user needs to call 'mmap()' again.
>>>>
>>>> OR
>>>>
>>>> I think there is another, unusual way for zerocopy tx: let's use
>>>> 'vmsplice()'/'splice()'. In this approach, to transmit something, the user
>>>> does the following steps:
>>>> 1) Creates a pipe.
>>>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, inserting the data pages
>>>>    into it. SPLICE_F_GIFT allows the user to forget about the allocated
>>>>    pages - the kernel will free them.
>>>> 3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket.
>>>>    SPLICE_F_MOVE will move the pages from the pipe to the socket (i.e. in a
>>>>    special socket callback we get the set of the pipe's pages as an input
>>>>    argument and all pages will be inserted into the virtio queue).
>>>>
>>>> But as SPLICE_F_MOVE support is disabled, it must be repaired first.
>>>
>>> Splice seems interesting, but it would be nice if we do something similar
>>> to TCP. IIUC they use a flag for send(2):
>>>
>>>    send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
>>>
>> Yes, but in this case I think:
>> 1) What is 'buf'? It can't be the user's address, since this buffer must be
>>    inserted into the tx queue, i.e. it must be allocated by the kernel and
>>    then returned to the user for tx purposes. In the TCP case, 'buf' is the
>>    user's address (page aligned, of course) because the TCP logic uses
>>    sk_buff, which allows such memory to be used as a data buffer.
>
> IIUC we can pin that buffer like we do in vhost-vdpa, and use it in the VQ.

Now I see how to use 'pin_user_pages()'. I'm going to implement zerocopy tx with
the same API as TCP MSG_ZEROCOPY.

>
>> 2) To wait until the tx process is done (i.e. the pages can be used again),
>>    such an API (send + MSG_ZEROCOPY) uses the socket's error queue - poll for
>>    events saying that tx is finished. So the same mechanism must be
>>    implemented for virtio vsock.
>
> Yeah, I think so.
>
>>>>>>                                 DETAILS
>>>>>>
>>>>>>     Here is how the mapping with mapped pages looks exactly: the first page
>>>>>> of the mapping contains an array of trimmed virtio vsock packet headers (it
>>>>>> contains only the length of data on the corresponding page and a 'flags'
>>>>>> field):
>>>>>>
>>>>>>     struct virtio_vsock_usr_hdr {
>>>>>>         uint32_t length;
>>>>>>         uint32_t flags;
>>>>>>     };
>>>>>>
>>>>>> The 'length' field allows the user to know the exact size of the payload
>>>>>> within each sequence of pages, and 'flags' allows the user to handle
>>>>>> SOCK_SEQPACKET flags (such as message bounds or record bounds). All other
>>>>>> pages are data pages from the RX queue.
>>>>>>
>>>>>>             Page 0      Page 1      Page N
>>>>>>
>>>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>>>          |       |      ^           ^
>>>>>>          |       |      |           |
>>>>>>          |       *------|-----------*
>>>>>>          |              |
>>>>>>          |              |
>>>>>>          *--------------*
>>>>>>
>>>>>>     Of course, a single header could represent an array of pages (when the
>>>>>> packet's buffer is bigger than one page). So here is an example of a detailed
>>>>>> mapping layout for some set of packets. Let's consider that we have the
>>>>>> following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All pages
>>>>>> 0, 1, 2, 3, 4 and 5 will be inserted into the user's vma (the vma is large
>>>>>> enough).
>>>>>>
>>>>>>     Page 0: [[ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
>>>>>>     Page 1: [ 56 ]
>>>>>>     Page 2: [ 4096 ]
>>>>>>     Page 3: [ 4096 ]
>>>>>>     Page 4: [ 4096 ]
>>>>>>     Page 5: [ 8 ]
>>>>>>
>>>>>>     Page 0 contains only the array of headers:
>>>>>>     'hdr0' has 56 in its length field.
>>>>>>     'hdr1' has 4096 in its length field.
>>>>>>     'hdr2' has 8200 in its length field.
>>>>>>     'hdr3' has 0 in its length field (this is the end-of-data marker).
>>>>>>
>>>>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>>>>     Page 2 corresponds to 'hdr1' and is filled with data.
>>>>>>     Page 3 corresponds to 'hdr2' and is filled with data.
>>>>>>     Page 4 corresponds to 'hdr2' and is filled with data.
>>>>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>>>>
>>>>>>     This patchset also changes the packet allocation: the current
>>>>>> implementation uses only 'kmalloc()' to create the data buffer.
>>>>>> A problem happens when we try to map such buffers into the user's vma - the
>>>>>> kernel forbids mapping slab pages into a user's vma (as pages of "not large"
>>>>>> 'kmalloc()' allocations are marked with the PageSlab flag, and "not large"
>>>>>> could be bigger than one page). So to avoid this, data buffers are now
>>>>>> allocated using the 'alloc_pages()' call.
>>>>>>
>>>>>>                                  TESTS
>>>>>>
>>>>>>     This patchset updates the 'vsock_test' utility: two tests for the new
>>>>>> feature were added. The first test covers invalid cases. The second checks a
>>>>>> valid transmission case.
>>>>>
>>>>> Thanks for adding the test!
>>>>>
>>>>>>                              BENCHMARKING
>>>>>>
>>>>>>     For benchmarking I've added a small utility 'rx_zerocopy'. It works in
>>>>>> client/server mode. When the client connects to the server, the server starts
>>>>>> sending an exact amount of data to the client (the amount is set as an input
>>>>>> argument). The client reads the data and waits for the next portion of it.
>>>>>> The client works in two modes: copy and zero-copy. In copy mode the client
>>>>>> uses the 'read()' call, while in zerocopy mode a sequence of
>>>>>> 'mmap()'/'getsockopt()'/'madvise()' is used. A smaller transmission time is
>>>>>> better. For the server we can set the size of the tx buffer, and for the
>>>>>> client we can set the size of the rx buffer or the rx mapping size (in
>>>>>> zerocopy mode). Usage of this utility is quite simple:
>>>>>>
>>>>>> For client mode:
>>>>>>
>>>>>>     ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>>>
>>>>>> For server mode:
>>>>>>
>>>>>>     ./rx_zerocopy --mode server [--mb] [--tx]
>>>>>>
>>>>>> [--mb] sets the number of megabytes to transfer.
>>>>>> [--rx] sets the size of the receive buffer/mapping in pages.
>>>>>> [--tx] sets the size of the transmit buffer in pages.
>>>>>>
>>>>>> I checked the transmission of 4000 MB of data. Here are some results:
>>>>>>
>>>>>>                          size of rx/tx buffers in pages
>>>>>>                *--------*----------*---------*----------*----------*
>>>>>>                |    8   |    32    |    64   |   256    |   512    |
>>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>>>>> *--------------*--------*----------*---------*----------*----------* process
>>>>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 MB
>>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>>
>>>>>> I think the results are not so impressive, but at least they are not worse
>>>>>> than copy mode, and there is no need to allocate memory for processing the
>>>>>> data.
>>>>>
>>>>> Why is it twice as slow in the first column?
>>>>
>>>> Maybe this is because memory copying for small buffers is very fast... I'll
>>>> analyze it more deeply.
>>>
>>> Maybe I misunderstood: by small buffers here, what do you mean?
>>>
>>> I thought 8 was the number of pages, so 32KB buffers.
>>
>> Yes, 8 is the size in pages. Anyway, I need to check it more deeply.
>
> Okay, thanks!
>
> Stefano
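
---

For readers following the receive flow described in the cover letter, here is a
minimal userspace sketch of it. Only the mmap()/getsockopt()/madvise(MADV_DONTNEED)
sequence and the virtio_vsock_usr_hdr layout come from the description above; the
getsockopt() level/optname, the mapping size, and the way the mapping is handed to
getsockopt() are placeholders, since the exact calling convention is defined by the
patches themselves.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    struct virtio_vsock_usr_hdr {           /* layout from the cover letter */
            uint32_t length;
            uint32_t flags;
    };

    #define RX_MAPPING_PAGES        32      /* placeholder mapping size */
    #define PAGE_SZ                 4096

    /* Placeholders: the real level/optname come from the patchset. */
    #define SOL_VSOCK_PLACEHOLDER           0
    #define VSOCK_ZEROCOPY_RX_PLACEHOLDER   0

    static int zerocopy_recv_once(int fd)
    {
            size_t map_len = (size_t)RX_MAPPING_PAGES * PAGE_SZ;
            struct virtio_vsock_usr_hdr *hdr;
            socklen_t optlen;
            char *data;
            void *map;
            int ret = 0;

            /* Map the receive area over the socket descriptor. */
            map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED)
                    return -1;

            /* Assumption: the mapping is passed to getsockopt() as optval. */
            optlen = (socklen_t)map_len;
            if (getsockopt(fd, SOL_VSOCK_PLACEHOLDER, VSOCK_ZEROCOPY_RX_PLACEHOLDER,
                           map, &optlen) < 0) {
                    ret = -1;
                    goto out;
            }

            /* Page 0 holds the trimmed headers, data pages follow; a header
             * with length 0 is the end-of-data marker.
             */
            hdr = map;
            data = (char *)map + PAGE_SZ;
            while ((char *)hdr < (char *)map + PAGE_SZ && hdr->length != 0) {
                    printf("chunk: %u bytes, flags 0x%x, first byte 0x%02x\n",
                           hdr->length, hdr->flags, (unsigned char)data[0]);
                    /* A single header may cover several data pages. */
                    data += ((hdr->length + PAGE_SZ - 1) / PAGE_SZ) * PAGE_SZ;
                    hdr++;
            }

    out:
            /* Give the pages back so the next getsockopt() does not fail. */
            madvise(map, map_len, MADV_DONTNEED);
            munmap(map, map_len);
            return ret;
    }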
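The thread settles on reusing the TCP MSG_ZEROCOPY API for the tx side. For
reference, this is the existing pattern (see Documentation/networking/msg_zerocopy.rst):
opt in with SO_ZEROCOPY, send with MSG_ZEROCOPY, then reap completions from the
socket's error queue. The sketch below shows that generic flow; for AF_VSOCK the
exact cmsg level/type of the completion would be defined by the future patchset.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/errqueue.h>

    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY     60
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY    0x4000000
    #endif

    static int zerocopy_send(int fd, const void *buf, size_t len)
    {
            int one = 1;
            char control[128];
            struct msghdr msg = {
                    .msg_control = control,
                    .msg_controllen = sizeof(control),
            };
            struct cmsghdr *cm;

            /* Opt in once per socket. */
            if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
                    return -1;

            /* The pages of 'buf' are pinned and must not be reused until the
             * completion notification arrives.
             */
            if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
                    return -1;

            /* Reap the completion from the error queue. Real code would
             * poll() for POLLERR and retry on EAGAIN instead of calling
             * recvmsg() immediately.
             */
            if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                    return -1;

            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                    struct sock_extended_err serr;

                    memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
                    if (serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                            printf("completed notifications %u..%u\n",
                                   serr.ee_info, serr.ee_data);
            }
            return 0;
    }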
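On the kernel side, the pinning Stefano suggests could look roughly like the sketch
below, in the spirit of vhost_vdpa_pa_map(): pin the user's tx buffer so its pages
cannot be migrated or reclaimed while descriptors referencing them sit in the
virtqueue. The function name, flags and error handling here are illustrative only,
not taken from the patchset.

    #include <linux/mm.h>
    #include <linux/slab.h>

    static struct page **pin_tx_buffer(unsigned long uaddr, size_t len,
                                       long *npinned)
    {
            long nr_pages = PAGE_ALIGN(len + offset_in_page(uaddr)) >> PAGE_SHIFT;
            struct page **pages;
            long got;

            pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
            if (!pages)
                    return NULL;

            /* Pin the user pages; unpin with unpin_user_pages() once the
             * device has consumed the buffers.
             */
            got = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
            if (got != nr_pages) {
                    if (got > 0)
                            unpin_user_pages(pages, got);
                    kvfree(pages);
                    return NULL;
            }

            *npinned = got;
            return pages;
    }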