Re: [RFC PATCH v2 00/15] vsock: MSG_ZEROCOPY flag support

Stefano Garzarella <sgarzare@xxxxxxxxxx> · Wed, 3 May 2023 15:54:48 +0200

On Wed, May 3, 2023 at 3:50 PM Arseniy Krasnov <avkrasnov@xxxxxxxxxxxxxx> wrote:
>
>
>
> On 03.05.2023 16:47, Stefano Garzarella wrote:
> > On Wed, May 03, 2023 at 04:11:59PM +0300, Arseniy Krasnov wrote:
> >>
> >>
> >> On 03.05.2023 15:52, Stefano Garzarella wrote:
> >>> Hi Arseniy,
> >>> Sorry for the delay, but I have been very busy.
> >>
> >> Hello, no problem!
> >>
> >>>
> >>> I can't apply this series on master or net-next, can you share with me
> >>> the base commit?
> >>
> >> Here is my base:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b103bab0944be030954e5de23851b37980218f54
> >>
> >
> > Thanks, it worked!
> >
> >>>
> >>> On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
> >>>> Hello,
> >>>>
> >>>>                           DESCRIPTION
> >>>>
> >>>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
> >>>> current implementation for TCP as much as possible:
> >>>>
> >>>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
> >>>>   flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
> >>>>   flag will be ignored (e.g. without completion).
> >>>>
> >>>> 2) Kernel uses completions from socket's error queue. Single completion
> >>>>   for single tx syscall (or it can merge several completions to single
> >>>>   one). I used already implemented logic for MSG_ZEROCOPY support:
> >>>>   'msg_zerocopy_realloc()' etc.
> >>>>
> >>>> Difference with copy way is not significant. During packet allocation,
> >>>> non-linear skb is created, then I call 'pin_user_pages()' for each page
> >>>> from user's iov iterator and add each returned page to the skb as fragment.
> >>>> There are also some updates for vhost and guest parts of transport - in
> >>>> both cases i've added handling of non-linear skb for virtio part. vhost
> >>>> copies data from such skb to the guest's rx virtio buffers. In the guest,
> >>>> virtio transport fills tx virtio queue with pages from skb.
> >>>>
> >>>> This version has several limits/problems:
> >>>>
> >>>> 1) As this feature totally depends on transport, there is no way (or it
> >>>>   is difficult) to check whether transport is able to handle it or not
> >>>>   during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
> >>>>   setsockopt callback from setsockopt callback for SOL_SOCKET, but this
> >>>>   leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
> >>>>   are not considered to be called from each other. So in current version
> >>>>   SO_ZEROCOPY is set successfully to any type (e.g. transport) of
> >>>>   AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
> >>>>   tx routine will fail with EOPNOTSUPP.
> >>>
> >>> Do you plan to fix this in the next versions?
> >>>
> >>> If it is too complicated, I think we can have this limitation until we
> >>> find a good solution.
> >>>
> >>
> >> I'll try to fix it again, but just didn't pay attention on it in v2.
> >>
> >>>>
> >>>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
> >>>>   one completion. In each completion there is flag which shows how tx
> >>>>   was performed: zerocopy or copy. This leads that whole message must
> >>>>   be send in zerocopy or copy way - we can't send part of message with
> >>>>   copying and rest of message with zerocopy mode (or vice versa). Now,
> >>>>   we need to account vsock credit logic, e.g. we can't send whole data
> >>>>   once - only allowed number of bytes could sent at any moment. In case
> >>>>   of copying way there is no problem as in worst case we can send single
> >>>>   bytes, but zerocopy is more complex because smallest transmission
> >>>>   unit is single page. So if there is not enough space at peer's side
> >>>>   to send integer number of pages (at least one) - we will wait, thus
> >>>>   stalling tx side. To overcome this problem i've added simple rule -
> >>>>   zerocopy is possible only when there is enough space at another side
> >>>>   for whole message (to check, that current 'msghdr' was already used
> >>>>   in previous tx iterations i use 'iov_offset' field of it's iov iter).
> >>>
> >>> So, IIUC if MSG_ZEROCOPY is set, but there isn't enough space in the
> >>> destination we temporarily disable zerocopy, also if MSG_ZEROCOPY is set.
> >>> Right?
> >>
> >> Exactly, user still needs to get completion (because SO_ZEROCOPY is enabled and
> >> MSG_ZEROCOPY flag as used). But completion structure contains information that
> >> there was copying instead of zerocopying.
> >
> > Got it.
> >
> >>
> >>>
> >>> If it is the case it seems reasonable to me.
> >>>
> >>>>
> >>>> 3) loopback transport is not supported, because it requires to implement
> >>>>   non-linear skb handling in dequeue logic (as we "send" fragged skb
> >>>>   and "receive" it from the same queue). I'm going to implement it in
> >>>>   next versions.
> >>>>
> >>>>   ^^^ fixed in v2
> >>>>
> >>>> 4) Current implementation sets max length of packet to 64KB. IIUC this
> >>>>   is due to 'kmalloc()' allocated data buffers. I think, in case of
> >>>>   MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
> >>>>   not touched for data - user space pages are used as buffers. Also
> >>>>   this limit trims every message which is > 64KB, thus such messages
> >>>>   will be send in copy mode due to 'iov_offset' check in 2).
> >>>>
> >>>>   ^^^ fixed in v2
> >>>>
> >>>>                         PATCHSET STRUCTURE
> >>>>
> >>>> Patchset has the following structure:
> >>>> 1) Handle non-linear skbuff on receive in virtio/vhost.
> >>>> 2) Handle non-linear skbuff on send in virtio/vhost.
> >>>> 3) Updates for AF_VSOCK.
> >>>> 4) Enable MSG_ZEROCOPY support on transports.
> >>>> 5) Tests/tools/docs updates.
> >>>>
> >>>>                            PERFORMANCE
> >>>>
> >>>> Performance: it is a little bit tricky to compare performance between
> >>>> copy and zerocopy transmissions. In zerocopy way we need to wait when
> >>>> user buffers will be released by kernel, so it something like synchronous
> >>>> path (wait until device driver will process it), while in copy way we
> >>>> can feed data to kernel as many as we want, don't care about device
> >>>> driver. So I compared only time which we spend in the 'send()' syscall.
> >>>> Then if this value will be combined with total number of transmitted
> >>>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
> >>>> enough credit, receiver allocates same amount of space as sender needs.
> >>>>
> >>>> Sender:
> >>>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
> >>>>
> >>>> Receiver:
> >>>> ./vsock_perf --vsk-size 256M
> >>>>
> >>>> G2H transmission (values are Gbit/s):
> >>>>
> >>>> *-------------------------------*
> >>>> |          |         |          |
> >>>> | buf size |   copy  | zerocopy |
> >>>> |          |         |          |
> >>>> *-------------------------------*
> >>>> |   4KB    |    3    |    10    |
> >>>> *-------------------------------*
> >>>> |   32KB   |    9    |    45    |
> >>>> *-------------------------------*
> >>>> |   256KB  |    24   |    195   |
> >>>> *-------------------------------*
> >>>> |    1M    |    27   |    270   |
> >>>> *-------------------------------*
> >>>> |    8M    |    22   |    277   |
> >>>> *-------------------------------*
> >>>>
> >>>> H2G:
> >>>>
> >>>> *-------------------------------*
> >>>> |          |         |          |
> >>>> | buf size |   copy  | zerocopy |
> >>>> |          |         |          |
> >>>> *-------------------------------*
> >>>> |   4KB    |    17   |    11    |
> >>>
> >>> Do you know why in this case zerocopy is slower in this case?
> >>> Could be the cost of pin/unpin pages?
> >> May be, i think i need to analyze such enormous difference more. Also about
> >> pin/unpin: i found that there is already implemented function to fill non-linear
> >> skb with pages from user's iov: __zerocopy_sg_from_iter() in net/core/datagram.c.
> >> It uses 'get_user_pages()' instead of 'pin_user_pages()'. May be in my case it
> >> is also valid to user 'get_XXX()' instead of 'pin_XXX()', because it is used by
> >> TCP MSG_ZEROCOPY and iouring MSG_ZEROCOPY.
> >
> > If we can reuse them, it will be great!
> >
> >>
> >>>
> >>>> *-------------------------------*
> >>>> |   32KB   |    30   |    66    |
> >>>> *-------------------------------*
> >>>> |   256KB  |    38   |    179   |
> >>>> *-------------------------------*
> >>>> |    1M    |    38   |    234   |
> >>>> *-------------------------------*
> >>>> |    8M    |    28   |    279   |
> >>>> *-------------------------------*
> >>>>
> >>>> Loopback:
> >>>>
> >>>> *-------------------------------*
> >>>> |          |         |          |
> >>>> | buf size |   copy  | zerocopy |
> >>>> |          |         |          |
> >>>> *-------------------------------*
> >>>> |   4KB    |    8    |    7     |
> >>>> *-------------------------------*
> >>>> |   32KB   |    34   |    42    |
> >>>> *-------------------------------*
> >>>> |   256KB  |    43   |    83    |
> >>>> *-------------------------------*
> >>>> |    1M    |    40   |    109   |
> >>>> *-------------------------------*
> >>>> |    8M    |    40   |    171   |
> >>>> *-------------------------------*
> >>>>
> >>>> I suppose that huge difference above between both modes has two reasons:
> >>>> 1) We don't need to copy data.
> >>>> 2) We don't need to allocate buffer for data, only for header.
> >>>>
> >>>> Zerocopy is faster than classic copy mode, but of course it requires
> >>>> specific architecture of application due to user pages pinning, buffer
> >>>> size and alignment.
> >>>>
> >>>> If host fails to send data with "Cannot allocate memory", check value
> >>>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
> >>>> allocation.
> >>>
> >>> What the user needs to do? Increase it?
> >>>
> >> Yes, i'll update it.
> >>>>
> >>>>                            TESTING
> >>>>
> >>>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
> >>>> cover new code as much as possible so there are different cases for
> >>>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
> >>>> vector types (different sizes, alignments, with unmapped pages). I also
> >>>> run tests with loopback transport and running vsockmon.
> >>>
> >>> Thanks for the test again :-)
> >>>
> >>> This cover letter is very good, with a lot of details, but please add
> >>> more details in each single patch, explaining the reason of the changes,
> >>> otherwise it is very difficult to review, because it is a very big
> >>> change.
> >>>
> >>> I'll do a per-patch review in the next days.
> >>
> >> Sure, thanks! In v3 i'm also working on io_uring test, because this thing also
> >> supports MSG_ZEROCOPY, so we can do virtio/vsock + MSG_ZEROCOPY + io_uring.
> >
> > That would be cool!
> >
> > Do you want to me to review these patches or it is better to wait for v3?
>
> I think it is ok to wait for v3, as i'm going to reduce size of new kernel source code,
> especially by reusing already implemented functions instead of my own.

Okay, great! I'll wait for it ;-)

Thanks,
Stefano