Re: [RFC PATCH v2 00/15] vsock: MSG_ZEROCOPY flag support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


On Wed, May 03, 2023 at 04:11:59PM +0300, Arseniy Krasnov wrote:

On 03.05.2023 15:52, Stefano Garzarella wrote:
Hi Arseniy,
Sorry for the delay, but I have been very busy.

Hello, no problem!

I can't apply this series on master or net-next, can you share with me
the base commit?

Here is my base:

Thanks, it worked!

On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:


this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
current implementation for TCP as much as possible:

1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
  flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
  flag will be ignored (e.g. without completion).

2) Kernel uses completions from socket's error queue. Single completion
  for single tx syscall (or it can merge several completions to single
  one). I used already implemented logic for MSG_ZEROCOPY support:
  'msg_zerocopy_realloc()' etc.

Difference with copy way is not significant. During packet allocation,
non-linear skb is created, then I call 'pin_user_pages()' for each page
from user's iov iterator and add each returned page to the skb as fragment.
There are also some updates for vhost and guest parts of transport - in
both cases i've added handling of non-linear skb for virtio part. vhost
copies data from such skb to the guest's rx virtio buffers. In the guest,
virtio transport fills tx virtio queue with pages from skb.

This version has several limits/problems:

1) As this feature totally depends on transport, there is no way (or it
  is difficult) to check whether transport is able to handle it or not
  during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
  setsockopt callback from setsockopt callback for SOL_SOCKET, but this
  leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
  are not considered to be called from each other. So in current version
  SO_ZEROCOPY is set successfully to any type (e.g. transport) of
  AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
  tx routine will fail with EOPNOTSUPP.

Do you plan to fix this in the next versions?

If it is too complicated, I think we can have this limitation until we
find a good solution.

I'll try to fix it again, but just didn't pay attention on it in v2.

2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
  one completion. In each completion there is flag which shows how tx
  was performed: zerocopy or copy. This leads that whole message must
  be send in zerocopy or copy way - we can't send part of message with
  copying and rest of message with zerocopy mode (or vice versa). Now,
  we need to account vsock credit logic, e.g. we can't send whole data
  once - only allowed number of bytes could sent at any moment. In case
  of copying way there is no problem as in worst case we can send single
  bytes, but zerocopy is more complex because smallest transmission
  unit is single page. So if there is not enough space at peer's side
  to send integer number of pages (at least one) - we will wait, thus
  stalling tx side. To overcome this problem i've added simple rule -
  zerocopy is possible only when there is enough space at another side
  for whole message (to check, that current 'msghdr' was already used
  in previous tx iterations i use 'iov_offset' field of it's iov iter).

So, IIUC if MSG_ZEROCOPY is set, but there isn't enough space in the
destination we temporarily disable zerocopy, also if MSG_ZEROCOPY is set.

Exactly, user still needs to get completion (because SO_ZEROCOPY is enabled and
MSG_ZEROCOPY flag as used). But completion structure contains information that
there was copying instead of zerocopying.

Got it.

If it is the case it seems reasonable to me.

3) loopback transport is not supported, because it requires to implement
  non-linear skb handling in dequeue logic (as we "send" fragged skb
  and "receive" it from the same queue). I'm going to implement it in
  next versions.

  ^^^ fixed in v2

4) Current implementation sets max length of packet to 64KB. IIUC this
  is due to 'kmalloc()' allocated data buffers. I think, in case of
  MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
  not touched for data - user space pages are used as buffers. Also
  this limit trims every message which is > 64KB, thus such messages
  will be send in copy mode due to 'iov_offset' check in 2).

  ^^^ fixed in v2

                        PATCHSET STRUCTURE

Patchset has the following structure:
1) Handle non-linear skbuff on receive in virtio/vhost.
2) Handle non-linear skbuff on send in virtio/vhost.
3) Updates for AF_VSOCK.
4) Enable MSG_ZEROCOPY support on transports.
5) Tests/tools/docs updates.


Performance: it is a little bit tricky to compare performance between
copy and zerocopy transmissions. In zerocopy way we need to wait when
user buffers will be released by kernel, so it something like synchronous
path (wait until device driver will process it), while in copy way we
can feed data to kernel as many as we want, don't care about device
driver. So I compared only time which we spend in the 'send()' syscall.
Then if this value will be combined with total number of transmitted
bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
enough credit, receiver allocates same amount of space as sender needs.

./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]

./vsock_perf --vsk-size 256M

G2H transmission (values are Gbit/s):

|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
|   4KB    |    3    |    10    |
|   32KB   |    9    |    45    |
|   256KB  |    24   |    195   |
|    1M    |    27   |    270   |
|    8M    |    22   |    277   |


|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
|   4KB    |    17   |    11    |

Do you know why in this case zerocopy is slower in this case?
Could be the cost of pin/unpin pages?
May be, i think i need to analyze such enormous difference more. Also about
pin/unpin: i found that there is already implemented function to fill non-linear
skb with pages from user's iov: __zerocopy_sg_from_iter() in net/core/datagram.c.
It uses 'get_user_pages()' instead of 'pin_user_pages()'. May be in my case it
is also valid to user 'get_XXX()' instead of 'pin_XXX()', because it is used by

If we can reuse them, it will be great!

|   32KB   |    30   |    66    |
|   256KB  |    38   |    179   |
|    1M    |    38   |    234   |
|    8M    |    28   |    279   |


|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
|   4KB    |    8    |    7     |
|   32KB   |    34   |    42    |
|   256KB  |    43   |    83    |
|    1M    |    40   |    109   |
|    8M    |    40   |    171   |

I suppose that huge difference above between both modes has two reasons:
1) We don't need to copy data.
2) We don't need to allocate buffer for data, only for header.

Zerocopy is faster than classic copy mode, but of course it requires
specific architecture of application due to user pages pinning, buffer
size and alignment.

If host fails to send data with "Cannot allocate memory", check value
/proc/sys/net/core/optmem_max - it is accounted during completion skb

What the user needs to do? Increase it?

Yes, i'll update it.


This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
cover new code as much as possible so there are different cases for
MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
vector types (different sizes, alignments, with unmapped pages). I also
run tests with loopback transport and running vsockmon.

Thanks for the test again :-)

This cover letter is very good, with a lot of details, but please add
more details in each single patch, explaining the reason of the changes,
otherwise it is very difficult to review, because it is a very big

I'll do a per-patch review in the next days.

Sure, thanks! In v3 i'm also working on io_uring test, because this thing also
supports MSG_ZEROCOPY, so we can do virtio/vsock + MSG_ZEROCOPY + io_uring.

That would be cool!

Do you want to me to review these patches or it is better to wait for v3?


Virtualization mailing list

[Index of Archives]     [KVM Development]     [Libvirt Development]     [Libvirt Users]     [CentOS Virtualization]     [Netdev]     [Ethernet Bridging]     [Linux Wireless]     [Kernel Newbies]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]

  Powered by Linux