Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net

On 24 April 2015 at 11:47, Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote:
>> Incidentally, we also did a pile of work last year on zero-copy NIC->VM
>> transfers and discovered a lot of interesting problems and edge cases where
>> Virtio-net spec and/or drivers are hard to match up with common NICs. Happy
>> to explain a bit about our experience if that would be valuable.
>
> That sounds interesting, can you describe the setup?

Sure.

We implemented a zero-copy receive path that takes guest buffers from the avail ring and uses them directly as hardware receive buffers on a dedicated per-VM hardware receive queue (VMDq).

This means that when the NIC receives a packet it stores the packet directly into the guest's memory, but the vswitch still has the opportunity to do as much or as little processing as it wants before making the packet available with a used ring descriptor.
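
To make that concrete, here is a rough sketch of the core step, assuming a simplified 82599-style receive descriptor and a hypothetical guest_to_host_dma() address-translation helper. The names and layout are illustrative, not our actual vswitch code:

#include <stdint.h>

struct vring_desc {            /* virtio descriptor table entry */
    uint64_t addr;             /* guest-physical address of the buffer */
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct hw_rx_desc {            /* simplified "advanced" RX descriptor */
    uint64_t pkt_addr;         /* DMA address the NIC writes the packet to */
    uint64_t hdr_addr;         /* unused here (no header split) */
};

/* Assumption for the sketch: guest RAM is a single region mapped at a
 * known host base, so translation is just an offset. */
static uint64_t guest_ram_host_base;

static uint64_t guest_to_host_dma(uint64_t gpa)
{
    return guest_ram_host_base + gpa;
}

/* Point one slot of the VM's dedicated VMDq receive queue at a guest
 * buffer that was just popped from the avail ring. */
static void post_guest_buffer(volatile struct hw_rx_desc *hwq, uint32_t slot,
                              const struct vring_desc *guest_buf)
{
    hwq[slot].pkt_addr = guest_to_host_dma(guest_buf->addr);
    hwq[slot].hdr_addr = 0;
    /* The NIC now DMAs the next packet for this VM straight into guest
     * memory; the vswitch completes it later on the used ring. */
}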

This scheme seems quite elegant to me. (I am sure it is not original - this is what the VMDq hardware feature is for, after all.) The devil is in the details, though.

I suspect it would work well given two extensions to Virtio-net:

1. The 'used' ring to allow an offset where the payload starts (sketched below).

2. The guest to always supply buffers with space for >= 2048 bytes of payload.
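
For what it's worth, (1) could be as small as one extra field on the used-ring element. This is purely a sketch of the idea, not a spec proposal:

#include <stdint.h>

/* Today's virtio used-ring element: a descriptor id and a byte count. */
struct vring_used_elem {
    uint32_t id;       /* head of the descriptor chain that was used */
    uint32_t len;      /* total bytes written into the chain */
};

/* Hypothetical extension: also report where the payload begins inside the
 * first buffer, so the device/vswitch can leave NIC headroom or strip
 * encapsulation in place instead of calling memmove(). */
struct vring_used_elem_off {
    uint32_t id;
    uint32_t len;
    uint16_t offset;   /* payload starts this many bytes into buffer 0 */
    uint16_t pad;
};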

Without these, though, it is tricky to satisfy the requirements of real NICs such as the Intel 10G ones. The requirements conflict. For example:

- The NIC requires buffer sizes to be uniform and a multiple of 1024 bytes, while the guest supplies variable-size buffers, often of ~1500 bytes. These need to be either rounded down to 1024 bytes (causing excessive segmentation) or rounded up to 2048 bytes (requiring jumbo frames to be globally disabled on the port to avoid potential overruns).

- Virtio-net with MRG_RXBUF expects the packet payload to be at a different offset for the first descriptor in a chain (offset 14 after the vnet header) vs following descriptors in the chain (offset 0). The NIC always stores packets at the same offset, so the vswitch needs to pick one and then correct with memmove() when needed (see the sketch after this list).

- If the vswitch wants to shorten the packet payload, e.g. to remove encapsulation, then this will require a memmove() because there is no way to communicate an offset on the used ring.

- The NIC has a limit to how many receive descriptors it can chain together. If the guest is supplying small buffers then this limit may be too low for jumbo frames to be received.
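
To illustrate the MRG_RXBUF point above: the NIC stores every fragment at one fixed offset, but the driver expects the first buffer of a packet to begin with the merged-rx virtio-net header and the payload right after it, so whenever the two offsets disagree the first buffer has to be shuffled. A sketch, with hypothetical names (buf, nic_off, pkt_len) for the pieces the vswitch already knows:

#include <stdint.h>
#include <string.h>

/* virtio-net header used when MRG_RXBUF has been negotiated. */
struct virtio_net_hdr_mrg_rxbuf {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;      /* how many descriptors the packet spans */
};

/* The NIC DMAed pkt_len bytes at offset nic_off into the first guest
 * buffer; the guest expects the payload right after the vnet header.
 * Shift the data when the offsets disagree, then write the header. */
static void fix_first_buffer(uint8_t *buf, size_t nic_off, size_t pkt_len,
                             uint16_t num_buffers)
{
    struct virtio_net_hdr_mrg_rxbuf hdr = { .num_buffers = num_buffers };
    const size_t want = sizeof(hdr);

    if (nic_off != want)
        memmove(buf + want, buf + nic_off, pkt_len);

    memcpy(buf, &hdr, sizeof(hdr));
}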

At a certain point we decided we were better off shifting our focus away from clever-but-fragile NIC hacks and towards clever-and-robust SIMD hacks, and that is the path we have been on for the past few months.




_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
