Michael,

>> The idea is simple: pin the guest VM user space and then
>> let the host NIC driver DMA to it directly.
>> The patches are based on the vhost-net backend driver. We add a device
>> which provides proto_ops (sendmsg/recvmsg) to vhost-net to
>> send/recv directly to/from the NIC driver. A KVM guest that uses the
>> vhost-net backend may bind any ethX interface on the host side to
>> get copyless data transfer through the guest virtio-net frontend.
>>
>> The scenario is like this:
>>
>> The guest virtio-net driver submits multiple requests through the
>> vhost-net backend driver to the kernel. The requests are queued and
>> then completed after the corresponding actions in h/w are done.
>>
>> For read, user space buffers are dispensed to the NIC driver for rx
>> when a page constructor API is invoked. This means NICs can allocate
>> user buffers from a page constructor. We add a hook in the
>> netif_receive_skb() function to intercept the incoming packets and
>> notify the zero-copy device.
>>
>> For write, the zero-copy device may allocate a new host skb, put the
>> payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
>> The request remains pending until the skb is transmitted by h/w.
>>
>> Here, we have considered 2 ways to utilize the page constructor
>> API to dispense the user buffers.
>>
>> One:   Modify the __alloc_skb() function a bit so that it allocates
>>        only the sk_buff structure, and the data pointer points to a
>>        user buffer which comes from a page constructor API.
>>        Then the shinfo of the skb is also from the guest.
>>        When a packet is received from hardware, skb->data is filled
>>        directly by h/w. This is the way we have implemented it.
>>
>>        Pros:   We can avoid any copy here.
>>        Cons:   The guest virtio-net driver needs to allocate skbs in
>>                almost the same way as the host NIC drivers, say the
>>                size of netdev_alloc_skb() and the same reserved space
>>                in the head of the skb. Many NIC drivers are the same
>>                as the guest and ok for this.
>>                But some of the latest NIC drivers reserve special
>>                room in the skb head. To deal with it, we suggest
>>                providing a method in the guest virtio-net driver to
>>                ask the NIC driver for the parameters we are interested
>>                in once we know which device we have bound to do
>>                zero-copy. Then we ask the guest to do so.
>>                Is that reasonable?

>Unfortunately, this would break compatibility with existing virtio.
>This also complicates migration.

You mean any modification to the guest virtio-net driver will break
compatibility? We tried enlarging virtio_net_config to contain the 2
parameters and adding a VIRTIO_NET_F_PASSTHRU feature flag. virtnet_probe()
checks the feature flag and reads the parameters, and the virtio-net driver
then uses them to allocate buffers. How about this?

>What is the room in skb head used for?

I'm not sure, but the latest ixgbe driver does this: it reserves 32 bytes,
compared to NET_IP_ALIGN.

>> Two:   Modify the driver to get user buffers allocated from a page
>>        constructor API (to substitute alloc_page()). The user buffers
>>        are used as payload buffers and filled by h/w directly when a
>>        packet is received. The driver should associate the pages with
>>        the skb (skb_shinfo(skb)->frags). For the head buffer side, let
>>        the host allocate the skb, and h/w fills it. After that, the
>>        data filled in the host skb header will be copied into the guest
>>        header buffer, which is submitted together with the payload
>>        buffer.
>>
>>        Pros:   We need to care less about how the guest or host
>>                allocates their buffers.
>>        Cons:   We still need a small copy here for the skb header.
>>
>> We are not sure which way is better here.

>The obvious question would be whether you see any speed difference
>with the two approaches. If no, then the second approach would be
>better.

I remember the second approach being a bit slower with 1500 MTU,
but we have not tested it much.

>> This is the first thing we want
>> comments on from the community.
>> We wish the modification to the network
>> part to be generic, not used by the vhost-net backend only; a user
>> application may use it as well when the zero-copy device provides
>> async read/write operations later.
>>
>> Please give comments, especially on the network part modifications.
>>
>>
>> We provide multiple submits and asynchronous notification to
>> vhost-net too.
>>
>> Our goal is to improve the bandwidth and reduce the CPU usage.
>> Exact performance data will be provided later. But for a simple
>> test with netperf, we found bandwidth up and CPU % up too,
>> but the bandwidth up ratio is much more than the CPU % up ratio.
>>
>> What we have not done yet:
>>        packet split support

>What does this mean, exactly?

We can support 1500 MTU, but for jumbo frames, since the vhost driver did
not support mergeable buffers before, we could not try it with multiple sg.
A jumbo frame will be split into 5 frags, all hooked to one descriptor, so
the user buffer allocation depends greatly on how the guest virtio-net
driver submits buffers. We think mergeable buffers are suitable for it.

>>        To support GRO

Actually, I think if mergeable buffers get good performance, then GRO is
not so important.

>And TSO/GSO?

Do we really need them?
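
On the mergeable buffer point above, a rough back-of-envelope sketch of how
many guest rx buffers a single frame would occupy. It is only arithmetic
under stated assumptions, not anything from the patches: the page-sized
guest buffer (GUEST_BUF_SIZE) and the helper name buffers_needed() are
hypothetical, and the 12-byte header is the size of the legacy
struct virtio_net_hdr_mrg_rxbuf (10-byte virtio_net_hdr plus the 16-bit
num_buffers field).

```python
import math

# Illustrative assumptions only; none of these values come from the patches.
VIRTIO_NET_HDR_MRG_RXBUF_LEN = 12  # legacy virtio_net_hdr (10) + num_buffers (2)
ETH_HLEN = 14                      # Ethernet header length
GUEST_BUF_SIZE = 4096              # hypothetical: guest posts page-sized rx buffers


def buffers_needed(mtu, buf_size=GUEST_BUF_SIZE):
    """Guest rx buffers consumed by one frame when buffers can be merged."""
    frame_len = VIRTIO_NET_HDR_MRG_RXBUF_LEN + ETH_HLEN + mtu
    return math.ceil(frame_len / buf_size)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {buffers_needed(mtu)} buffer(s)")
```

With mergeable buffers the count is reported to the guest in num_buffers, so
the host can span a jumbo frame over however many buffers the guest happened
to post, rather than depending on one descriptor that is large enough.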
>>        Performance tuning
>>
>> what we have done in v1:
>>        polish the RCU usage
>>        deal with write logging in asynchronous mode in vhost
>>        add notifier block for mp device
>>        rename page_ctor to mp_port in netdevice.h to make it look generic
>>        add mp_dev_change_flags() for mp device to change NIC state
>>        add CONFIG_VHOST_MPASSTHRU to limit the usage when module is not loaded
>>        a small fix for missing dev_put on failure
>>        use dynamic minor instead of static minor number
>>        a __KERNEL__ protect for mp_get_sock()
>>
>> what we have done in v2:
>>
>>        Remove most of the RCU usage, since the ctor pointer is only
>>        changed by the BIND/UNBIND ioctl, and during that time the NIC
>>        is stopped to get a good cleanup (all outstanding requests are
>>        finished), so the ctor pointer cannot be raced into a wrong
>>        situation.
>>
>>        Replace struct vhost_notifier with struct kiocb.
>>        Let the vhost-net backend alloc/free the kiocb and transfer them
>>        via sendmsg/recvmsg.
>>
>>        Use get_user_pages_fast() and set_page_dirty_lock() on read.
>>
>>        Add some comments for netdev_mp_port_prep() and handle_mpassthru().
>>
>>
>> Comments not addressed yet this time:
>>        the async write logging is not satisfied by vhost-net
>>        Qemu needs a sync write
>>        a limit for locked pages from get_user_pages_fast()
>>
>>
>> performance:
>>        using netperf with GSO/TSO disabled, 10G NIC,
>>        packet split mode disabled, raw socket case compared to vhost.
>>
>>        bandwidth goes from 1.1Gbps to 1.7Gbps
>>        CPU % from 120%-140% to 140%-160%
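
As a quick check of the "bandwidth up ratio is much more than CPU % up
ratio" statement against the numbers quoted above, a small sketch of the
arithmetic. The figures are the ones from the posting; pairing the CPU
ranges by midpoint and by worst case (lowest "before" against highest
"after") is an assumption made here, since the posting does not say how the
ranges correspond. The helper name pct_increase() is hypothetical.

```python
def pct_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100.0


# Bandwidth: 1.1 Gbps -> 1.7 Gbps (from the posting).
bw_gain = pct_increase(1.1, 1.7)

# CPU: 120%-140% -> 140%-160% (from the posting).
# Assumed pairings: midpoints, and the worst case for the zero-copy path.
cpu_gain_mid = pct_increase(130, 150)
cpu_gain_worst = pct_increase(120, 160)

print(f"bandwidth gain:        {bw_gain:.1f}%")
print(f"CPU gain (midpoints):  {cpu_gain_mid:.1f}%")
print(f"CPU gain (worst case): {cpu_gain_worst:.1f}%")
```

Under these pairings the bandwidth gain is about 54.5%, against about 15.4%
(midpoints) or 33.3% (worst case) for CPU, so the statement holds for the
quoted figures even in the least favourable reading.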