RFC: New network forward type pci-passthrough-hybrid

I saw a couple of posts regarding PCI passthrough usage of SRIOV VFs a couple of weeks ago (20th Jan 2012). Initially I was going to post this RFC along with a set of patches. I need a few more days to clean up my patches for submission, so I am starting with an RFC on a new method to manage PCI passthrough of SRIOV VFs.

I work for Solarflare Communications, who make 10G network adapters. We currently have SRIOV-capable adapters available and in production, and we would like to work with upstream libvirt to develop the required support for our hardware.

This RFC introduces a new network forward mode to libvirt called pci-passthrough-hybrid. It provides a solution for migration with PCI passthrough as well as a significant increase in networking performance.

The Solarflare SRIOV driver architecture for KVM is explained in the release notes, which can be found here:

https://support.solarflare.com/index.php?view=categories&id=1813&option=com_cognidox&Itemid=2

This is a working model and is currently available to Solarflare customers for evaluation. The hybrid model of the SRIOV driver provided by Solarflare currently achieves the highest SPECvirt performance in the market.

The Solarflare Ethernet card supports 127 VFs on each port. The MAC address of each unused VF is 00:00:00:00:00:00 by default, hence the MAC address of a VF does not change on every reboot. There is no VF driver on the host, and a VF does not correspond to an Ethernet device. Instead, VFs are managed using the PCI sysfs files.

With the pci-passthrough-hybrid model, when the VF is passed into the guest it appears in the guest as a PCI device and not as a network device. A virtual network device in the form of a virtio interface is also present in the guest.
The virtio device in the guest comes either from bridging the physical network device or from creating a macvtap interface (of type vepa, private or bridge) on the physical network device. The virtio device and the VF bind together in the guest to create an accelerated and a non-accelerated path.

The new method I wish to propose uses implicit PCI passthrough, so there is no need to provide an explicit <hostdev> element in the domain XML. The hostdev would be added to the live XML as non-persistent, as suggested by Laine Stump in a previous post:

https://www.redhat.com/archives/libvir-list/2011-August/msg00937.html

1) In order to support the hybrid model described above, the VF needs to be assigned the same MAC address as the virtio device in the guest. This enables the VF and the virtio device to bind successfully using the Solarflare driver called XNAP. Effectively we do not need to extend the <hostdev> schema; this can be taken care of by the <interface> element. Along with the MAC address, the VLAN tags can also be taken care of by the <interface>/<network> elements.

2) The VF appears in the guest as a PCI device, hence the MAC address of the VF is stored in the sysfs files. Assigning the MAC address to the VF before or after PCI passthrough is not an issue.

Proposed steps to support the hybrid model of PCI passthrough in libvirt:

1) <network> will have a new forward mode='pci-passthrough-hybrid'. With this forward mode, a <pf> element needs to be specified for implicit VF allocation instead of a pool of Ethernet interfaces, as shown in the example below:

     <network>
       <name>direct-network</name>
       <forward mode="pci-passthrough-hybrid">
         <pf dev="eth2"/>
       </forward>
     </network>

2) In the domain's <interface> definition, when type='network' and the network has forward mode='pci-passthrough-hybrid', the domain code will request an unused VF from the physical device.
   Example:

     <interface type='network'>
       <source network='direct-network'/>
       <mac address='00:50:56:0f:86:3b'/>
       <model type='virtio'/>
       <actual type='direct'>
         <source mode='pci-passthrough-hybrid'/>
       </actual>
     </interface>

3) The code will then use the NodeDevice API to learn all the necessary PCI domain/slot/bus/function information.

4) Before starting the guest, the VF's PCI device name (0000:04:00.2) will be saved in interface/actual so that it can be easily retrieved if libvirtd is restarted.

5) While building the qemu command line, if a network device has forward mode='pci-passthrough-hybrid', the code will add a (non-persisting) <hostdev> element to the qemu command line. This <hostdev> will be marked as ephemeral before it is passed to the guest. (Ephemeral here means transient.)

6) During the process of network connection, the MAC address of the VF will be set according to the domain <interface> config. This step can also involve setting the VLAN tag, port profiles, etc.

7) Following the above steps, the guest will start with implicit PCI passthrough of an SRIOV VF.

8) When the guest is eventually destroyed, the VF will be freed back to the network pool for use by another guest. Since the MAC address only needs to be reset to 00:00:00:00:00:00, we do not need any reference to the higher-level device definition.

Since the VF is transient, it will be removed when the guest is shut down and hotplugged again, by the libvirt API, when the guest is started. Hence, in order to get a list of hostdevs attached to a guest, we only ever have to look at the <hostdev> element.

One of the objections raised following Mr Stump's post was that a transient hostdev does not ensure that the guest PCI address stays the same each time the guest is run. Since the VF is a PCI device in the guest and does not bind to a specific driver, we can work with this proposed solution.
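As a rough illustration of steps 2), 3), 6) and 8) above, here is a minimal Python sketch of the VF bookkeeping. It is not libvirt code and it does not touch sysfs: VfPool, parse_pci_address and the in-memory dictionary are hypothetical names invented for this example. It also assumes that an unused VF can be recognised by the all-zero MAC address mentioned earlier; the real patches may track allocation differently.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, NamedTuple

# Default MAC of an unused VF, as stated in the RFC text.
UNUSED_MAC = "00:00:00:00:00:00"


class PciAddress(NamedTuple):
    domain: int
    bus: int
    slot: int
    function: int


# PCI device names look like 0000:04:00.2 (domain:bus:slot.function, hex).
_PCI_RE = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_pci_address(name: str) -> PciAddress:
    """Split a PCI device name into domain/bus/slot/function (step 3)."""
    m = _PCI_RE.match(name)
    if m is None:
        raise ValueError(f"not a PCI device name: {name!r}")
    return PciAddress(*(int(g, 16) for g in m.groups()))


@dataclass
class VfPool:
    """Hypothetical in-memory stand-in for the VFs of one PF (e.g. eth2)."""

    # Maps VF PCI device name -> MAC address currently set on that VF.
    macs: Dict[str, str] = field(default_factory=dict)

    def allocate(self, guest_mac: str) -> str:
        """Pick an unused VF and give it the virtio MAC (steps 2 and 6)."""
        for name in sorted(self.macs):
            if self.macs[name] == UNUSED_MAC:
                self.macs[name] = guest_mac
                return name
        raise RuntimeError("no unused VF available on this PF")

    def release(self, name: str) -> None:
        """Reset the MAC so the VF counts as unused again (step 8)."""
        self.macs[name] = UNUSED_MAC


if __name__ == "__main__":
    pool = VfPool({"0000:04:00.2": UNUSED_MAC, "0000:04:00.3": UNUSED_MAC})
    vf = pool.allocate("00:50:56:0f:86:3b")  # MAC from the <interface> example
    print(vf, parse_pci_address(vf))
    pool.release(vf)
```

Note that release() needs only the VF's PCI device name, which mirrors the point in step 8) that no reference to the higher-level device definition is required to return a VF to the pool.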
Migration is possible using the above method, without any explicit effort from the user, in the following way:

1) Begin stage: No ephemeral devices make their way into the XML that is passed to the destination.

2) Prepare stage: Replacement VFs on the destination, if present, will be automatically reserved and plugged into the guest by the networking code.

3) Perform stage: Any ephemeral devices are removed from the guest by libvirt.

4) Confirm stage: If migration fails, the VFs will be restored; otherwise the VFs will be freed back to the networking pool by the networking code.

I have been working on the patches for the above method and would like to know your take on the hybrid model.