Please find my comments inline.

Many Thanks,
Regards,
Shradha Shah

On 09/12/2012 08:01 PM, Laine Stump wrote:
> On 09/12/2012 05:59 AM, Daniel P. Berrange wrote:
>> On Tue, Sep 11, 2012 at 03:07:25PM -0400, Laine Stump wrote:
>>> On 09/07/2012 12:12 PM, Shradha Shah wrote:
>>>> This patch series adds support for interface type="hostdev-hybrid" and
>>>> forward mode="hostdev-hybrid".
>>>>
>>>> The hostdev-hybrid mode makes migration possible along with PCI passthrough.
>>>> I had posted an RFC on the hostdev-hybrid methodology earlier on the
>>>> libvirt mailing list.
>>>>
>>>> The RFC can be found here:
>>>> https://www.redhat.com/archives/libvir-list/2012-February/msg00309.html
>>>
>>> Before anything else, let me outline what I *think* happens with a
>>> hostdev-hybrid device entry, and you can tell me how far off I am :-):
>>>
>>> * Any hostdev-hybrid interface definition results in 2 PCI devices being
>>>   added to the guest:
>>>
>>>   a) a PCI passthrough of an SR-IOV VF (done essentially the same as
>>>      <interface type='hostdev'>)
>>>   b) a virtio-net device which is connected via macvtap "bridge" mode
>>>      (? is that always the case) to the PF of the VF in (a)
>>>
>>> * Both of these devices are assigned the same MAC address.
>>>
>>> * Each of these occupies one PCI address on the guest, so a total of 2
>>>   PCI addresses is needed for each hostdev-hybrid "device". (The
>>>   redundancy in this statement is to be sure that I'm right, as that's
>>>   an important point :-)
>>>
>>> * On the guest, these two network devices with matching MAC addresses
>>>   are put together into a bond interface, with an extra driver that
>>>   causes the bond to prefer the PCI-passthrough device when it is
>>>   present. So, under normal circumstances *all* traffic goes through
>>>   the PCI-passthrough device.
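(For anyone following along without the patches at hand: element and attribute names below are my reading of the RFC, not necessarily what the series finally uses, but a hostdev-hybrid interface drawn from a network pool might look something like this:

```xml
<!-- Hypothetical sketch, not taken verbatim from the patches -->
<interface type='network'>
  <mac address='52:54:00:11:22:33'/>
  <!-- the referenced network would use forward mode='hostdev-hybrid';
       libvirt picks a free VF from the pool for the passthrough device
       and derives the PF for the macvtap/virtio-net companion -->
  <source network='hybrid-pool'/>
  <model type='virtio'/>
</interface>
```

i.e. one `<interface>` element in the guest config, expanding to the two guest-side PCI devices described above.)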
>>>
>>> * At migration time, since guests with attached PCI-passthrough devices
>>>   can't be migrated, the PCI-passthrough device (which is found by
>>>   searching the hostdev array for items with the "ephemeral" flag set)
>>>   is detached. This reduces the bond interface on the guest to only
>>>   having the virtio-net device, so traffic now passes through that
>>>   device - it's slower, but connectivity is maintained.
>>>
>>> * On the destination, a new VF is found and set up with the proper MAC
>>>   address, VLAN, and 802.1QbX port info. A virtio-net device attached
>>>   to the PF associated with this VF (via macvtap bridge mode) is also
>>>   set up. The qemu command line includes an entry for both of these
>>>   devices. (Question: Is it the virtio-net device that uses the guest
>>>   PCI address given in the <interface> device info?) (Question:
>>>   actually, I guess the PCI-passthrough device won't be attached until
>>>   after the guest actually starts running on the destination host,
>>>   correct?)
>>>
>>> * When migration is finished, the guest is shut down on the source and
>>>   started up on the destination, leaving the new instance of the guest
>>>   temporarily with just a single (virtio-net) device in the bond.
>>>
>>> * Finally, the PCI passthrough of the VF is attached to the guest, and
>>>   the guest's bond interface resumes preferring this device, thus
>>>   restoring full-speed networking.
>>>
>>> Is that all correct?
>>>
>>> If so, one issue I have is that one of the devices (the
>>> PCI passthrough?) doesn't have its guest-side PCI address visible
>>> anywhere in the guest's XML, does it? This is problematic, because
>>> management applications (and libvirt itself) expect to be able to scan
>>> the list of devices to learn which PCI slots are occupied on the guest,
>>> and where they can add new devices.
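(To make the detach step above concrete: if an "ephemeral" flag were exposed on a plain hostdev, as discussed further down in this thread, the markup could look roughly like the sketch below - the attribute name and placement are my guess, modeled on the existing "managed" attribute:

```xml
<!-- Hypothetical: ephemeral='yes' marks the device as safe to detach
     for the duration of a migration (or save/restore) -->
<hostdev mode='subsystem' type='pci' managed='yes' ephemeral='yes'>
  <source>
    <address domain='0x0000' bus='0x03' slot='0x10' function='0x1'/>
  </source>
</hostdev>
```

Anything flagged this way would be unplugged before migration starts and re-attached - possibly as a different VF - once the guest is running on the destination.)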
>> If that description is correct,
>
> That's a big "if" - keep in mind the author of the description :-)
> (seriously, it's very possible I'm missing some important point)
>
>> then I have to wonder why we need to
>> add all this code for a new "hybrid" device type. It seems to me like
>> we can do all this already simply by listing one virtio device and one
>> hostdev device in the guest XML.
>
> Aside from detaching/re-attaching the hostdev, the other thing that
> these patches bring is automatic derivation of the <source> of the
> virtio-net device from the hostdev. The hostdev device will be grabbed
> from a pool of VFs in a <network>, then a "reverse lookup" is done in
> PCI space to determine the PF for that VF - that's where the virtio-net
> device is connected.
>
> I suppose this could be handled by 1) putting only the VFs of a single
> PF in any network definition's device pool, and 2) always having two
> parallel network definitions like this:
>
>   <network>
>     <name>net-x-vfs-hostdev</name>
>     <forward mode='hostdev' ephemeral='yes'>
>       <pf dev='eth3'/> <!-- makes a list of all VFs for PF 'eth3' -->
>     </forward>
>   </network>
>
>   <network>
>     <name>net-x-pf-macvtap</name>
>     <forward mode='bridge'>
>       <interface dev='eth3'/>
>     </forward>
>   </network>
>
> Then each guest would have:
>
>   <interface type='network'>
>     <mac address='x:x:x:x:x:x'/>
>     <source network='net-x-vfs-hostdev'/>
>   </interface>
>   <interface type='network'>
>     <mac address='x:x:x:x:x:x'/>
>     <source network='net-x-pf-macvtap'/>
>     <model type='virtio'/>
>   </interface>
>
> The problem with this is that then you can't have a pool that uses more
> than a single PF's worth of VFs. For example, I have an Intel 82576 card
> that has 2 PFs and 7 VFs per PF. This would mean that I can only have 7
> VFs in a network.
> Let's say I have 10 guests and want to migrate them
> back and forth between two hosts. I would have to make some arbitrary
> decision that some would use "net-x-vfs-hostdev+net-x-pf-macvtap" and
> others would use "net-y-vfs-hostdev+net-y-pf-macvtap". Even worse would
> be if I had more than 14 guests - there would be artificial limits
> (beyond simply "no more than 14 guests/host") on which guests could be
> moved to which machine at any given time (I would have to oversubscribe
> the 7-guest limit for one pair of networks, and no more than 7 of that
> subset of guests could be on the same host at the same time).
>
> If, instead, the PF used for the virtio-net device is derived from the
> particular VF currently assigned to the same guest's hostdev, I can have
> a single network definition with VFs from multiple PFs, and they all
> become one big pool of resources. In that case, my only limit is the far
> simpler "no more than 14 guests/host"; no worries about *which* of the
> guests those 14 are. tl;dr - the two-in-one hostdev-hybrid device
> simplifies administrative decisions when you have/need multiple PFs.
>
> (Another minor annoyance is that the dual device allows both to use the
> same auto-generated MAC address, but if we just use two individual
> devices, the MAC must be manually specified for each when the devices
> are originally defined (so that they will match).)
>
>> All that's required is to add support
>> for the 'ephemeral' flag against hostdevs, so they are automagically
>> unplugged. Technically we don't even need that, since a mgmt app can
>> already just use regular hot-unplug APIs before issuing the migrate
>> API calls.
>
> I like the idea of having that capability at libvirt's level, so that
> you can easily try things out with virsh (is the ephemeral flag
> implemented so that it also works for virsh save/restore? That would be
> a double plus.) A lot of us don't really use anything higher level than
> virsh or virt-manager, especially for testing.
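(For comparison, the single-pool arrangement being argued for here could look something like the following - again only a sketch, since the exact markup is up to the patches; the device names are illustrative:

```xml
<!-- Hypothetical: one pool spanning both PFs of the 82576 -->
<network>
  <name>net-hybrid</name>
  <forward mode='hostdev-hybrid'>
    <pf dev='eth2'/>  <!-- 7 VFs -->
    <pf dev='eth3'/>  <!-- 7 more VFs -->
  </forward>
</network>
```

Whichever of the 14 VFs a guest happens to get, libvirt can walk back from that VF to its PF and attach the macvtap/virtio-net side there, so all the VFs form one undifferentiated pool.)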
The ephemeral flag is not currently implemented so that it works for
virsh save/restore, but I can make the changes very easily if required.
The ephemeral flag will be an addition to the network XML config,
similar to "managed".

>
> (I actually think there's merit to adding the ephemeral flag (can anyone
> think of a better name? When I hear ephemeral, I think of that TV chef -
> Emeril) for hostdevs in general - it would provide a method of easily
> allowing save/restore/migration for guests that have hostdevs that could
> be temporarily detached without ill consequences. I think proper
> operation would require that qemu notify libvirt when it's *really*
> finished detaching a device, though (I don't have it at hand right now,
> but there's an open BZ requesting that from qemu).)
>
>> These patches seem to add a lot of complexity for mere
>> syntactic sugar over existing capabilities.
>
> I agree that the two-in-one device adds a lot of complexity. If we could
> find a way to derive the PF used for the virtio-net device from the VF
> used for the hostdev without having a combined two-in-one device entry
> (and being able to use a common auto-generated MAC address would be nice
> too), then I would agree that it should be left as two separate device
> entries (if nothing else, this gives us an obvious place to put the PCI
> address of the 2nd device). I'm not sure how to do that without limiting
> pools to a single PF, though. (I know, I know - the solution is for a
> higher level management application to modify the guest's config during
> migration according to what's in use. But if we're going to do that
> anyway, we may as well not have network definitions defining pools of
> interfaces in the first place.)
>
> --
> libvir-list mailing list
> libvir-list@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/libvir-list