Please find my comments inline.

Many Thanks,
Regards,
Shradha Shah

On 09/12/2012 08:01 PM, Laine Stump wrote:
> On 09/12/2012 05:59 AM, Daniel P. Berrange wrote:
>> On Tue, Sep 11, 2012 at 03:07:25PM -0400, Laine Stump wrote:
>>> On 09/07/2012 12:12 PM, Shradha Shah wrote:
>>>> This patch series adds support for interface type="hostdev-hybrid" and
>>>> forward mode="hostdev-hybrid".
>>>>
>>>> The hostdev-hybrid mode makes migration possible along with PCI passthrough.
>>>> I had posted an RFC on the hostdev-hybrid methodology earlier on the
>>>> libvirt mailing list.
>>>>
>>>> The RFC can be found here:
>>>> https://www.redhat.com/archives/libvir-list/2012-February/msg00309.html
>>>
>>> Before anything else, let me outline what I *think* happens with a
>>> hostdev-hybrid device entry, and you can tell me how far off I am :-):
>>>
>>> * Any hostdev-hybrid interface definition results in 2 PCI devices being
>>>   added to the guest:
>>>
>>>   a) a PCI passthrough of an SR-IOV VF (done essentially the same as
>>>      <interface type='hostdev'>)
>>>   b) a virtio-net device which is connected via macvtap "bridge" mode
>>>      (? is that always the case) to the PF of the VF in (a)
>>>
>>> * Both of these devices are assigned the same MAC address.
>>>
>>> * Each of these occupies one PCI address on the guest, so a total of 2
>>>   PCI addresses is needed for each hostdev-hybrid "device". (The
>>>   redundancy in this statement is to be sure that I'm right, as that's
>>>   an important point :-)
>>>
>>> * On the guest, these two network devices with matching MAC addresses
>>>   are put together into a bond interface, with an extra driver that
>>>   causes the bond to prefer the PCI-passthrough device when it is
>>>   present. So, under normal circumstances *all* traffic goes through
>>>   the PCI-passthrough device.
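(For anyone following along without the patches at hand: element and attribute names below are my reading of the RFC, not necessarily what the series finally uses, but a hostdev-hybrid interface drawn from a network pool might look something like this:

```xml
<!-- Hypothetical sketch, not taken verbatim from the patches -->
<interface type='network'>
  <mac address='52:54:00:11:22:33'/>
  <!-- the referenced network would use forward mode='hostdev-hybrid';
       libvirt picks a free VF from the pool for the passthrough device
       and derives the PF for the macvtap/virtio-net companion -->
  <source network='hybrid-pool'/>
  <model type='virtio'/>
</interface>
```

i.e. one `<interface>` element in the guest config, expanding to the two guest-side PCI devices described above.)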
>>>
>>> * At migration time, since guests with attached PCI-passthrough devices
>>>   can't be migrated, the PCI-passthrough device (which is found by
>>>   searching the hostdev array for items with the "ephemeral" flag set)
>>>   is detached. This reduces the bond interface on the guest to only
>>>   having the virtio-net device, so traffic now passes through that
>>>   device - it's slower, but connectivity is maintained.
>>>
>>> * On the destination, a new VF is found and set up with the proper MAC
>>>   address, VLAN, and 802.1QbX port info. A virtio-net device attached
>>>   to the PF associated with this VF (via macvtap bridge mode) is also
>>>   set up. The qemu command line includes an entry for both of these
>>>   devices. (Question: Is it the virtio-net device that uses the guest
>>>   PCI address given in the <interface> device info?) (Question:
>>>   actually, I guess the PCI-passthrough device won't be attached until
>>>   after the guest actually starts running on the destination host,
>>>   correct?)
>>>
>>> * When migration is finished, the guest is shut down on the source and
>>>   started up on the destination, leaving the new instance of the guest
>>>   temporarily with just a single (virtio-net) device in the bond.
>>>
>>> * Finally, the PCI passthrough of the VF is attached to the guest, and
>>>   the guest's bond interface resumes preferring this device, thus
>>>   restoring full-speed networking.
>>>
>>> Is that all correct?
>>>
>>> If so, one issue I have is that one of the devices (the
>>> PCI passthrough?) doesn't have its guest-side PCI address visible
>>> anywhere in the guest's XML, does it? This is problematic, because
>>> management applications (and libvirt itself) expect to be able to scan
>>> the list of devices to learn which PCI slots are occupied on the guest,
>>> and where they can add new devices.
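(To make the detach step above concrete: if an "ephemeral" flag were exposed on a plain hostdev, as discussed further down in this thread, the markup could look roughly like the sketch below - the attribute name and placement are my guess, modeled on the existing "managed" attribute:

```xml
<!-- Hypothetical: ephemeral='yes' marks the device as safe to detach
     for the duration of a migration (or save/restore) -->
<hostdev mode='subsystem' type='pci' managed='yes' ephemeral='yes'>
  <source>
    <address domain='0x0000' bus='0x03' slot='0x10' function='0x1'/>
  </source>
</hostdev>
```

Anything flagged this way would be unplugged before migration starts and re-attached - possibly as a different VF - once the guest is running on the destination.)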
>> If that description is correct,
>
> That's a big "if" - keep in mind the author of the description :-)
> (seriously, it's very possible I'm missing some important point)
>
>> then I have to wonder why we need to
>> add all this code for a new "hybrid" device type. It seems to me like
>> we can do all this already simply by listing one virtio device and one
>> hostdev device in the guest XML.
>
> Aside from detaching/re-attaching the hostdev, the other thing that
> these patches bring is automatic derivation of the <source> of the
> virtio-net device from the hostdev. The hostdev device will be grabbed
> from a pool of VFs in a <network>, then a "reverse lookup" is done in
> PCI space to determine the PF for that VF - that's where the virtio-net
> device is connected.
>
> I suppose this could be handled by 1) putting only the VFs of a single
> PF in any network definition's device pool, and 2) always having two
> parallel network definitions like this:
>
>   <network>
>     <name>net-x-vfs-hostdev</name>
>     <forward mode='hostdev' ephemeral='yes'>
>       <pf dev='eth3'/> <!-- makes a list of all VFs for PF 'eth3' -->
>     </forward>
>   </network>
>
>   <network>
>     <name>net-x-pf-macvtap</name>
>     <forward mode='bridge'>
>       <interface dev='eth3'/>
>     </forward>
>   </network>
>
> Then each guest would have:
>
>   <interface type='network'>
>     <mac address='x:x:x:x:x:x'/>
>     <source network='net-x-vfs-hostdev'/>
>   </interface>
>   <interface type='network'>
>     <mac address='x:x:x:x:x:x'/>
>     <source network='net-x-pf-macvtap'/>
>     <model type='virtio'/>
>   </interface>
>
> The problem with this is that then you can't have a pool that uses more
> than a single PF's worth of VFs. For example, I have an Intel 82576 card
> that has 2 PFs and 7 VFs per PF. This would mean that I can only have 7
> VFs in a network.
> Let's say I have 10 guests and want to migrate them
> back and forth between two hosts. I would have to make some arbitrary
> decision that some would use "net-x-vfs-hostdev+net-x-pf-macvtap" and
> others would use "net-y-vfs-hostdev+net-y-pf-macvtap". Even worse would
> be if I had more than 14 guests - there would be artificial limits
> (beyond simply "no more than 14 guests/host") on which guests could be
> moved to which machine at any given time (I would have to oversubscribe
> the 7-guest limit for one pair of networks, and no more than 7 of that
> subset of guests could be on the same host at the same time).
>
> If, instead, the PF used for the virtio-net device is derived from the
> particular VF currently assigned to the same guest's hostdev, I can have
> a single network definition with VFs from multiple PFs, and they all
> become one big pool of resources. In that case, my only limit is the far
> simpler "no more than 14 guests/host"; no worries about *which* of the
> guests those 14 are. tl;dr - the two-in-one hostdev-hybrid device
> simplifies administrative decisions when you have/need multiple PFs.
>
> (Another minor annoyance is that the dual device allows both to use the
> same auto-generated MAC address, but if we just use two individual
> devices, the MAC must be manually specified for each when the devices
> are originally defined (so that they will match).)
>
>> All that's required is to add support
>> for the 'ephemeral' flag against hostdevs, so they are automagically
>> unplugged. Technically we don't even need that, since a mgmt app can
>> already just use regular hot-unplug APIs before issuing the migrate
>> API calls.
>
> I like the idea of having that capability at libvirt's level, so that
> you can easily try things out with virsh (is the ephemeral flag
> implemented so that it also works for virsh save/restore? That would be
> a double plus.) A lot of us don't really use anything higher level than
> virsh or virt-manager, especially for testing.
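(For comparison, the single-pool arrangement being argued for here could look something like the following - again only a sketch, since the exact markup is up to the patches; the device names are illustrative:

```xml
<!-- Hypothetical: one pool spanning both PFs of the 82576 -->
<network>
  <name>net-hybrid</name>
  <forward mode='hostdev-hybrid'>
    <pf dev='eth2'/>  <!-- 7 VFs -->
    <pf dev='eth3'/>  <!-- 7 more VFs -->
  </forward>
</network>
```

Whichever of the 14 VFs a guest happens to get, libvirt can walk back from that VF to its PF and attach the macvtap/virtio-net side there, so all the VFs form one undifferentiated pool.)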
The ephemeral flag is not currently implemented so that it works for
virsh save/restore, but I can make the changes very easily if required.
The ephemeral flag will be an addition to the network XML config,
similar to "managed".

>
> (I actually think there's merit to adding the ephemeral flag (can anyone
> think of a better name? When I hear ephemeral, I think of that TV chef -
> Emeril) for hostdevs in general - it would provide a method of easily
> allowing save/restore/migration for guests that have hostdevs that could
> be temporarily detached without ill consequences. I think proper
> operation would require that qemu notify libvirt when it's *really*
> finished detaching a device, though (I don't have it at hand right now,
> but there's an open BZ requesting that from qemu).)
>
>> These patches seem to add a lot of complexity for mere
>> syntactic sugar over existing capabilities.
>
> I agree that the two-in-one device adds a lot of complexity. If we could
> find a way to derive the PF used for the virtio-net device from the VF
> used for the hostdev without having a combined two-in-one device entry
> (and being able to use a common auto-generated MAC address would be nice
> too), then I would agree that it should be left as two separate device
> entries (if nothing else, this gives us an obvious place to put the PCI
> address of the 2nd device). I'm not sure how to do that without limiting
> pools to a single PF, though. (I know, I know - the solution is for a
> higher level management application to modify the guest's config during
> migration according to what's in use. But if we're going to do that
> anyway, we may as well not have network definitions defining pools of
> interfaces in the first place.)
>
> --
> libvir-list mailing list
> libvir-list@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/libvir-list