On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@xxxxxxxxx> wrote:
>
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>
>>> Ideally, we would be able to leave the guest driver unmodified, but that
>>> requires the hypervisor or qemu to be aware of the device, which means
>>> we may need a driver in the hypervisor or qemu to handle the device on
>>> behalf of the guest driver.
>>
>> Can you answer the question of when do you use your code -
>> at the start of migration or
>> just before the end?
>
> Just before stopping the VCPU in this version, and we inject a VF mailbox
> irq to notify the driver if the irq handler is installed.
> The Qemu side will also check this via the faked PCI migration capability,
> and the driver will set the status during its device open() or resume()
> callback.

The VF mailbox interrupt is a very bad idea.  Really the device should be
in a reset state on the other side of a migration.  It doesn't make sense
to have the interrupt firing if the device is not configured.  This is one
of the things that is preventing you from being able to migrate the device
while the interface is administratively down or the VF driver is not
loaded.

My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a requirement
that all direct-assigned devices have to exist behind that device in order
to support migration.  That way you would be working with a directly
emulated device that would likely already be supporting hot-plug anyway.
Then it would just be a matter of coming up with a few Qemu-specific
extensions that you would need to add to the device itself.  The same
approach would likely be portable enough that you could achieve it with
PCIe as well, via the same configuration space being present on the
upstream side of a PCIe port or maybe a PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind the bridge require DMA
page dirtying.  You could use that configuration space, in addition to the
interrupt already provided for hot-plug, to signal things like when you
are starting migration, and possibly even extend the shpc functionality so
that if this capability is present you have the option to pause/resume
instead of remove/probe the device for certain hot-plug events.  The fact
is there may be some use for a pause/resume type approach for PCIe
hot-plug in the near future anyway.  From the sounds of it Apple has
required it for all Thunderbolt device drivers so that they can halt the
device in order to shuffle resources around; perhaps we should look at
something similar for Linux.

The other advantage of grouping functions behind one bridge is things like
reset domains.  The PCI error handling logic will want to be able to reset
any devices that experienced an error in the event of something such as a
surprise removal.  By grouping all of the devices you could
disable/reset/enable them as one logical group in the event of something
such as the "bad path" approach Michael has mentioned.
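Just to make the reset domain part a little more concrete, below is a rough
sketch of what taking down and bringing back everything behind such a
bridge could look like on the kernel side.  The function name and the
point where it would be called are made up for the example; only the
bridge control bits and the PCI helpers are existing ones, and a real
implementation would obviously need locking and coordination with any
bound drivers:

#include <linux/pci.h>
#include <linux/delay.h>

/*
 * Sketch only: reset every function behind an emulated PCI-to-PCI bridge
 * as one logical group via a secondary bus reset on the bridge itself.
 */
static void reset_group_behind_bridge(struct pci_dev *bridge)
{
	struct pci_dev *dev;
	u16 ctrl;

	/* Save config state for every function behind the bridge. */
	list_for_each_entry(dev, &bridge->subordinate->devices, bus_list)
		pci_save_state(dev);

	/* Assert, then deassert, secondary bus reset on the bridge. */
	pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &ctrl);
	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL,
			      ctrl | PCI_BRIDGE_CTL_BUS_RESET);
	msleep(2);
	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, ctrl);

	/* Give the devices time to come back before touching them. */
	msleep(1000);

	/* Restore config state so the group comes back as one unit. */
	list_for_each_entry(dev, &bridge->subordinate->devices, bus_list)
		pci_restore_state(dev);
}

The same walk over the bridge's subordinate bus is what would let you
treat "everything behind this bridge" as a single unit for DMA page
dirtying or for a pause/resume cycle during migration.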
>>
>>>> It would be great if we could avoid changing the guest; but at least
>>>> your guest driver changes don't actually seem to be that hardware
>>>> specific; could your changes actually be moved to the generic PCI level
>>>> so they could be made to work for lots of drivers?
>>>
>>> It is impossible to use one common solution for all devices unless the
>>> PCIe spec documents it clearly, and I think one day it will be there.
>>> But before that, we need some workarounds in the guest driver to make
>>> it work, even if it looks ugly.
>
> Yes, so far there is no hardware migration support and it's hard to modify
> bus-level code.  It also will block implementation on Windows.

Please don't assume things.  Unless you have hard data from Microsoft that
says they want it this way, let's just try to figure out what works best
for us for now; we can start worrying about third-party implementations
after we have figured out a solution that actually works.

- Alex