On Thu, Dec 24, 2015 at 11:03 PM, Lan Tianyu <tianyu.lan@xxxxxxxxx> wrote:
> Merry Christmas.
> Sorry for the late response; I was away on a personal matter.
>
> On December 14, 2015 at 03:30, Alexander Duyck wrote:
>>> > This sounds like we need to add a faked bridge for migration and add
>>> > a driver in the guest for it. It also needs to extend the PCI
>>> > bus/hotplug driver to pause/resume other devices, right?
>>> >
>>> > My concern is still whether we can change the PCI bus/hotplug like
>>> > that without a spec change.
>>> >
>>> > IRQ should be general for any device and we may extend it for
>>> > migration. The device driver can also decide whether or not to
>>> > support migration.
>>
>> The device should have no say in the matter. Either we are going to
>> migrate or we are not. This is why I have suggested my approach: it
>> allows for the least amount of driver intrusion while providing the
>> maximum number of ways to still perform migration even if the device
>> doesn't support it.
>
> Even if the device driver doesn't support migration, you still want to
> migrate the VM? That may be risky, and we should at least add the "bad
> path" for the driver.

At a minimum we should have support for hot-plug if we are expecting to
support migration. You would simply have to hot-unplug the device before
you start the migration and then return it afterwards. That is how the
current bonding approach for this works, if I am not mistaken.

The advantage we are looking to gain is to avoid removing/disabling the
device for as long as possible. Ideally we want to keep the device
active through the warm-up period, but if the guest doesn't support that
we should still be able to fall back on the older approaches if needed.

>>
>> The solution I have proposed is simple:
>>
>> 1. Extend swiotlb to allow for a page dirtying functionality.
>>
>> This part is pretty straightforward. I'll submit a few patches later
>> today as an RFC that provide the minimal functionality needed for
>> this.
>
> That would be very much appreciated.
>
>>
>> 2. Provide a vendor specific configuration space option on the QEMU
>> implementation of a PCI bridge to act as a bridge between direct
>> assigned devices and the host bridge.
>>
>> My thought was to add a vendor specific block that includes
>> capabilities, status, and control registers so you could go through
>> and synchronize things like the DMA page dirtying feature. The bridge
>> itself could manage the migration capable bit inside QEMU for all
>> devices assigned to it. So if you added a VF to the bridge it would
>> flag that you can support migration in QEMU, while the bridge would
>> indicate you cannot until the DMA page dirtying control bit is set by
>> the guest.
>>
>> We could also go through and optimize the DMA page dirtying after this
>> is added so that we can narrow down the scope of use and, as a result,
>> improve the performance for other devices that don't need to support
>> migration. It would then be a matter of adding an interrupt to the
>> device to handle an event such as the DMA page dirtying status bit
>> being set in the config space status register while the bit is not set
>> in the control register. If it doesn't get set then we would have to
>> evict the devices before the warm-up phase of the migration; otherwise
>> we can defer it until the end of the warm-up phase.
>>
>> 3. Extend the existing shpc driver to support the optional "pause"
>> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
>> hot-plug specification.
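(As an aside, to make item 1 above a bit more concrete: the sort of hook
I have in mind is roughly the shape below. The helper name and the way
the dirty state gets recorded are placeholders only -- this is not the
actual RFC patches, just an illustration.)

#include <linux/mm.h>
#include <linux/pfn.h>

/*
 * Illustrative sketch only.  The swiotlb bounce path already touches
 * every page a device DMAs into, so the copy-back step is a natural
 * place to mark those pages dirty for migration.
 */
static void swiotlb_mark_pages_dirty(phys_addr_t orig_addr, size_t size)
{
	unsigned long pfn = PFN_DOWN(orig_addr);
	unsigned long last_pfn = PFN_DOWN(orig_addr + size - 1);

	/* set_page_dirty() stands in for whatever mechanism ends up
	 * being used to feed the hypervisor's dirty log (bitmap,
	 * hypercall, etc.). */
	for (; pfn <= last_pfn; pfn++)
		if (pfn_valid(pfn))
			set_page_dirty(pfn_to_page(pfn));
}

This would be called from the DMA_FROM_DEVICE copy-back path, i.e.
wherever swiotlb copies data from the bounce slot back to the caller's
original buffer on sync/unmap.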
> Since your solution has added a faked PCI bridge, why not notify the
> bridge directly during migration via an IRQ and call the device
> driver's callback in the new bridge driver?
>
> Otherwise, the new bridge driver could also check whether the device
> driver provides a migration callback and, if so, call it to improve the
> passthrough device's performance during migration.

This is basically what I had in mind, though I would take things one
step further. You don't need to add any new callbacks if you make use of
the existing suspend/resume logic. For a VF this does exactly what you
would need: since VFs don't support Wake-on-LAN, it will simply clear
the bus master enable bit and put the netdev in a suspended state until
resume can be called.

The PCI hot-plug specification calls out that the OS can optionally
implement a "pause" mechanism which is meant to be used for high
availability type environments. What I am proposing is basically
extending the standard SHPC-capable PCI bridge so that we can support
DMA page dirtying for everything hosted on it, add a vendor specific
block to the config space so that the guest can notify the host that it
will do page dirtying, and add a mechanism to indicate that all hot-plug
events during the warm-up phase of the migration are pause events
instead of full removals.

I've been poking around in the kernel and QEMU code, and the part I have
been trying to sort out is how to get the QEMU-based pci-bridge to use
the SHPC driver, because from what I can tell the driver never actually
gets loaded on the device since it is left under the control of ACPI
hot-plug.

- Alex
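P.S. For concreteness, the vendor specific block on the bridge could
look something like the sketch below. The offsets, register names, and
bit definitions are purely illustrative -- they are only meant to show
the capability/control/status split described above, not to propose an
actual register layout.

/* Illustrative only -- not a real register map. */
#define BR_MIG_CAP           0x00      /* RO: what the bridge/host supports */
#define  BR_MIG_CAP_DIRTY    (1 << 0)  /* DMA page dirtying is available */

#define BR_MIG_CTRL          0x04      /* RW: written by the guest */
#define  BR_MIG_CTRL_DIRTY   (1 << 0)  /* guest will dirty its DMA pages */

#define BR_MIG_STATUS        0x08      /* RO: set by the host/QEMU */
#define  BR_MIG_STATUS_REQ   (1 << 0)  /* host is asking for page dirtying
                                        * ahead of a migration; if the guest
                                        * never sets the control bit, the
                                        * devices have to be evicted before
                                        * the warm-up phase instead */

The guest-side handshake in the bridge driver would then be little more
than the following, with bridge_readl()/bridge_writel() standing in as
placeholder accessors:

	if (bridge_readl(BR_MIG_CAP) & BR_MIG_CAP_DIRTY)
		bridge_writel(BR_MIG_CTRL, BR_MIG_CTRL_DIRTY);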