On Sun, Nov 29, 2015 at 10:53 PM, Lan, Tianyu <tianyu.lan@xxxxxxxxx> wrote:
> On 11/26/2015 11:56 AM, Alexander Duyck wrote:
>>
>> I am not saying you cannot modify the drivers, however what you are
>> doing is far too invasive.  Do you seriously plan on modifying all of
>> the PCI device drivers out there in order to allow any device that
>> might be direct assigned to a port to support migration?  I certainly
>> hope not.  That is why I have said that this solution will not scale.
>
> Current drivers are not migration friendly.  If a driver wants to
> support migration, it has to be changed.

Modifying all of the drivers directly will not solve the issue though.
This is why I have suggested looking at possibly implementing something
like dma_mark_clean(), which is used on the ia64 architecture to mark
pages that were DMAed in as clean.  In your case, though, you would
want to mark such pages as dirty so that the page migration will notice
them and move them over.

> RFC PATCH V1 presented our ideas about how to deal with MMIO, ring,
> and DMA tracking during migration.  These are common for most drivers;
> they may be problematic in the previous version, but that can be
> corrected later.

They can only be corrected if the underlying assumptions are correct,
and they aren't.  Your solution would never have worked correctly.  The
problem is that you assume you can keep the device running while you
are migrating, and you simply cannot.  At some point you will always
have to stop the device in order to complete the migration, and you
cannot stop it before you have stopped your page tracking mechanism.
So unless the platform has an IOMMU that somehow takes part in the
dirty page tracking, you will not be able to stop the guest and then
the device; it will have to be the device and then the guest.

> Doing suspend and resume() may help to do migration easily, but some
> devices require low service downtime, especially network devices: I
> gather that some cloud companies have promised less than 500ms of
> network service downtime.

Honestly, focusing on the downtime is putting the cart before the
horse.  First you need to be able to do this without corrupting system
memory, regardless of the state of the device.  You haven't even gotten
to that point yet; last I knew, the device had to be up in order for
your migration to even work.

Many devices are very state driven.  As such, you cannot just freeze
them and restore them like you would regular device memory.  That is
where something like suspend/resume comes in, because it already takes
care of getting the device ready for halt, and then resume.  Keep in
mind that those functions were meant to operate on a device doing
something like a suspend to RAM or disk.  That is not too far off from
what a migration is doing, since you need to halt the guest before you
move it.

As such, the first step is to make it so that we can do the current
bonding approach with one change: specifically, we want to leave the
device in the guest until the last portion of the migration instead of
having to remove it first.  To that end I would suggest focusing on
solving the DMA problem via something like a dma_mark_clean() type
solution, as that would be one issue resolved and we would all see an
immediate gain instead of just the users of the ixgbevf driver.
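Just to make that concrete, something along the lines of the sketch
below is what I have in mind.  To be clear, dma_mark_dirty() is a
made-up name mirroring dma_mark_clean(); none of this exists today,
and a real version would also have to worry about racing with
concurrent CPU writes to the same byte.

#include <linux/mm.h>

/*
 * Sketch only: dma_mark_dirty() does not exist; the name mirrors the
 * dma_mark_clean() hook that lib/swiotlb.c calls on the sync/unmap
 * paths.  It would be invoked for DMA_FROM_DEVICE buffers after the
 * device has written into them.
 */
static void dma_mark_dirty(void *addr, size_t size)
{
	unsigned long pg  = (unsigned long)addr & PAGE_MASK;
	unsigned long end = (unsigned long)addr + size;

	for (; pg < end; pg += PAGE_SIZE) {
		volatile unsigned char *p = (volatile unsigned char *)pg;

		/*
		 * Rewrite one byte per page.  The store is what the
		 * hypervisor's dirty-page logging actually sees.
		 */
		*p = *p;
	}
}

The trick is that the hypervisor's dirty tracking only sees CPU
writes, so rewriting one byte of each page after the device has DMAed
into it is enough to get the page re-copied during the migration.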
> So I think performance effect also should be taken into account when
> we design the framework.

What you are proposing I would call premature optimization.  You need
to actually solve the problem before you can start optimizing things,
and I don't see anything actually solved yet, since your solution is
too unstable.

>> What I am counter proposing seems like a very simple proposition.  It
>> can be implemented in two steps.
>>
>> 1.  Look at modifying dma_mark_clean().  It is a function called in
>> the sync and unmap paths of lib/swiotlb.c.  If you could somehow
>> modify it to take care of marking the pages you unmap for Rx as being
>> dirty, it will get you a good way towards your goal, as it will allow
>> you to continue to do DMA while you are migrating the VM.
>>
>> 2.  Look at making use of the existing PCI suspend/resume calls that
>> are there to support PCI power management.  They have everything
>> needed to allow you to pause and resume DMA for the device before and
>> after the migration while retaining the driver state.  If you can
>> implement something that allows you to trigger these calls from the
>> PCI subsystem, such as hot-plug, then you would have a generic
>> solution that can be easily reproduced for multiple drivers beyond
>> those supported by ixgbevf.
>
> Glanced at the PCI hotplug code.  The hotplug events are triggered by
> the PCI hotplug controller, and these events are defined in the
> controller spec, so it's hard to add new ones.  We would also need to
> add some specific code to the PCI hotplug core, since it only adds and
> removes PCI devices when it gets events.  It's also a challenge to
> modify the Windows hotplug code.  So we may need to find another way.

For now we can use conventional hot-plug.  Removing the device should
be fairly quick, and I suspect it would only dirty a few megs of
memory, so just using conventional hot-plug for now is probably
workable.  The suspend/resume approach would be a follow-up in order to
improve the speed of migration, since those functions are more
lightweight than a remove/probe; a rough sketch of how they might be
driven is appended below.

- Alex
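P.S. To make step 2 above a bit more concrete, here is a rough sketch
of how the existing power-management callbacks could be driven around
the final stage of a migration.  The helper names are hypothetical and
this is not an existing kernel interface; the point is just that the
driver already knows how to quiesce and restore itself, so the
migration path only has to invoke that at the right time.

#include <linux/pci.h>
#include <linux/pm.h>

/* Hypothetical helper: stop DMA and save device state, exactly as a
 * suspend-to-RAM transition would, just before the final stop-and-copy
 * phase of the migration. */
static int migration_pause_device(struct pci_dev *pdev)
{
	const struct dev_pm_ops *pm =
		pdev->dev.driver ? pdev->dev.driver->pm : NULL;

	if (!pm || !pm->suspend)
		return -ENOSYS;

	return pm->suspend(&pdev->dev);
}

/* Hypothetical helper: restore state and restart DMA once the guest is
 * running on the destination. */
static int migration_resume_device(struct pci_dev *pdev)
{
	const struct dev_pm_ops *pm =
		pdev->dev.driver ? pdev->dev.driver->pm : NULL;

	if (!pm || !pm->resume)
		return -ENOSYS;

	return pm->resume(&pdev->dev);
}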