Anthony Liguori <anthony@xxxxxxxxxxxxx> wrote:
> On 11/14/2011 04:16 AM, Daniel P. Berrange wrote:
>> On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
>>> On 11/11/2011 12:15 PM, Kevin Wolf wrote:
>>>> On 10.11.2011 22:30, Anthony Liguori wrote:
>>>>> Live migration with qcow2 or any other image format is just not
>>>>> going to work right now even with proper clustered storage. I
>>>>> think doing a block level flush cache interface and letting block
>>>>> devices decide how to do it is the best approach.
>>>>
>>>> I would really prefer reusing the existing open/close code. It means
>>>> less (duplicated) code, is existing code that is well tested and
>>>> doesn't make migration much of a special case.
>>>>
>>>> If you want to avoid reopening the file on the OS level, we can
>>>> reopen only the topmost layer (i.e. the format, but not the
>>>> protocol) for now, and in 1.1 we can use bdrv_reopen().
>>>
>>> Intuitively I dislike _reopen style interfaces. If the second open
>>> yields different results from the first, does it invalidate any
>>> computations in between?
>>>
>>> What's wrong with just delaying the open?
>>
>> If you delay the 'open' until the mgmt app issues 'cont', then you
>> lose the ability to roll back to the source host upon open failure
>> for most deployed versions of libvirt. We only fairly recently
>> switched to a five stage migration handshake to cope with rollback
>> when 'cont' fails.
>
> Delayed open isn't a panacea. With the series I sent, we should be
> able to migrate with a qcow2 file on coherent shared storage.
>
> There are two other cases that we care about: migration with NFS
> cache!=none, and direct attached storage with cache!=none.
>
> With NFS, whether the open is deferred matters less than whether the
> open happens after the close on the source. To fix NFS cache!=none,
> we would have to do a bdrv_close() before sending the last byte of
> migration data and make sure that we bdrv_open() after receiving the
> last byte of migration data.
>
> The problem with this IMHO is that it creates a large window where no
> one has the file open and you're critically vulnerable to losing your
> VM.

A Red Hat NFS guru told me that an fsync() on the source plus an open()
after that on the target is enough. But anyway, it still depends on
nothing else having the file open on the target.

> I'm much more in favor of a smarter caching policy. If we can fcntl()
> our way to O_DIRECT on NFS, that would be fairly interesting. I'm not
> sure if this is supported today but it's something we could look into
> adding in the kernel. That way we could force NFS to O_DIRECT during
> migration, which would solve this problem robustly.

We would need O_DIRECT on the target during migration; I agree that
that would work.

> Deferred open doesn't help with direct attached storage. There simply
> is no guarantee that there isn't data in the page cache.

Yes. I asked the clustered filesystem people how they fixed this,
because clustered filesystems have exactly this problem. After lots of
arm twisting, I got ioctl(BLKFLSBUF, ...), but that only works:
- on Linux
- on some block devices

So we are back to square 1.
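To make that concrete, here is roughly what such a flush would look
like. This is an untested sketch, not QEMU code: flush_block_device()
is just an illustrative name, BLKFLSBUF is Linux-only, and as noted
above it only helps for some block devices. The idea would be to run
something like it on the target right before reopening the image, so
any stale pages cached for the device get dropped.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKFLSBUF */

/* Flush dirty data for a block device and drop its buffer cache. */
static int flush_block_device(const char *path)
{
    int ret = 0;
    int fd = open(path, O_RDWR);

    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (fsync(fd) < 0) {                /* write out anything dirty */
        perror("fsync");
        ret = -1;
    }
    if (ioctl(fd, BLKFLSBUF, 0) < 0) {  /* invalidate cached pages */
        perror("ioctl(BLKFLSBUF)");
        ret = -1;
    }
    close(fd);
    return ret;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
        return 1;
    }
    return flush_block_device(argv[1]) < 0;
}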
> Again, I think defaulting DAS to cache=none|directsync is what makes
> the most sense here.

I think it is the only sane solution. Otherwise we need to write the
equivalent of a lock manager to know _who_ has the storage, and
distributed lock managers are a mess :-(

> We can even add a migration blocker for DAS with cache=on. If we can
> do dynamic toggling of the cache setting, then that's pretty friendly
> at the end of the day.

That could fix the problem as well. At the moment we start migration,
we do an fsync() + switch to O_DIRECT for all filesystems. As you said,
time for implementing fcntl(O_DIRECT); a rough sketch of what that
switch could look like is in the PS below.

Later, Juan.
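PS: for illustration, this is what the fsync() + O_DIRECT switch could
look like on a plain file descriptor. An untested sketch, not QEMU
code: switch_to_direct_io() is just an illustrative name, and whether
NFS actually honours F_SETFL with O_DIRECT today is exactly the open
question above. Note also that once O_DIRECT is set, subsequent I/O
needs suitably aligned buffers.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Flush the file, then try to flip the existing fd to O_DIRECT. */
static int switch_to_direct_io(int fd)
{
    int flags;

    if (fsync(fd) < 0) {        /* make sure the page cache is clean */
        perror("fsync");
        return -1;
    }
    flags = fcntl(fd, F_GETFL);
    if (flags < 0) {
        perror("fcntl(F_GETFL)");
        return -1;
    }
    if (fcntl(fd, F_SETFL, flags | O_DIRECT) < 0) {
        perror("fcntl(F_SETFL, O_DIRECT)");   /* filesystem may refuse */
        return -1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <image-file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (switch_to_direct_io(fd) == 0)
        printf("%s: now using O_DIRECT\n", argv[1]);
    close(fd);
    return 0;
}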