On Wed, Jul 31, 2024 at 8:58 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Jul 31, 2024 at 03:41:00AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 31, 2024 at 08:04:24AM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Jul 30, 2024 at 05:32:48PM -0400, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 30, 2024 at 04:03:53PM -0400, Peter Xu wrote:
> > > > > On Tue, Jul 30, 2024 at 03:22:50PM -0400, Michael S. Tsirkin wrote:
> > > > > > This is not what we did historically. Why should we start now?
> > > > >
> > > > > It's a matter of whether we still want migration to randomly fail, like
> > > > > what this patch does.
> > > > >
> > > > > Or any better suggestions? I'm definitely open to that.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --
> > > > > Peter Xu
> > > >
> > > > Randomly is an overstatement. You need to switch between kernels
> > > > where this feature differs. We did it with a ton of features
> > > > in the past, donnu why we single out USO now.
> > >
> > > This has been a problem with a ton of features in the past. We've
> > > ignored the problem, but that doesn't make it the right solution
> > >
> > > With regards,
> > > Daniel
> >
> > Pushing it to domain xml does not really help,
> > migration will still fail unexpectedly (after wasting
> > a ton of resources copying memory, and getting
> > a downtime bump, I might add).
>
> Could you elaborate why it would fail if with what I proposed?
>
> Note that if this is a generic comment about "any migration can fail if we
> found a device mismatch", we have plan to fix that to some degree. It's
> just that we don't have enough people working on these topics yet. See:
>
> https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake
>
> It includes:
>
> "Check device tree on both sides, etc., to make sure the migration is
> applicable. E.g., we should fail early and clearly on any device
> mismatch."
>
> However I don't think it'll cover all checks, e.g. I _think_ even if we
> verify VMSDs then post_load() hooks can still fail, and there can be some
> corner cases to think. And of course, this may not even apply to virtio
> since virtio manages migration itself, without providing a top-level vmsd.
>
> >
> > The right solution is to have a tool that can query
> > backends, and that given the results from all of the cluster,
> > generate a set of parameters that will ensure migration works.

This seems very hard to do for vhost-user backends.

> > Kind of like qemu-img, but for migration.
>
> This is adding extra work, IMHO.
>
> If we stick with "qemu cmdline as guest ABI" concept, I think we're all
> fine, as that work is done by QEMU booting up first on both sides,
> including dest.

Letting QEMU do the probing is probably much easier than reimplementing
the probe in the upper layer.

> Basically Libvirt already plays this role of the new tool
> without any new code to be added at all: what captured on the boot failure
> log will be the output of that tool if we write it.
>
> Thanks,

Thanks

>
> --
> Peter Xu
>
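
For illustration only, here is a minimal sketch of what the proposed cluster-wide tool would have to compute once it has probe results from every host: the set of features common to all of them, turned into on/off settings. The function names, host names and feature labels ("tso", "uso") are hypothetical placeholders, not real QEMU or libvirt interfaces, and the probing itself (the hard part for vhost-user) is assumed to have happened already.

```python
def common_features(host_features):
    """Return the set of features that every host reports as supported."""
    feature_sets = [set(features) for features in host_features.values()]
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


def migration_params(host_features, candidates):
    """Map each candidate feature to "on" only if all hosts support it."""
    common = common_features(host_features)
    return {name: ("on" if name in common else "off")
            for name in sorted(candidates)}


if __name__ == "__main__":
    # Hypothetical probe results; host and feature names are placeholders.
    cluster = {
        "host-a": {"tso", "uso"},
        "host-b": {"tso"},
        "host-c": {"tso", "uso"},
    }
    print(migration_params(cluster, {"tso", "uso"}))
```

With these placeholder inputs "uso" ends up off because host-b lacks it, so a guest started with the resulting settings could move between any pair of the three hosts.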