On Fri, Jul 10, 2020 at 07:48:26 -0400, Mark Mielke wrote:
> On Fri, Jul 10, 2020 at 7:14 AM Jiri Denemark <jdenemar@xxxxxxxxxx> wrote:
> > On Sun, Jul 05, 2020 at 12:45:55 -0400, Mark Mielke wrote:
> > > With 6.4.0, live migration was working fine with QEMU 5.0. After
> > > trying out 6.5.0, migration broke with the following error:
> > >
> > > libvirt.libvirtError: internal error: unable to execute QEMU command
> > > 'migrate': State blocked by non-migratable CPU device (invtsc flag)
> >
> > Could you please describe the reproducer steps? For example, was the
> > domain you're trying to migrate already running when you upgraded
> > libvirt, or was it freshly started by the new libvirt?
>
> The original case was:
>
> 1) Machine X running libvirt 6.4.0 + QEMU 5.0
> 2) Machine Y running libvirt 6.5.0 + QEMU 5.0
> 3) Live migration from X to Y works. Guest appears fine.
> 4) Upgrade machine X from libvirt 6.4.0 to 6.5.0 and reboot.
> 5) Live migration from Y to X fails with the message shown.

Oh I see, so I guess the bad default is chosen during the incoming
migration to machine Y. I'll try to reproduce it myself to see what's
going on.

> In each case, live migration was done with OpenStack Train directing
> libvirt + QEMU.
>
> > And it would be helpful to see the <cpu> element as shown by virsh
> > dumpxml before you try to start the domain, as well as the QEMU
> > command line libvirt used to start the domain (in
> > /var/log/libvirt/qemu/$VM.log).
>
> The <cpu> element looks like this:
>
>   <cpu mode='host-passthrough' check='none'>
>     <topology sockets='1' dies='1' cores='4' threads='2'/>
>   </cpu>
>
> The QEMU command line is very long and includes details I would rather
> not publish publicly unless you need them. The "-cpu" portion is just:
>
>   -cpu host
>
> The QEMU command line itself is generated by libvirt, which is directed
> by OpenStack Train.

These are from machine X before step 3, right? Can you also share the
same from machine Y before step 5?

> I wasn't sure what QEMU_CAPS_CPU_MIGRATABLE represents. I initially
> suspected what you are saying, but since it apparently did not work the
> way I expected, I then presumed it does not work the way I expected. :-)
>
> Is QEMU_CAPS_CPU_MIGRATABLE derived only from the <cpu> element? If so,
> doesn't this mean that it is not explicitly listed for host-passthrough,
> and that the check is not properly detecting whether it is enabled?

QEMU_CAPS_CPU_MIGRATABLE comes from the QEMU capability probing.
Specifically, the capability is enabled when a given QEMU binary reports
a 'migratable' property for the CPU object. And the capability detection
tests show we should be detecting this capability properly:

tests/qemucapabilitiesdata $ git grep cpu.migratable
caps_2.12.0.x86_64.xml:  <flag name='cpu.migratable'/>
caps_3.0.0.x86_64.xml:  <flag name='cpu.migratable'/>
caps_3.1.0.x86_64.xml:  <flag name='cpu.migratable'/>
caps_4.0.0.x86_64.xml:  <flag name='cpu.migratable'/>
caps_4.1.0.x86_64.xml:  <flag name='cpu.migratable'/>
caps_4.2.0.x86_64.xml:  <flag name='cpu.migratable'/>
caps_5.0.0.x86_64.xml:  <flag name='cpu.migratable'/>
caps_5.1.0.x86_64.xml:  <flag name='cpu.migratable'/>

> I think it can go either way. There is also convention over
> configuration as a competing principle. However, I also prefer
> explicit. It just needs to be correct; otherwise explicit can be very
> bad, as it seems in my case. :-)

Of course, the explicit default must match the implicit one.
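
In the meantime, it would be interesting to see what the running domains
actually ended up with. The effective value can be read from the CPU
object via QMP; roughly like this, assuming the first vCPU sits at the
usual /machine/unattached/device[0] path (it can differ between setups)
and $VM is a placeholder for your domain name:

  virsh qemu-monitor-command --pretty $VM \
      '{"execute": "qom-get",
        "arguments": {"path": "/machine/unattached/device[0]",
                      "property": "migratable"}}'

If this returns false on machine Y while the original domain on machine X
reports true, it would confirm the wrong default is picked during the
incoming migration.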
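
The capability probing itself just asks QEMU whether the CPU device has
the property at all. You can do the same by hand with
device-list-properties; the following is only a rough equivalent of what
libvirt does (I believe we probe the 'max' CPU type for this, so take the
exact typename with a grain of salt):

  $ qemu-system-x86_64 -machine none -nodefaults -qmp stdio
  {"execute": "qmp_capabilities"}
  {"execute": "device-list-properties", "arguments": {"typename": "max-x86_64-cpu"}}

The reply should contain an entry like {"name": "migratable",
"type": "bool"} with any of the QEMU versions from the list above.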
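
And as a possible workaround until this is fixed: 6.5.0 added an explicit
migratable attribute for host-passthrough CPUs, so you should be able to
pin the migratable behaviour in the domain XML instead of relying on the
default, e.g.:

  <cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='1' dies='1' cores='4' threads='2'/>
  </cpu>

which should end up as "-cpu host,migratable=on" on the QEMU command
line. I have not verified that this helps in your exact scenario, though.

Jirka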