Re: [libvirt PATCH] qemu: Signal domain condition in qemuProcessStop a bit later

Jiri Denemark <jdenemar@xxxxxxxxxx> · Mon, 19 Jul 2021 15:46:39 +0200

On Mon, Jul 19, 2021 at 09:37:01 +0200, Michal Prívozník wrote:
> On 7/16/21 5:06 PM, Jiri Denemark wrote:
> > Signaling the condition before vm->def->id is reset to -1 is dangerous:
> > in case a waiting thread wakes up, it does not see anything interesting
> > (the domain is still marked as running) and just enters virDomainObjWait
> > where it waits forever because the condition will never be signalled
> > again.
> > 
> > Originally it was impossible to get into such situation because the vm
> > object was locked all the time between signaling the condition and
> > resetting vm->def->id, but after commit 860a999802 released in 6.8.0,
> > qemuDomainObjStopWorker called in qemuProcessStop between
> > virDomainObjBroadcast and setting vm->def->id to -1 unlocks the vm
> > object giving other threads a chance to wake up and possibly hang.
> > 
> > In real world, this can be easily reproduced by killing, destroying, or
> > just shutting down (from the guest OS) a domain while it is being
> > migrated somewhere else. The migration job would never finish.
> > 
> > We can't fix this by reseting vm->def->id earlier because other
> > functions (such as qemuProcessKill) do nothing when the domain is
> > already marked as inactive. So let's make sure we delay signaling the
> 
> Not true really. The VIR_QEMU_PROCESS_KILL_NOCHECK flag is passed to
> qemuProcessKill which means the check for domain state (aka vm->def->id
> == -1) is skipped.

Heh, I completely missed this flag when looking at the code :-/

I dropped this part of the commit message.

Jirka