Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()

Peter Krempa <pkrempa@xxxxxxxxxx> · Fri, 26 Jul 2024 11:01:43 +0200

On Thu, Jul 25, 2024 at 15:04:51 +0200, Michal Prívozník wrote:
> On 7/25/24 14:33, Peter Krempa wrote:
> > On Thu, Jul 25, 2024 at 12:57:59 +0200, Michal Privoznik wrote:
> >> Currently, qemuProcessStop() unlocks given domain object right in
> >> the middle of cleanup process. This is dangerous because there
> >> might be another thread which is executing virDomainObjListAdd().
> >> And since the domain object is on the list of domain objects AND
> >> by the time qemuProcessStop() unlocks it the object is also
> >> marked as inactive, the other thread acquires the lock and
> >> switches vm->def pointer.
> >>
> >> The unlocking of domain object is needed though, to allow even
> >> processing thread finish its queue. Well, the processing can be
> >> done before any cleanup is attempted.
> >>
> >> Therefore, use freshly introduced virEventThreadStop() to join
> >> the event thread and drop lock/unlock from the middle of
> >> qemuProcessStop().
> >>
> >> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
> >> Resolves: https://issues.redhat.com/browse/RHEL-49607
> >> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
> >> ---
> >>  src/qemu/qemu_process.c | 14 ++++++++++----
> >>  1 file changed, 10 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> >> index 6255ba92e7..0869307a90 100644
> >> --- a/src/qemu/qemu_process.c
> >> +++ b/src/qemu/qemu_process.c
> >> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,

So we're still in qemuProcessStop ...

> >>       * reporting so we don't squash a legit error. */
> >>      virErrorPreserveLast(&orig_err);
> >>  
> >> +    /* By unlocking the domain object the events processing thread is
> >> +     * allowed to finish its job. */
> >> +    if (priv->eventThread) {
> >> +        virObjectUnlock(vm);

... unlocking the VM object, before 'vm->def->id' is set to -1 ...

> >> +        virEventThreadStop(priv->eventThread);
> >> +        virObjectLock(vm);
> >> +    }
> > 
> > Moving this unlocking above the line where we reset vm->def->id below
> > directly contradicts ...
> > 
> > 
> >> +
> >>      if (asyncJob != VIR_ASYNC_JOB_NONE) {
> >>          if (virDomainObjBeginNestedJob(vm, asyncJob) < 0)
> >>              goto cleanup;
> >> @@ -8681,10 +8689,8 @@ void qemuProcessStop(virQEMUDriver *driver,
> >>      /* Wake up anything waiting on domain condition */
> >>      virDomainObjBroadcast(vm);
> >>  
> >> -    /* IMPORTANT: qemuDomainObjStopWorker() unlocks @vm in order to prevent
> >> -     * deadlocks with the per-VM event loop thread. This MUST be done after
> >> -     * marking the VM as dead */
> > 
> > ... this message here. Now due to other patches I've pushed, this is now
> > not a problem directly as we now keep the 'beingDestroyed' flag set
> > during the whole time thanks to my patch and Jirka's patch fixing few
> > cases where the flag would be leaked.
> > 
> > What I'm missing is a note why this is safe which I'd expect in the
> > commit message:
> > 
> >  This patch moves the unlocking of the VM object once again above the
> >  code which marks the VM as inactive by clearing 'vm->def->id'.
> > 
> >  At this point it's now safe to do as other qemu driver code which may
> >  be synchronized on the VM lock now also checks the
> >  'priv->beingDestroyed' flag in addition to 'vm->def->id'.
> > 
> > I'd also most likely prefer if 'qemuProcessStop' explicitly sets
> > 'priv->beingDestroyed' to true before unlocking. The flag will be later
> > cleared but I didn't really go through all places where it's called to
> > see if we assert the flag before.
> 
> I can set the flag, sure. And to some extent it feels safer. But, since
> the domain object is no longer unlocked I'm wondering whether the flag
> is needed at all?

...

the original bug (crash if VM shut down itself during migration)
happened when qemuProcessStop() unlocked @vm before setting vm->def->id
to -1.

Now the impacted code path did have a check that 'priv->beingDestroyed'
is set, but that was not the case before one of my patches following the
fix when I've moved the unlocking *after* setting vm->def->id  to -1.

The original migration bug is no longer possible because
'priv->beingDestroyed' is asserted before entering qemuProcessStop. I
didn't inspect all other callers of qemuProcessStop though whether it
does set 'priv->beingDestroyed'. Since you're moving the locking before
clearing of 'vm->def->id' and qemuProcessStop itself will
unconditionally reset 'priv->beingDestroyed' it's IMO easier just to
set 'priv->beingDestroyed' to be sure before unlocking.

> I mean, with this patch merged, we no longer have a race condition, do we?

As noted above, it depends on which race you mean. The but you are
fixing will be surely fixed, but this patch exactly undoes the fix for
the previous bug, and we'd only not regress in that regard because of
additional hardening later.

> 
> Michal
>