Re: [PATCH] qemuProcessStop: Don't unlock domain until most of the cleanup is done

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Tue, 23 Jul 2024 14:17:01 +0100

On Tue, Jul 23, 2024 at 03:06:41PM +0200, Peter Krempa wrote:
> On Tue, Jul 23, 2024 at 13:51:09 +0100, Daniel P. Berrangé wrote:
> > On Tue, Jul 23, 2024 at 02:41:13PM +0200, Peter Krempa wrote:
> > > Since I don't currently have a better idea of how to fix this I'm okay
> > > with this patch given the following conditions:
> > > 
> > > > Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
> > > 
> > > Explain that the above commit inverted the order of setting the VM as
> > > inactive and unlocking thus allowing the above sequence of events to
> > > happen, and,
> > 
> > Why not just revert that change ? It claimed to be making things
> > safer, but did the opposite.   Even with this fixup below I'm
> > pretty uncomfortable with setting 'id = -1' and unlocking the
> > @vm, before we've done all our cleanup.
> 
> You'd have to also revert d29e0f3d4a5362d7b33261df1e55896396707de3 which
> is the commit actually moving the unlock of the VM object inside
> qemuProcessStop after setting id = -1 which actually does fix a
> crash on migration.
> 
> We can possibly move the qemuDomainObjStopWorker(vm) call as the last
> thing in qemuProcessStop, so that everything is cleaned up before
> unlocking.
> 
> Either way, unlocking inside qemuProcessStop is fragile as we're giving
> a chance for races which would not be possible before.

Yeah, the unlocking is pretty sketchy. That arrived back in

  commit 860a999802d3c82538373bb3f314f92a2e258754
  Author: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx>
  Date:   Thu Jul 23 11:02:59 2020 +0300

    qemu: avoid deadlock in qemuDomainObjStopWorker

    We are dropping the only reference here so that the event loop thread
    is going to be exited synchronously. In order to avoid deadlocks we
    need to unlock the VM so that any handler being called can finish
    execution and thus even loop thread be finished too.

    Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx>
    Reviewed-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx>
    Reviewed-by: Daniel P. Berrangé <berrange@xxxxxxxxxx>

It is bad design that a mere g_object_unref has side effects in the object
finalizer which synchronize on external code. Finalizers really should only
do self-contained operations that cannot block. ie release resources.
We've got a g_thread_join() in vir_event_thread_finalize which is the root
of our problems.

So I think we need to split the event loop cleanup into two phases.

Right at the start of qemuProcessStop, before doing any cleanup, we could
issue some sort of a "shutdown" call to the event loop, to tell it to
finish processing pending work and stop the event loop. While waiting for
this to be done, we can safely release the mutex.

Then we do the qemuProcessStop as usual, and when it comes time to unref
the event thread object, there won't be any synchronization required, it
will be a plain resource release. Thus we don't need to release the mutex
anymore.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|