On Tue, Jul 23, 2024 at 03:06:41PM +0200, Peter Krempa wrote: > On Tue, Jul 23, 2024 at 13:51:09 +0100, Daniel P. Berrangé wrote: > > On Tue, Jul 23, 2024 at 02:41:13PM +0200, Peter Krempa wrote: > > > Since I don't currently have a better idea of how to fix this I'm okay > > > with this patch given the following conditions: > > > > > > > Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0 > > > > > > Explain that the above commit inverted the order of setting the VM as > > > inactive and unlocking thus allowing the above sequence of events to > > > happen, and, > > > > Why not just revert that change ? It claimed to be making things > > safer, but did the opposite. Even with this fixup below I'm > > pretty uncomfortable with setting 'id = -1' and unlocking the > > @vm, before we've done all our cleanup. > > You'd have to also revert d29e0f3d4a5362d7b33261df1e55896396707de3 which > is the commit actually moving the unlock of the VM object inside > qemuProcessStop after setting id = -1 which actually does fix a > crash on migration. > > We can possibly move the qemuDomainObjStopWorker(vm) call as the last > thing in qemuProcessStop, so that everything is cleaned up before > unlocking. > > Either way, unlocking inside qemuProcessStop is fragile as we're giving > a chance for races which would not be possible before. Yeah, the unlocking is pretty sketchy. That arrived back in commit 860a999802d3c82538373bb3f314f92a2e258754 Author: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx> Date: Thu Jul 23 11:02:59 2020 +0300 qemu: avoid deadlock in qemuDomainObjStopWorker We are dropping the only reference here so that the event loop thread is going to be exited synchronously. In order to avoid deadlocks we need to unlock the VM so that any handler being called can finish execution and thus even loop thread be finished too. Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx> Reviewed-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx> Reviewed-by: Daniel P. Berrangé <berrange@xxxxxxxxxx> It is bad design that a mere g_object_unref has side effects in the object finalizer which synchronize on external code. Finalizers really should only do self-contained operations that cannot block. ie release resources. We've got a g_thread_join() in vir_event_thread_finalize which is the root of our problems. So I think we need to split the event loop cleanup into two phases. Right at the start of qemuProcessStop, before doing any cleanup, we could issue some sort of a "shutdown" call to the event loop, to tell it to finish processing pending work and stop the event loop. While waiting for this to be done, we can safely release the mutex. Then we do the qemuProcessStop as usual, and when it comes time to unref the event thread object, there won't be any synchronization required, it will be a plain resource release. Thus we don't need to release the mutex anymore. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|