On 7/25/24 14:33, Peter Krempa wrote: > On Thu, Jul 25, 2024 at 12:57:59 +0200, Michal Privoznik wrote: >> Currently, qemuProcessStop() unlocks given domain object right in >> the middle of cleanup process. This is dangerous because there >> might be another thread which is executing virDomainObjListAdd(). >> And since the domain object is on the list of domain objects AND >> by the time qemuProcessStop() unlocks it the object is also >> marked as inactive, the other thread acquires the lock and >> switches vm->def pointer. >> >> The unlocking of domain object is needed though, to allow even >> processing thread finish its queue. Well, the processing can be >> done before any cleanup is attempted. >> >> Therefore, use freshly introduced virEventThreadStop() to join >> the event thread and drop lock/unlock from the middle of >> qemuProcessStop(). >> >> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0 >> Resolves: https://issues.redhat.com/browse/RHEL-49607 >> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx> >> --- >> src/qemu/qemu_process.c | 14 ++++++++++---- >> 1 file changed, 10 insertions(+), 4 deletions(-) >> >> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c >> index 6255ba92e7..0869307a90 100644 >> --- a/src/qemu/qemu_process.c >> +++ b/src/qemu/qemu_process.c >> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver, >> * reporting so we don't squash a legit error. */ >> virErrorPreserveLast(&orig_err); >> >> + /* By unlocking the domain object the events processing thread is >> + * allowed to finish its job. */ >> + if (priv->eventThread) { >> + virObjectUnlock(vm); >> + virEventThreadStop(priv->eventThread); >> + virObjectLock(vm); >> + } > > Moving this unlocking above the line where we reset vm->def->id below > directly contradicts ... > > >> + >> if (asyncJob != VIR_ASYNC_JOB_NONE) { >> if (virDomainObjBeginNestedJob(vm, asyncJob) < 0) >> goto cleanup; >> @@ -8681,10 +8689,8 @@ void qemuProcessStop(virQEMUDriver *driver, >> /* Wake up anything waiting on domain condition */ >> virDomainObjBroadcast(vm); >> >> - /* IMPORTANT: qemuDomainObjStopWorker() unlocks @vm in order to prevent >> - * deadlocks with the per-VM event loop thread. This MUST be done after >> - * marking the VM as dead */ > > ... this message here. Now due to other patches I've pushed, this is now > not a problem directly as we now keep the 'beingDestroyed' flag set > during the whole time thanks to my patch and Jirka's patch fixing few > cases where the flag would be leaked. > > What I'm missing is a note why this is safe which I'd expect in the > commit message: > > This patch moves the unlocking of the VM object once again above the > code which marks the VM as inactive by clearing 'vm->def->id'. > > At this point it's now safe to do as other qemu driver code which may > be synchronized on the VM lock now also checks the > 'priv->beingDestroyed' flag in addition to 'vm->def->id'. > > I'd also most likely prefer if 'qemuProcessStop' explicitly sets > 'priv->beingDestroyed' to true before unlocking. The flag will be later > cleared but I didn't really go through all places where it's called to > see if we assert the flag before. I can set the flag, sure. And to some extent it feels safer. But, since the domain object is no longer unlocked I'm wondering whether the flag is needed at all? I mean, with this patch merged, we no longer have a race condition, do we? Michal