On 09/19/2017 09:56 AM, Michal Privoznik wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=1448268
>
> When migrating to a file (e.g. when doing 'virsh save file'),
> couple of things are happening in the thread that is executing
> the API:
>
> 1) the domain obj is locked
> 2) iohelper is spawned as a separate process to handle all I/O
> 3) the thread waits for iohelper to finish
> 4) the domain obj is unlocked
>
> Now, the problem is that while the thread waits in step 3 for
> iohelper to finish this may take ages because iohelper calls
> fdatasync(). And unfortunately, we are waiting the whole time
> with the domain locked. So if another thread wants to jump in and
> say copy the domain name ('virsh list' for instance), they are
> stuck.
>
> The solution is to unlock the domain whenever waiting for I/O and
> lock it back again when it finished.
>
> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
> ---
>
> diff to v1:
> - check after locking the domain obj back if the domain is still alive,
>   since it was unlocked it might have changed its state.
>
>  src/qemu/qemu_driver.c | 33 +++++++++++++++++++++++++++++----
>  1 file changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index e1a0dd553..6095a5ec5 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -3216,6 +3216,31 @@ qemuOpenFileAs(uid_t fallback_uid, gid_t fallback_gid,
>      goto cleanup;
>  }
>
> +
> +static int
> +qemuFileWrapperFDClose(virDomainObjPtr vm,
> +                       virFileWrapperFdPtr fd)
> +{
> +    int ret;
> +
> +    /* virFileWrapperFd uses iohelper to write data onto disk.
> +     * However, iohelper calls fdatasync() which may take ages to
> +     * finish. Therefore, we shouldn't be waiting with the domain
> +     * object locked.
> +     */
> +
> +    virObjectUnlock(vm);
> +    ret = virFileWrapperFdClose(fd);
> +    virObjectLock(vm);
> +    if (!virDomainObjIsActive(vm)) {
> +        if (!virGetLastError())
> +            virReportError(VIR_ERR_OPERATION_FAILED, "%s",
> +                           _("domain is no longer running"));
> +        ret = -1;
> +    }
> +    return ret;
> +}
> +
> +
>  /* Helper function to execute a migration to file with a correct save header
>   * the caller needs to make sure that the processors are stopped and do all other
>   * actions besides saving memory */
> @@ -3276,7 +3301,7 @@ qemuDomainSaveMemory(virQEMUDriverPtr driver,
>          goto cleanup;
>      }
>
> -    if (virFileWrapperFdClose(wrapperFd) < 0)
> +    if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
>          goto cleanup;
>
>      if ((fd = qemuOpenFile(driver, vm, path, O_WRONLY, NULL, NULL)) < 0 ||
> @@ -3827,7 +3852,7 @@ doCoreDump(virQEMUDriverPtr driver,
>                         path);
>          goto cleanup;
>      }
> -    if (virFileWrapperFdClose(wrapperFd) < 0)
> +    if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
>          goto cleanup;
>
>      ret = 0;
> @@ -6687,7 +6712,7 @@ qemuDomainRestoreFlags(virConnectPtr conn,
>
>      ret = qemuDomainSaveImageStartVM(conn, driver, vm, &fd, data, path,
>                                       false, QEMU_ASYNC_JOB_START);
> -    if (virFileWrapperFdClose(wrapperFd) < 0)
> +    if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
>          VIR_WARN("Failed to close %s", path);

This and the next one - the warning may not be completely true if the -1
return is because the guest is not active. Furthermore, for both, a
failure because the domain was inactive wouldn't be propagated, since
both callers only care whether qemuDomainSaveImageStartVM succeeded.
>
>      qemuProcessEndJob(driver, vm);
> @@ -6962,7 +6987,7 @@ qemuDomainObjRestore(virConnectPtr conn,
>
>      ret = qemuDomainSaveImageStartVM(conn, driver, vm, &fd, data, path,
>                                       start_paused, asyncJob);
> -    if (virFileWrapperFdClose(wrapperFd) < 0)
> +    if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
>          VIR_WARN("Failed to close %s", path);
>

For this one, ignoring an inactive-domain failure means perhaps
unlinking the managed save file and then, perhaps unexpectedly, running
qemuProcessStart again...

For these two cases there's just enough fear to keep me from being fully
supportive of an R-B. The Save and Dump cases are fine, but it would
seem to me we really want these restore operations to be quite
autonomous and not be impacted by some outside influence causing the
domain to become inactive again. Besides, the BZ only mentions Save and
Dump.

if:

 1. qemuFileWrapperFDClose is commented to indicate it is only to be
    used for the "non-restore" cases
 2. virFileWrapperFdClose is kept for the two restore cases

then:

Reviewed-by: John Ferlan <jferlan@xxxxxxxxxx>

else:

I need more convincing that the Restore cases can release the lock.
IIRC, those have much faster processing since we're reading from file
into a buffer rather than waiting for a buffer to be flushed to disk on
write... Yes, it impacts 'virsh list' (and other similar commands), but
for at least qemuDomainObjRestore there's a fair amount of code
afterwards that may assume the domain is still active.

John

>  cleanup:
> --

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list