On 27.11.2019 17:56, Peter Krempa wrote:
> On Wed, Nov 27, 2019 at 17:19:18 +0300, Nikolay Shirokovskiy wrote:
>> Due to a race, qemuDomainGetBlockJobInfo can report that there is
>> no block job for a disk, yet a later call to spawn a new blockjob
>> can fail because libvirt has not yet processed the blockjob
>> completion internally. Thus let's wait for the blockjob to finish
>> if we report there is no more blockjob.
>
> Could you please elaborate on how this happened? E.g. provide some
> logs?
>
> Polling with qemuDomainGetBlockJobInfo will always be racy, and if
> the problem is that diskPriv->blockjob is allocated but qemu didn't
> report anything, I suspect the problem is somewhere else.
>
>
>>
>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx>
>> ---
>>  src/qemu/qemu_driver.c | 18 +++++++++++++++++-
>>  1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>> index 669c12d6ca..b148df3a57 100644
>> --- a/src/qemu/qemu_driver.c
>> +++ b/src/qemu/qemu_driver.c
>> @@ -17785,12 +17785,26 @@ qemuDomainGetBlockJobInfo(virDomainPtr dom,
>>          goto endjob;
>>      }
>>
>> +    qemuBlockJobSyncBegin(job);
>> +
>>      qemuDomainObjEnterMonitor(driver, vm);
>>      ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name, &rawInfo);
>>      if (qemuDomainObjExitMonitor(driver, vm) < 0)
>>          ret = -1;
>> -    if (ret <= 0)
>> +    if (ret < 0)
>> +        goto endjob;
>> +
>> +    if (ret == 0) {
>
> So if qemu returns that there is no blockjob ...
>
>> +        qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
>> +        while (qemuBlockJobIsRunning(job)) {
>
> ... but we think it's still running, one possibility is that the
> event arrived but wasn't processed yet. In that case this would
> help, but the outcome would not differ much from not having this
> code, as the event would be processed later anyway.
>
>
>> +            if (virDomainObjWait(vm) < 0) {
>> +                ret = -1;
>> +                goto endjob;
>> +            }
>> +            qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
>
> The other possibility is that there's a bug: libvirt thinks there's
> a job but qemu is right that there is none. In that case this will
> lock up here, as the event will never arrive.
>
>> +        }
>>          goto endjob;
>
> My suggested solution would be to fake the job from the data in
> 'job' rather than attempt to wait in this case, e.g.:
>
>     if (ret == 0) {
>         rawInfo.type = job->type;
>         rawInfo.ready = 0;
>     }
>
> and then let it proceed to qemuBlockJobInfoTranslate, which sets
> some fake progress data so that the job looks active.
>
> That way you get a response that there's a job without the potential
> to deadlock.

Fake progress data does not look nice. Maybe we should instead cache
the progress on qemuDomainGetBlockJobInfo calls, so that the reported
progress will not go backwards after the patch?

I can understand protecting against possible bugs when recovery is
not possible or difficult, as in the case of data corruption. But in
this case it is just a matter of a libvirtd restart, I guess.

Nikolay
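
P.S. To make the caching idea more concrete, here is a rough
standalone sketch of what I have in mind. The struct and function
names below are made up purely for illustration (the real libvirt
types such as qemuMonitorBlockJobInfo look different); the point is
only to show the mechanism: remember the last cur/end we reported,
and replay it when we have to fake an "active" reply.

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy stand-ins for the real structures; hypothetical names. */
    typedef struct {
        unsigned long long cur;  /* bytes processed so far */
        unsigned long long end;  /* total bytes to process */
        bool ready;              /* job reached synchronized phase */
    } blockJobInfo;

    typedef struct {
        int type;                /* job type (copy, commit, ...) */
        bool haveCachedProgress; /* have we ever reported progress? */
        blockJobInfo cached;     /* last progress given to a caller */
    } blockJob;

    /* On every successful query remember what we told the caller. */
    static void
    blockJobCacheProgress(blockJob *job, const blockJobInfo *info)
    {
        job->cached = *info;
        job->haveCachedProgress = true;
    }

    /* When qemu no longer knows the job but libvirt still considers
     * it running, build the reply from the cache instead of
     * inventing values, so the progress seen by the client stays
     * monotonic. Returns false if there is nothing to replay. */
    static bool
    blockJobFakeFromCache(const blockJob *job, blockJobInfo *info)
    {
        if (!job->haveCachedProgress)
            return false;
        *info = job->cached;
        info->ready = false; /* can't verify readiness any more */
        return true;
    }

    int
    main(void)
    {
        blockJob job = { .type = 1 };
        blockJobInfo seen = { .cur = 512, .end = 1024, .ready = false };
        blockJobInfo reply;

        blockJobCacheProgress(&job, &seen);      /* normal query */

        if (blockJobFakeFromCache(&job, &reply)) /* qemu: "no job" */
            printf("faked reply: %llu/%llu ready=%d\n",
                   reply.cur, reply.end, (int)reply.ready);
        return 0;
    }

The readiness flag is deliberately cleared in the faked reply, since
once qemu has forgotten the job we have no way to verify it.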