On Thu, Nov 28, 2019 at 07:29:08 +0000, Nikolay Shirokovskiy wrote:
>
>
> On 27.11.2019 17:56, Peter Krempa wrote:
> > On Wed, Nov 27, 2019 at 17:19:18 +0300, Nikolay Shirokovskiy wrote:
> >> Due to race qemuDomainGetBlockJobInfo can return there is
> >> no block job for disk but later call to spawn new blockjob
> >> can fail because libvirt internally still not process blockjob
> >> finishing. Thus let's wait for blockjob finishing if we
> >> report there is no more blockjob.
> >
> > Could you please elaborate how this happened? e.g. provide some logs?
> >
> > Polling with qemuDomainGetBlockJobInfo will always be racy and if the
> > problem is that diskPriv->blockjob is allocated but qemu didn't report
> > anything I suspect the problem is somewhere else.
> >
> >
> >>
> >> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx>
> >> ---
> >>  src/qemu/qemu_driver.c | 18 +++++++++++++++++-
> >>  1 file changed, 17 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> >> index 669c12d6ca..b148df3a57 100644
> >> --- a/src/qemu/qemu_driver.c
> >> +++ b/src/qemu/qemu_driver.c
> >> @@ -17785,12 +17785,26 @@ qemuDomainGetBlockJobInfo(virDomainPtr dom,
> >>          goto endjob;
> >>      }
> >>
> >> +    qemuBlockJobSyncBegin(job);
> >> +
> >>      qemuDomainObjEnterMonitor(driver, vm);
> >>      ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name, &rawInfo);
> >>      if (qemuDomainObjExitMonitor(driver, vm) < 0)
> >>          ret = -1;
> >> -    if (ret <= 0)
> >> +    if (ret < 0)
> >> +        goto endjob;
> >> +
> >> +    if (ret == 0) {
> >
> > So if qemu returns that there is no blockjob ...
> >
> >> +        qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
> >> +        while (qemuBlockJobIsRunning(job)) {
> >
> > but we think it's still running it's either that the event arrived but
> > wasn't processed yet. In that case this would help, but the outcome
> > would not differ much from the scenario if the code isn't here as the
> > event would be processed later.
> >
> >
> >> +            if (virDomainObjWait(vm) < 0) {
> >> +                ret = -1;
> >> +                goto endjob;
> >> +            }
> >> +            qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
> >
> > If there's a bug and qemu doesn't know that there's a job but libvirt
> > thinks there's one but qemu is right, this will lock up here as the
> > event will never arrive.
> >
> >> +        }
> >>          goto endjob;
> >
> > My suggested solution would be to fake the job from the data in 'job'
> > rather than attempt to wait in this case e.g.
> >
> > if (ret == 0) {
> >     rawStats.type = job->type;
> >     rawStats.ready = 0;
> > }
> >
> > and then make it to qemuBlockJobInfoTranslate which sets some fake
> > progress data so that the job looks active.
> >
> > That way you get a response that there's a job without the potential to
> > deadlock.
> >
>
> Fake progress data does not look nice. May be we should cache progress on qemuDomainGetBlockJobInfo
> calls so it will not go backwards after the patch?

I agree but it's always a tradeoff between complexity of the code and
benefit of it. If it's a very unlikely scenario, returning fake data is
simpler than adding caching and persistence over restarts.

> I can understand protecting against possible bugs when recovering is not possible or difficult
> like case of data corruption. But in this case it is matter of libvirtd restart I guess.

Well, that's why I'm asking what the original bug is. Maybe the problem
lies somewhere else and there's a more elegant fix.
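For illustration, a minimal sketch of how that fallback could slot into
qemuDomainGetBlockJobInfo. This is only a sketch built from the quoted diff
and the snippet above: it reuses the diff's 'rawInfo' variable (the snippet
above says 'rawStats'), and the 'ret = 1' step is an assumption about how
the caller would be told a job exists, not something verified against the
actual libvirt sources.

    ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name, &rawInfo);
    if (qemuDomainObjExitMonitor(driver, vm) < 0)
        ret = -1;
    if (ret < 0)
        goto endjob;

    if (ret == 0) {
        /* qemu no longer reports the job, but the completion event has not
         * been processed by libvirt yet: synthesize the reply from the
         * libvirt-side job record instead of waiting for the event, so this
         * API cannot block on an event that might never arrive. */
        rawInfo.type = job->type;
        rawInfo.ready = 0;
        ret = 1;   /* assumed: report to the caller that one job was found */
    }

    /* qemuBlockJobInfoTranslate() would then fill in placeholder progress
     * values so the job still looks active to the caller, as described
     * above. */

The trade-off stays the same as discussed: the progress numbers are
placeholders, but the API can no longer deadlock and no caching of progress
across libvirtd restarts is needed.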