On Thu, Sep 01, 2016 at 14:46:01 +0800, Xiubo Li wrote:
> When doing the snapshot using the script below:
> =========================================
> #!/bin/bash
> virsh blockjob $instance_name vdX --abort
> virsh undefine $instance_name
> qemu-img create -f qcow2 -o backing_file=$diskfile,size=$size $path/$uuid.dlta
> virsh blockcopy --domain $instance_name vdX $path/$uuid.dlta --wait --reuse-external --verbose
> virsh blockjob $instance_name vdX --abort
> qemu-img convert -f qcow2 -O qcow2 $path/$uuid.dlta $path/$uuid.qcow2
> echo "start uploading at $(date)"
> ......
> =========================================
>
> Sometimes you can encounter the following warning and error logs:
> +++++++++++++++++++++++++++++++++++++++++
> 2016-08-26 07:51:05.685+0000: 8916: warning : qemuDomainObjBeginJobInternal:1571 :
> Cannot start job (query, none) for domain instance-00001b3a; current job is (modify,
> none) owned by (8914 remoteDispatchDomainBlockJobAbort, 0 <null>) for (30s, 0s)
>
> 2016-08-26 07:51:05.685+0000: 8916: error : qemuDomainObjBeginJobInternal:1583 :
> Timed out during operation: cannot acquire state change lock (held by
> remoteDispatchDomainBlockJobAbort)
> -----------------------------------------
>
> Mostly it will be okay later. But sometimes the state change lock may be held by
> remoteDispatchDomainBlockJobAbort forever, and then the instance cannot be reached
> through VNC or the network (ping, ssh, ...) except by rebooting the instance.
>
> From the code, we find that after sending the --abort command, there are two
> steps, each condition-waiting for one of two reply signals:
> +++++++++++++++++++++++++++++++++++++++++
> The first condition wait is:
> In qemuMonitorBlockJobCancel() --> qemuMonitorJSONBlockJobCancel() -->
> qemuMonitorJSONCommand() --> qemuMonitorJSONCommandWithFd() -->
> qemuMonitorSend() --> virCondWait(&mon->notify, &mon->parent.lock).
>
> The second condition wait is:
> In virDomainObjWait(vm) --> virCondWait(&vm->cond, &vm->parent.lock).
> -----------------------------------------
>
> The two corresponding replies are:
> +++++++++++++++++++++++++++++++++++++++++
> The "return" reply is:
> QEMU_MONITOR_RECV_REPLY: mon=0x7ff3fc001020 reply={"return": [], "id": "libvirt-338"}
> With the condition wakeup signal:
> virCondBroadcast(&mon->notify)
>
> The "event" reply is:
> QEMU_MONITOR_RECV_EVENT: mon=0x7fe08401a3f0 event={"timestamp": {"seconds": 1472524518,
> "microseconds": 360135}, "event": "BLOCK_JOB_CANCELLED", "data": {"device":
> "drive-virtio-disk0", "len": 42949672960, "offset": 10485760, "speed": 0, "type": "mirror"}}
> With the condition wakeup signal:
> virDomainObjBroadcast(vm) --> virCondBroadcast(&vm->cond)
> -----------------------------------------
>
> But sometimes the qemu daemon will reply in the opposite order:
> +++++++++++++++++++++++++++++++++++++++++
> The "event" reply is:
> QEMU_MONITOR_RECV_EVENT: mon=0x7fe08401a3f0 event={"timestamp": {"seconds": 1472524518,
> "microseconds": 360135}, "event": "BLOCK_JOB_CANCELLED", "data": {"device":
> "drive-virtio-disk0", "len": 42949672960, "offset": 10485760, "speed": 0, "type": "mirror"}}
> With the condition wakeup signal:
> virDomainObjBroadcast(vm) --> virCondBroadcast(&vm->cond)
>
> The "return" reply is:
> QEMU_MONITOR_RECV_REPLY: mon=0x7ff3fc001020 reply={"return": [], "id": "libvirt-338"}
> With the condition wakeup signal:
> virCondBroadcast(&mon->notify)
> -----------------------------------------
>
> In this case, while we are blocked in the first condition wait, the received "event"
> reply and its wakeup signal are lost, since we are still waiting for the "return"
> reply and signal. So when we reach the second condition wait, it waits forever while
> holding the lock.

Nice analysis.

...
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index f9a3b15..6ebaf6b 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -15929,6 +15929,7 @@ qemuDomainBlockJobAbort(virDomainPtr dom,
>      bool pivot = !!(flags & VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT);
>      bool async = !!(flags & VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC);
>      virDomainObjPtr vm;
> +    unsigned long long now;
>      int ret = -1;
>
>      virCheckFlags(VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC |
> @@ -16018,7 +16019,12 @@ qemuDomainBlockJobAbort(virDomainPtr dom,
>          qemuDomainDiskPrivatePtr diskPriv = QEMU_DOMAIN_DISK_PRIVATE(disk);
>          qemuBlockJobUpdate(driver, vm, disk);
>          while (diskPriv->blockjob) {
> -            if (virDomainObjWait(vm) < 0) {
> +            if (virTimeMillisNow(&now) < 0) {
> +                ret = -1;
> +                goto endjob;
> +            }
> +
> +            if (virDomainObjWaitUntil(vm, now + BLOCK_JOB_ABORT_TIMEOUT) < 0) {
>                  ret = -1;
>                  goto endjob;
>              }

However, the patch is wrong. We should simply never attempt to wait for the
event if we already got it. In other words, diskPriv->blockjob should already
be false at this point.

Jirka

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list