When taking a snapshot with the script below:

=========================================
#!/bin/bash

virsh blockjob $instance_name vdX --abort
virsh undefine $instance_name
qemu-img create -f qcow2 -o backing_file=$diskfile,size=$size $path/$uuid.dlta
virsh blockcopy --domain $instance_name vdX $path/$uuid.dlta --wait --reuse-external --verbose
virsh blockjob $instance_name vdX --abort
qemu-img convert -f qcow2 -O qcow2 $path/$uuid.dlta $path/$uuid.qcow2
echo "start uploading at $(date)"
......
=========================================

you can sometimes encounter the following warning and error logs:

+++++++++++++++++++++++++++++++++++++++++
2016-08-26 07:51:05.685+0000: 8916: warning : qemuDomainObjBeginJobInternal:1571 : Cannot start job (query, none) for domain instance-00001b3a; current job is (modify, none) owned by (8914 remoteDispatchDomainBlockJobAbort, 0 <null>) for (30s, 0s)
2016-08-26 07:51:05.685+0000: 8916: error : qemuDomainObjBeginJobInternal:1583 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainBlockJobAbort)
-----------------------------------------

Usually this resolves itself shortly afterwards. But sometimes the state
change lock is held by remoteDispatchDomainBlockJobAbort forever, and the
instance can no longer be reached through VNC or the network (ping,
ssh, ...) until it is rebooted.

From the code, we find that after the --abort command is sent, there are
two steps that each block in a condition wait, expecting its own reply
signal:

+++++++++++++++++++++++++++++++++++++++++
The first condition wait is in:
qemuMonitorBlockJobCancel()
  --> qemuMonitorJSONBlockJobCancel()
  --> qemuMonitorJSONCommand()
  --> qemuMonitorJSONCommandWithFd()
  --> qemuMonitorSend()
  --> virCondWait(&mon->notify, &mon->parent.lock)

The second condition wait is in:
virDomainCondWait(vm)
  --> virCondWait(&vm->cond, &vm->parent.lock)
-----------------------------------------
The two corresponding replies are:

+++++++++++++++++++++++++++++++++++++++++
The "return" reply is:
QEMU_MONITOR_RECV_REPLY: mon=0x7ff3fc001020 reply={"return": [], "id": "libvirt-338"}
with the condition wakeup signal:
virCondBroadcast(&mon->notify)

The "event" reply is:
QEMU_MONITOR_RECV_EVENT: mon=0x7fe08401a3f0 event={"timestamp": {"seconds": 1472524518, "microseconds": 360135}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive-virtio-disk0", "len": 42949672960, "offset": 10485760, "speed": 0, "type": "mirror"}}
with the condition wakeup signal:
virDomainObjBroadcast(vm) --> virCondBroadcast(&vm->cond)
-----------------------------------------

But sometimes the qemu daemon replies in the opposite order:

+++++++++++++++++++++++++++++++++++++++++
The "event" reply arrives first:
QEMU_MONITOR_RECV_EVENT: mon=0x7fe08401a3f0 event={"timestamp": {"seconds": 1472524518, "microseconds": 360135}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive-virtio-disk0", "len": 42949672960, "offset": 10485760, "speed": 0, "type": "mirror"}}
with the condition wakeup signal:
virDomainObjBroadcast(vm) --> virCondBroadcast(&vm->cond)

The "return" reply arrives second:
QEMU_MONITOR_RECV_REPLY: mon=0x7ff3fc001020 reply={"return": [], "id": "libvirt-338"}
with the condition wakeup signal:
virCondBroadcast(&mon->notify)
-----------------------------------------

In this case, while the caller is still blocked in the first condition
wait (for the "return" reply), the "event" reply and its wakeup signal
are delivered and lost. When the caller subsequently enters the second
condition wait, the signal it is waiting for has already fired, so it
waits forever while holding the state change lock.

This patch fixes this bug.
Signed-off-by: Xiubo Li <lixiubo@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Zhuoyu Zhang <zhangzhuoyu@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Wei Tang <tangwei@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Yaowei Bai <baiyaowei@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Qide Chen <chenqide@xxxxxxxxxxxxxxxxxxxx>
---
 src/qemu/qemu_domain.h | 2 ++
 src/qemu/qemu_driver.c | 8 +++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index c49f31c..2624c6d 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -61,6 +61,8 @@
     (JOB_MASK(QEMU_JOB_DESTROY) | \
      JOB_MASK(QEMU_JOB_ASYNC))
 
+# define BLOCK_JOB_ABORT_TIMEOUT 5000
+
 /* Only 1 job is allowed at any time
  * A job includes *all* monitor commands, even those just querying
  * information, not merely actions */
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index f9a3b15..6ebaf6b 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -15929,6 +15929,7 @@ qemuDomainBlockJobAbort(virDomainPtr dom,
     bool pivot = !!(flags & VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT);
     bool async = !!(flags & VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC);
     virDomainObjPtr vm;
+    unsigned long long now;
     int ret = -1;
 
     virCheckFlags(VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC |
@@ -16018,7 +16019,12 @@ qemuDomainBlockJobAbort(virDomainPtr dom,
     qemuDomainDiskPrivatePtr diskPriv = QEMU_DOMAIN_DISK_PRIVATE(disk);
     qemuBlockJobUpdate(driver, vm, disk);
     while (diskPriv->blockjob) {
-        if (virDomainObjWait(vm) < 0) {
+        if (virTimeMillisNow(&now) < 0) {
+            ret = -1;
+            goto endjob;
+        }
+
+        if (virDomainObjWaitUntil(vm, now + BLOCK_JOB_ABORT_TIMEOUT) < 0) {
             ret = -1;
             goto endjob;
         }
-- 
1.8.3.1

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list