At 07/28/2011 05:41 AM, Eric Blake wrote:
> On 07/07/2011 05:34 PM, Jiri Denemark wrote:
>> This series is also available at
>> https://gitorious.org/~jirka/libvirt/jirka-staging/commits/migration-recovery
>>
>>
>> The series does several things:
>> - persists current job and its phase in status xml
>> - allows safe monitor commands to be run during migration/save/dump jobs
>> - implements recovery when libvirtd is restarted while a job is active
>> - consolidates some code and fixes bugs I found when working in the area
>
> git bisect is pointing to this series as the cause of a regression in
> 'virsh managedsave dom' triggering libvirtd core dumps if some other
> process is actively making queries on domain at the same time
> (virt-manager is a great process for fitting that bill). I'm trying to
> further narrow down which patch introduced the regression, and see if I
> can plug the race (probably a case of not checking whether the monitor
> still exists when getting the condition for an asynchronous job, since
> the whole point of virsh [managed]save is that the domain will go away
> when the save completes, but that it is time-consuming enough that we
> want to query domain state in the meantime).

I can reproduce this bug.

>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff06d0700 (LWP 11419)]
> 0x00000000004b9ad8 in qemuMonitorSend (mon=0x7fffe815c060,
> msg=0x7ffff06cf380)
> at qemu/qemu_monitor.c:801
> 801 while (!mon->msg->finished) {

The reason is that mon->msg is NULL. I added some debug code and found
that we send a monitor command while the previous command has not
finished, and then libvirtd crashes.

After reading the code, I think something is wrong in the function
qemuDomainObjEnterMonitorInternal():

    if (priv->job.active == QEMU_JOB_NONE && priv->job.asyncJob) {
        if (qemuDomainObjBeginNestedJob(driver, obj) < 0)

We can run a query job while an asyncJob is running.
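
To make the quoted condition concrete, here is a small stand-alone C
model of it. This is only a sketch, not libvirt code: the enum, struct
and function names are hypothetical stand-ins, and only the shape of the
test (job.active compared with a "none" value, plus job.asyncJob) is
taken from the snippet above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's job enums (assumption). */
enum model_job { MODEL_JOB_NONE = 0, MODEL_JOB_QUERY };
enum model_async_job { MODEL_ASYNC_JOB_NONE = 0, MODEL_ASYNC_JOB_SAVE };

/* Hypothetical stand-in for the priv->job fields used in the check. */
struct model_job_state {
    enum model_job active;          /* currently running normal job */
    enum model_async_job asyncJob;  /* currently running async job  */
};

/*
 * Mirrors the quoted condition: a nested job is started only when no
 * normal job is active and an async job is running.
 */
bool model_begins_nested_job(const struct model_job_state *job)
{
    return job->active == MODEL_JOB_NONE &&
           job->asyncJob != MODEL_ASYNC_JOB_NONE;
}
```

In this model, a save job with no other job active starts a nested job,
while the same save job with a query job active does not. The second
case is the one discussed in this thread.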
When the async job queries the migration's status at that time,
priv->job.active is not QEMU_JOB_NONE, so the check above fails and we
do not wait for the query job to finish. So we send a monitor command
while the previous command has not finished. It's very dangerous.

When we run an async job, we cannot tell from priv->job.active's value
whether the caller is a nested async job.

I think we should introduce four functions for async nested jobs:
qemuDomainObjAsyncEnterMonitor()
qemuDomainObjAsyncEnterMonitorWithDriver()
qemuDomainObjAsyncExitMonitor()
qemuDomainObjAsyncExitMonitorWithDriver()

The caller of qemuDomainObjEnterMonitorInternal() should pass a bool
value to tell it whether the job is an async nested job.

Thanks
Wen Congyang.

> (gdb) bt
> #0 0x00000000004b9ad8 in qemuMonitorSend (mon=0x7fffe815c060,
> msg=0x7ffff06cf380) at qemu/qemu_monitor.c:801
> #1 0x00000000004c77ae in qemuMonitorJSONCommandWithFd (mon=0x7fffe815c060,
> cmd=0x7fffd8000940, scm_fd=-1, reply=0x7ffff06cf480)
> at qemu/qemu_monitor_json.c:225
> #2 0x00000000004c78e5 in qemuMonitorJSONCommand (mon=0x7fffe815c060,
> cmd=0x7fffd8000940, reply=0x7ffff06cf480) at
> qemu/qemu_monitor_json.c:254
> #3 0x00000000004cc19c in qemuMonitorJSONGetMigrationStatus (
> mon=0x7fffe815c060, status=0x7ffff06cf580, transferred=0x7ffff06cf570,
> remaining=0x7ffff06cf568, total=0x7ffff06cf560)
> at qemu/qemu_monitor_json.c:1920
> #4 0x00000000004bc1b3 in qemuMonitorGetMigrationStatus
> (mon=0x7fffe815c060,
> status=0x7ffff06cf580, transferred=0x7ffff06cf570,
> remaining=0x7ffff06cf568, total=0x7ffff06cf560) at
> qemu/qemu_monitor.c:1532
> #5 0x00000000004b201b in qemuMigrationUpdateJobStatus
> (driver=0x7fffe80089f0,
> vm=0x7fffe8015cd0, job=0x5427b6 "domain save job")
> at qemu/qemu_migration.c:765
> #6 0x00000000004b2383 in qemuMigrationWaitForCompletion (
> driver=0x7fffe80089f0, vm=0x7fffe8015cd0) at qemu/qemu_migration.c:846
> #7 0x00000000004b7806 in qemuMigrationToFile (driver=0x7fffe80089f0,
> vm=0x7fffe8015cd0, fd=27, offset=4096,
> path=0x7fffd8000990 "/var/lib/libvirt/qemu/save/fedora_12.save",
> compressor=0x0, is_reg=true, bypassSecurityDriver=true)
> at qemu/qemu_migration.c:2766
> #8 0x000000000046a90d in qemuDomainSaveInternal (driver=0x7fffe80089f0,
> dom=0x7fffd8000ad0, vm=0x7fffe8015cd0,
> path=0x7fffd8000990 "/var/lib/libvirt/qemu/save/fedora_12.save",
> compressed=0, bypass_cache=false) at qemu/qemu_driver.c:2386
>
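
The idea of passing a bool into the internal enter-monitor function, with
separate "Async" wrappers, could look roughly like the stand-alone C model
below. It is a sketch under stated assumptions, not a patch: all sketch_*
names and the struct are hypothetical, locking is left out, and where real
code would presumably wait for the active job to finish, this model simply
returns -1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical job values; not the driver's real enum. */
enum sketch_job { SKETCH_JOB_NONE = 0, SKETCH_JOB_QUERY, SKETCH_JOB_ASYNC_NESTED };

/* Hypothetical stand-in for the per-domain private data. */
struct sketch_priv {
    enum sketch_job active;  /* currently running normal job      */
    bool async_running;      /* an async job (save, ...) is active */
    int monitor_users;       /* callers currently in the monitor   */
};

/* Start a nested job; fail if another job is active (real code would wait). */
int sketch_begin_nested_job(struct sketch_priv *priv)
{
    if (priv->active != SKETCH_JOB_NONE)
        return -1;
    priv->active = SKETCH_JOB_ASYNC_NESTED;
    return 0;
}

/* The caller states explicitly whether it is an async nested job. */
int sketch_enter_monitor_internal(struct sketch_priv *priv, bool async_nested)
{
    if (async_nested) {
        assert(priv->async_running);  /* only valid inside an async job */
        if (sketch_begin_nested_job(priv) < 0)
            return -1;
    }
    priv->monitor_users++;
    return 0;
}

void sketch_exit_monitor_internal(struct sketch_priv *priv, bool async_nested)
{
    priv->monitor_users--;
    if (async_nested)
        priv->active = SKETCH_JOB_NONE;  /* end the nested job */
}

/* Wrappers for normal jobs. */
int sketch_enter_monitor(struct sketch_priv *priv)
{
    return sketch_enter_monitor_internal(priv, false);
}

void sketch_exit_monitor(struct sketch_priv *priv)
{
    sketch_exit_monitor_internal(priv, false);
}

/* Wrappers for code running inside an async job. */
int sketch_async_enter_monitor(struct sketch_priv *priv)
{
    return sketch_enter_monitor_internal(priv, true);
}

void sketch_async_exit_monitor(struct sketch_priv *priv)
{
    sketch_exit_monitor_internal(priv, true);
}
```

With this shape the decision no longer depends on the value of the active
job at the moment of the call. The async job always goes through the
nested-job path, so it cannot enter the monitor while a query job is still
using it.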