On 12/09/13 11:56, Daniel P. Berrange wrote: > On Sat, Dec 07, 2013 at 02:03:00AM -0500, Adam Walters wrote: >> This patch works around a race condition present during >> libvirt startup. The race condition only applies when >> using storage pool volumes for domain disks, and even >> then, only when restarting libvirt with running domains. >> >> The gist of the patch is simply to enter a (limited) >> retry loop during qemuTranslateDiskSourcePool. This >> particular implementation does have a slight drawback, >> though. Within that function, I can not determine if >> we are currently starting libvirt, or if we are just >> starting up a domain. In the latter case, this could >> cause a 800ms delay in reporting an error that the >> storage pool is inactive. >> >> I am happy to report, however, that with this patch, >> domains continue to run without restarts regardless >> of how often I restart libvirt. I have a fairly fast >> hypervisor, and have never seen it try more than one >> iteration of the retry loop unless I purposely set >> one of the storage pools to be inactive. >> --- >> src/qemu/qemu_conf.c | 11 +++++++++++ >> 1 file changed, 11 insertions(+) >> >> diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c >> index c28908a..2e52fbf 100644 >> --- a/src/qemu/qemu_conf.c >> +++ b/src/qemu/qemu_conf.c >> @@ -1359,6 +1359,8 @@ qemuTranslateDiskSourcePool(virConnectPtr conn, >> virStorageVolInfo info; >> int ret = -1; >> virErrorPtr savedError = NULL; >> + int attempt = 0; >> + int poolAutostart; >> >> if (def->type != VIR_DOMAIN_DISK_TYPE_VOLUME) >> return 0; >> @@ -1369,7 +1371,16 @@ qemuTranslateDiskSourcePool(virConnectPtr conn, >> if (!(pool = virStoragePoolLookupByName(conn, def->srcpool->pool))) >> return -1; >> >> +retry: >> if (virStoragePoolIsActive(pool) != 1) { >> + if (!(virStoragePoolGetAutostart(pool, &poolAutostart) < 0)) >> + if (poolAutostart && attempt < 4) { >> + VIR_DEBUG("Waiting for storage pool '%s' to activate", >> + def->srcpool->pool); >> + usleep(200*1000); /* sleep 200ms */ >> + attempt++; >> + goto retry; >> + } >> virReportError(VIR_ERR_CONFIG_UNSUPPORTED, >> _("storage pool '%s' containing volume '%s' " >> "is not active"), > > This is rather dubious and points toa bug elsewhere. The storage driver > should complete its autostart before the QEMU driver even starts its > own initialization. We definitely need to make sure that storage is available at that point. This particular regression was introduced by my commit e1a4d08baf9a although was latently present for a few releases now as the volume code isn't used that much probably. Prior to my patch that added the check whether the pool is available we were blindly assuming that the pool was online. Pool drivers like gluster don't have their internal structures initialized if the pool isn't started and thus the translation would fail either way. Also the translation function is called separately from the reconnection code, thus we can pass different arguments to it so we don't spoil the normal code paths with unnecessary delays and other stuff. The question is what to do with domains that have storage on pools that can't be initialized. Should we kill those? Should we skip translation of the source and then something later may fail? Peter > > Daniel >
Attachment:
signature.asc
Description: OpenPGP digital signature
-- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list