On 11/04/2015 08:29 AM, Eric Blake wrote:
> [edited to reduce size to get past moderation]
>
> -------- Forwarded Message --------
> Date: Wed, 4 Nov 2015 15:57:26 +0100
> From: max ustermann <ustermann78@xxxxxx>
> To: libvirt-users@xxxxxxxxxx
> Subject: xen/libvirt freeze while attching network-device to vm´s
> Message-ID: <20151104155726.72fd6109@linux-3go9>
>
> Hi,
>
> first, sorry for the cross-posting, but I don't know exactly whether my
> problem belongs to xen or to libvirt.

AFAICT, this was only posted to libvirt-users, but I'll respond to Eric's
forwarded message since the issue is more suited for the libvirt dev list.
First, thanks for the report and detailed info.

> I have a VM host with Xen 4.6.0 and libvirt 1.2.19. The job of the host
> is to resume 20 VMs (HVM WinXP) in parallel, attach a block device to
> each VM, wait 60 seconds, kill the VM, and start from the beginning.
> For that, a python script creates a thread for each VM; each thread
> gets a connection to the libvirt daemon via "libvirt.open(None)" and
> does the resume, attach and so on in a loop.
>
> This setup works fine; it does several thousand cycles without any
> problems. If I extend the setup so that each VM also gets, in addition
> to the block device, a network interface after resume, via an
> "attachDevice" call from the python script, I observe that libvirtd/xen
> freeze after 20 to 100 cycles. There are no error messages, no crash,
> nothing in "dmesg" or "xl dmesg".
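To restate the workload, each worker thread does roughly the following
cycle. This is only an illustrative sketch in terms of the libvirt C API;
the actual reproducer is a Python script using libvirt.open(None), the
domain name and device XML are placeholders, and exactly how the VMs are
resumed is not specified in the report, so virDomainResume() merely stands
in for that step.

/* Illustrative only - rough C-API equivalent of one worker's cycle. */
#include <libvirt/libvirt.h>
#include <unistd.h>

static void
one_cycle(virConnectPtr conn, const char *domname,
          const char *diskxml, const char *netxml)
{
    virDomainPtr dom = virDomainLookupByName(conn, domname);
    if (!dom)
        return;

    virDomainResume(dom);                /* resume the HVM guest */
    virDomainAttachDevice(dom, diskxml); /* hot-plug the block device */
    virDomainAttachDevice(dom, netxml);  /* hot-plug the NIC; this is the
                                          * call that ends up hanging */
    sleep(60);
    virDomainDestroy(dom);               /* "kill" the VM */
    virDomainFree(dom);
}

With 20 such threads running in parallel, each on its own connection, the
hang shows up after 20 to 100 cycles, as described above.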
> address=\"00:27:0e:0c:01:09\" />\n </interface>", flags=1) at > libxl/libxl_driver.c:3707 #16 0x00007f78fe404239 in > libxlDomainAttachDevice (dom=0x7f791c009150, xml=0x7f791c009230 > "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n > <model type=\"rtl8139\" />\n <mac > address=\"00:27:0e:0c:01:09\" />\n </interface>") at > libxl/libxl_driver.c:3745 #17 0x00007f793b18422c in > virDomainAttachDevice (domain=0x7f791c009150, xml=0x7f791c009230 > "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n > <model type=\"rtl8139\" />\n <mac > address=\"00:27:0e:0c:01:09\" />\n </interface>") at > libvirt-domain.c:8351 #18 0x00007f793be11a1f in > remoteDispatchDomainAttachDevice (server=0x7f793c2e0a90, > client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, > args=0x7f791c009190) at remote_dispatch.h:2850 #19 0x00007f793be11923 > in remoteDispatchDomainAttachDeviceHelper (server=0x7f793c2e0a90, > client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, > args=0x7f791c009190, ret=0x7f791c009130) at remote_dispatch.h:2826 #20 > 0x00007f793b1fccfc in virNetServerProgramDispatchCall > (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, > msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:437 #21 > 0x00007f793b1fc863 in virNetServerProgramDispatch (prog=0x7f793c2eefd0, > server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at > rpc/virnetserverprogram.c:307 #22 0x00007f793b1f5e6d in > virNetServerProcessMsg (srv=0x7f793c2e0a90, client=0x7f793c2f34c0, > prog=0x7f793c2eefd0, msg=0x7f793c2feca0) at rpc/virnetserver.c:135 #23 > 0x00007f793b1f5f3a in virNetServerHandleJob (jobOpaque=0x7f793c302c60, > opaque=0x7f793c2e0a90) at rpc/virnetserver.c:156 #24 0x00007f793b09d9f7 > in virThreadPoolWorker (opaque=0x7f793c2e0750) at > util/virthreadpool.c:145 #25 0x00007f793b09d00a in virThreadHelper > (data=0x7f793c2eced0) at util/virthread.c:206 #26 0x00007f79375c1ee5 in > start_thread (arg=0x7f792a92c700) at pthread_create.c:309 #27 > 0x00007f79372f0d1d in clone () > at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 As you described below, the virDomainObj is locked in frame#15 in libxlDomainAttachDeviceFlags(). In frame#12, libxl_device_nic_add() is called, which may also process pending events by calling a registered 'event_occurs' handler. In this case it's libxlDomainEventHandler() in frame#8. I suspect the event is 'shutdown', caused by one of your other threads killing off a domain. Actually, it has to be a shutdown event since libxlDomainEventHandler() will simply free the event and return when it is not shutdown. libxlDomainEventHandler() calls virDomainObjListFindByID() to retrieve the virDomainObj associated with the domain ID provided by libxl. Searching the domain obj list results in locking each obj before checking if it is active and its ID equals the requested ID. And here is where we deadlock. One solution that comes to mind is processing events in a thread using libxl_event_wait(), instead of registering an event handler with libxl. Events could then be processed asynchronously in libvirt, avoiding the deadlock. I thought about this in the past, when changing the libxl driver to use a single libxl_ctx, but was deterred by the following comment about libxl_event_wait() in libxl_event.h /* Like libxl_event_check but blocks if no suitable events are * available, until some are. Uses libxl_osevent_beforepoll/ * _afterpoll so may be inefficient if very many domains are being * handled by a single program. 
Ian, you've become a bit familiar with libvirt's use of libxl over the
years. Would you recommend using libxl_event_wait() instead of the current
event handler registration approach?

BTW, I think this deadlock is a specific instance of a more general problem
in the libvirt libxl driver. It seems any function that holds the
virDomainObj lock and calls a libxl_* function that also processes pending
events can potentially deadlock.

Regards,
Jim

> Here I select frame 3 and look at who holds the lock:
> (gdb) frame 3
> #3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
> 89          pthread_mutex_lock(&m->lock);
> $6 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312, __nusers = 1, __kind = 512, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000 \005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>, __align = 2}}
>
> So I see that thread 21 waits for a lock that is held by thread 21.
>
> From frame 4 I see that the process wants a lock on a vm object:
> #4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323
>
> Further down in the bt I see in frame 14:
> #14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231
> that vm object 0x7f79140099c0 is the object to which the network device
> should be attached. And from that I know that there is a lock on the vm
> object.
>
> So I go up in the bt and see that libxlDomainAttachDeviceLive (libvirtd,
> frame 14) calls libxlDomainAttachNetDevice (libvirtd, frame 13), which
> calls libxl_device_nic_add (libxl/xen, frame 12).
>
> From "info frame" for frame 12 I see that it calls frame 0x7f792a92b5f0:
> (gdb) info f
> Stack level 12, frame at 0x7f792a92b670:
>  rip = 0x7f7938048fa6 in libxl_device_nic_add (libxl.c:4226); saved rip = 0x7f78fe402baa
>  called by frame at 0x7f792a92b760, caller of frame at 0x7f792a92b5f0
>  source language c.
>  Arglist at 0x7f792a92b5e8, args: ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>
>  Locals at 0x7f792a92b5e8, Previous frame's sp is 0x7f792a92b670
>  Saved registers:
>   rbx at 0x7f792a92b640, rbp at 0x7f792a92b648, r12 at 0x7f792a92b650, r13 at 0x7f792a92b658, r14 at 0x7f792a92b660, rip at 0x7f792a92b668
>
> And this is frame 11:
> #11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014
>
> So further I see that libxl__ao_inprogress calls libxl__egc_cleanup, and
> that calls egc_run_callbacks, frame 9. "info frame" for frame 9 gives:
> (gdb) info f
> Stack level 9, frame at 0x7f792a92b540:
>  rip = 0x7f793806f7c0 in egc_run_callbacks (libxl_event.c:1405); saved rip = 0x7f7938070f8b
>  called by frame at 0x7f792a92b5f0, caller of frame at 0x7f792a92b500
>  source language c.
>  Arglist at 0x7f792a92b4f8, args: egc=egc@entry=0x7f792a92b580
>  Locals at 0x7f792a92b4f8, Previous frame's sp is 0x7f792a92b540
>  Saved registers:
>   rbx at 0x7f792a92b520, rbp at 0x7f792a92b528, r12 at 0x7f792a92b530, rip at 0x7f792a92b538
>
> And "list" gives:
> (gdb) list
> 1400
> 1401        LIBXL_TAILQ_FOREACH_SAFE(ev, &egc->occurred_for_callback, link, ev_tmp) {
> 1402            LIBXL_TAILQ_REMOVE(&egc->occurred_for_callback, ev, link);
> 1403            LOG(DEBUG,"event %p callback type=%s",
> 1404                ev, libxl_event_type_to_string(ev->type));
> 1405            CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);
> 1406        }
> 1407
> 1408        LIBXL_TAILQ_FOREACH_SAFE(aop, &egc->aops_for_callback, entry, aop_tmp) {
> 1409            LIBXL_TAILQ_REMOVE(&egc->aops_for_callback, aop, entry);
>
> From here I see that egc_run_callbacks is at line 1405,
> "CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);", but I have
> no idea what it is doing there and why it "hangs" here!?!
>
> If I see it correctly, the system freezes/hangs because it waits for the
> release of the lock on the vm that is currently having a network device
> attached, and the process of attaching the network device didn't finish
> and still holds the lock. As I said at the beginning of my mail, if I
> only attach a block device to the VMs everything runs, and with the
> additional attaching of a network device it freezes after 20 to 100
> resumes.
>
> I am grateful for any help, hints and tips for further debugging.
>
> all the best
> max
>
> For completeness, here is the output of "(gdb) thread apply all bt full":
>
> [trimmed by moderator]

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list