[edited to reduce size to get past moderation] -------- Forwarded Message -------- Date: Wed, 4 Nov 2015 15:57:26 +0100 From: max ustermann <ustermann78@xxxxxx> To: libvirt-users@xxxxxxxxxx Subject: xen/libvirt freeze while attching network-device to vm´s Message-ID: <20151104155726.72fd6109@linux-3go9> Hi, first sorry for the cross-posting, but i don Ìt known exactly if my problem belongs to xen or libvirt. I have an VM-Host with Xen 4.6.0 und libvirt 1.2.19. The Job of the Host is to resume in parallel 20 VM Ìs (HVM WinXp) attach a block-device to each VM , wait 60 Seconds, kill the VM and start from beginning. For that, a python script create for each vm a thread and each thread get a connection to the Libvirt-Daemon via "libvirt.open(None)" and do the resume, attach and so on in a loop. This Setup works fine, it do several thousand cycle without any problems. If i complement this setup, so that each vm get , additonal to the blockdevice an network-interface after resume, via "attachDevice" call from the python-script, i observe that libvirtd/xen freeze after 20 to 100 cycle. Their are no error-messages, no crash, noting in "dmesg" or "xl dmesg". When i attached to the libvirtd via gdb and do a "info thread", i get the following: Id Target Id Frame 21 Thread 0x7f792a92c700 (LWP 1312) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 20 Thread 0x7f792a12b700 (LWP 1313) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 19 Thread 0x7f792992a700 (LWP 1314) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 18 Thread 0x7f7929129700 (LWP 1315) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 17 Thread 0x7f7928928700 (LWP 1316) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 16 Thread 0x7f791bfff700 (LWP 1317) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 15 Thread 0x7f791b7fe700 (LWP 1318) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 14 Thread 0x7f791affd700 (LWP 1319) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 13 Thread 0x7f791a7fc700 (LWP 1320) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 12 Thread 0x7f7919ffb700 (LWP 1321) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 11 Thread 0x7f78fd6ff700 (LWP 1334) "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 10 Thread 0x7f78fcefe700 (LWP 1335) "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 9 Thread 0x7f78f3fff700 (LWP 1336) "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 8 Thread 0x7f78ebfff700 (LWP 1337) "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 7 Thread 0x7f78f37fe700 (LWP 1338) "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 6 Thread 0x7f793bf94700 (LWP 1494) "libvirtd" 0x00007f79375c87fd in read () at ../sysdeps/unix/syscall-template.S:81 5 Thread 0x7f78f241d700 (LWP 17709) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 4 Thread 0x7f78f1c1c700 (LWP 17717) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 3 Thread 0x7f78f141b700 (LWP 17722) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 2 Thread 0x7f78f0c1a700 (LWP 17751) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 * 1 Thread 0x7f793c0428c0 (LWP 1269) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 i start to examine at thread 1 and get the fowlling bt: #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0 #2 0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f78d80edd30) at ../nptl/pthread_mutex_lock.c:79 #3 0x00007f793b09cd0c in virMutexLock (m=0x7f78d80edd30) at util/virthread.c:89 #4 0x00007f793b07d831 in virObjectLock (anyobj=0x7f78d80edd20) at util/virobject.c:323 #5 0x00007f793b0b71d7 in virDomainObjListFindByID (doms=0x7f78d80edd20, id=1) at conf/domain_conf.c:1237 #6 0x00007f78fe3f7eb3 in libxlDomainEventHandler (data=0x7f78d81ce940, event=0x7f793c30c2e0) at libxl/libxl_domain.c:534 #7 0x00007f793806f7c0 in egc_run_callbacks (egc=egc@entry=0x7ffdb79c7490) at libxl_event.c:1405 #8 0x00007f793806fb9d in libxl__egc_cleanup (egc=0x7ffdb79c7490) at libxl_event.c:1436 #9 libxl_osevent_occurred_fd (ctx=<optimized out>, for_libxl=0x7f78d80ec220, fd=27, events_ign=<optimized out>, revents_ign=<optimized out>) at libxl_event.c:1339 #10 0x00007f78fe3f9ac5 in libxlFDEventCallback (watch=11, fd=27, vir_events=1, fd_info=0x7f78d811c430) at libxl/libxl_driver.c:134 #11 0x00007f793b04a9c4 in virEventPollDispatchHandles (nfds=31, fds=0x7f793c30b510) at util/vireventpoll.c:509 #12 0x00007f793b04b224 in virEventPollRunOnce () at util/vireventpoll.c:658 #13 0x00007f793b0490cd in virEventRunDefaultImpl () at util/virevent.c:308 #14 0x00007f793b1f5ab6 in virNetDaemonRun (dmn=0x7f793c2e1b80) at rpc/virnetdaemon.c:701 #15 0x00007f793be0bc27 in main (argc=1, argv=0x7ffdb79c7d78) at libvirtd.c:1577 i select frame 3 and look who holds the lock: (gdb) frame 3 #3 0x00007f793b09cd0c in virMutexLock (m=0x7f78d80edd30) at util/virthread.c:89 89 pthread_mutex_lock(&m->lock); (gdb) p *m $5 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312, __nusers = 1, __kind = 512, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000 \005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>, __align = 2}} i switch to frame 21 (LWP1312) i taek a look to the bt: (gdb) bt #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0 #2 0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f79140099d0) at ../nptl/pthread_mutex_lock.c:79 #3 0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89 #4 0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323 #5 0x00007f793b0b717f in virDomainObjListSearchID (payload=0x7f79140099c0, name=0x7f791400f7c0, data=0x7f792a92b464) at conf/domain_conf.c:1225 #6 0x00007f793b055f1e in virHashSearch (ctable=0x7f78d81ccc40, iter=0x7f793b0b7148 <virDomainObjListSearchID>, data=0x7f792a92b464) at util/virhash.c:724 #7 0x00007f793b0b71f2 in virDomainObjListFindByID (doms=0x7f78d80edd20, id=2) at conf/domain_conf.c:1238 #8 0x00007f78fe3f7eb3 in libxlDomainEventHandler (data=0x7f78d81ce940, event=0x7f791c005610) at libxl/libxl_domain.c:534 #9 0x00007f793806f7c0 in egc_run_callbacks (egc=egc@entry=0x7f792a92b580) at libxl_event.c:1405 #10 0x00007f7938070f8b in libxl__egc_cleanup (egc=0x7f792a92b580) at libxl_event.c:1436 #11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014 #12 0x00007f7938048fa6 in libxl_device_nic_add (ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>) at libxl.c:4226 #13 0x00007f78fe402baa in libxlDomainAttachNetDevice (driver=0x7f78d81ce940, vm=0x7f79140099c0, net=0x7f791c003610) at libxl/libxl_driver.c:3199 #14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231 #15 0x00007f78fe404135 in libxlDomainAttachDeviceFlags (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>", flags=1) at libxl/libxl_driver.c:3707 #16 0x00007f78fe404239 in libxlDomainAttachDevice (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>") at libxl/libxl_driver.c:3745 #17 0x00007f793b18422c in virDomainAttachDevice (domain=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>") at libvirt-domain.c:8351 #18 0x00007f793be11a1f in remoteDispatchDomainAttachDevice (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190) at remote_dispatch.h:2850 #19 0x00007f793be11923 in remoteDispatchDomainAttachDeviceHelper (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190, ret=0x7f791c009130) at remote_dispatch.h:2826 #20 0x00007f793b1fccfc in virNetServerProgramDispatchCall (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:437 #21 0x00007f793b1fc863 in virNetServerProgramDispatch (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:307 #22 0x00007f793b1f5e6d in virNetServerProcessMsg (srv=0x7f793c2e0a90, client=0x7f793c2f34c0, prog=0x7f793c2eefd0, msg=0x7f793c2feca0) at rpc/virnetserver.c:135 #23 0x00007f793b1f5f3a in virNetServerHandleJob (jobOpaque=0x7f793c302c60, opaque=0x7f793c2e0a90) at rpc/virnetserver.c:156 #24 0x00007f793b09d9f7 in virThreadPoolWorker (opaque=0x7f793c2e0750) at util/virthreadpool.c:145 #25 0x00007f793b09d00a in virThreadHelper (data=0x7f793c2eced0) at util/virthread.c:206 #26 0x00007f79375c1ee5 in start_thread (arg=0x7f792a92c700) at pthread_create.c:309 #27 0x00007f79372f0d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 here i select frame 3 and look who holds the lock: (gdb) frame 3 #3 0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89 89 pthread_mutex_lock(&m->lock); $6 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312, __nusers = 1, __kind = 512, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000 \005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>, __align = 2}} so i see that thread 21 waits for a lock that holds by thread 21 from frame 4 i see, that the process want a lock for vm object: #4 0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323 further down in the bt i see in frame 14: #14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231 that vm-object 0x7f79140099c0 is the object for that also a network-device should be attached. And from that i know that there is a lock on the vm-object. so i go up in the bt an see that libxlDomainAttachDeviceLive(libvirtd) (frame14) calls libxlDomainAttachNetDevice(libvirtd) (frame13), that calls libxl_device_nic_add(libxl/xen) (frame12) from info frame for frame12 i see that he calls frame 0x7f792a92b5f0: (gdb) info f Stack level 12, frame at 0x7f792a92b670: rip = 0x7f7938048fa6 in libxl_device_nic_add (libxl.c:4226); saved rip = 0x7f78fe402baa called by frame at 0x7f792a92b760, caller of frame at 0x7f792a92b5f0 source language c. Arglist at 0x7f792a92b5e8, args: ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out> Locals at 0x7f792a92b5e8, Previous frame's sp is 0x7f792a92b670 Saved registers: rbx at 0x7f792a92b640, rbp at 0x7f792a92b648, r12 at 0x7f792a92b650, r13 at 0x7f792a92b658, r14 at 0x7f792a92b660, rip at 0x7f792a92b668 and this is frame 11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014 so further i see that libxl__ao_inprogress calls libxl__egc_cleanup and that calls egc_run_callbacks, frame 9 info frame for frame 9 gives: (gdb) info f Stack level 9, frame at 0x7f792a92b540: rip = 0x7f793806f7c0 in egc_run_callbacks (libxl_event.c:1405); saved rip = 0x7f7938070f8b called by frame at 0x7f792a92b5f0, caller of frame at 0x7f792a92b500 source language c. Arglist at 0x7f792a92b4f8, args: egc=egc@entry=0x7f792a92b580 Locals at 0x7f792a92b4f8, Previous frame's sp is 0x7f792a92b540 Saved registers: rbx at 0x7f792a92b520, rbp at 0x7f792a92b528, r12 at 0x7f792a92b530, rip at 0x7f792a92b538 and list gives: (gdb) list 1400 1401 LIBXL_TAILQ_FOREACH_SAFE(ev, &egc->occurred_for_callback, link, ev_tmp) { 1402 LIBXL_TAILQ_REMOVE(&egc->occurred_for_callback, ev, link); 1403 LOG(DEBUG,"event %p callback type=%s", 1404 ev, libxl_event_type_to_string(ev->type)); 1405 CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev); 1406 } 1407 1408 LIBXL_TAILQ_FOREACH_SAFE(aop, &egc->aops_for_callback, entry, aop_tmp) { 1409 LIBXL_TAILQ_REMOVE(&egc->aops_for_callback, aop, entry); from here i see that egc_run_callbacks is at line 1405 " CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);" , but i have no idea what he is doing there and while it "hangs" here!?! if i see it correctly, the systems freeze/hangs because it waits of the release of the lock for the vm which actuall attach a network-device, and the process of attaching the network-device didn Ìt finished and hold the lock. as i say at the geginning of my mail, if i only attach a block-device to the vm`s every thing runs, and with additional attaching a network-device after 20 to 100 resume , it freeze´s. i am gratefull for any help,hints and tips for further debugging. all the best max for completness , here are the output of "(gdb) thread apply all bt full": [trimmed by moderator] -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list