Tsan is a heavilly reworked version of helgrind the data race detector based on Valgrind (at least for linux) It is developped by Google and described/available there: http://code.google.com/p/data-race-test/wiki/ThreadSanitizer as libvirt code is highly threaded I though that we should try testing it. Helgrind used to not be very practicable but Google suggest they significantly speed up the process and decreased the number of false positive. I tried to build it from source, but it failed in some ways by not finding the i386/686 libraries build when run on my x86_64, could be related to differences in distribution, they use only Ubuntu, and I tested on Fedora 16, I didn't tried to really solve the issue at that point. One of the problem of such a configuration is that the program doesn't have what could be specific Fedora exception lists for valgrind. So I grabbed the prebuilt version tsan-r4363-amd64-linux-self-contained.sh (it's a self extracting binary in shell script) from http://build.chromium.org/p/client.tsan/binaries/ renamid tsan.sh and installed it in my path, works fine... I did a first test run with libxml2 code I know very well and the result were good, i.e. it pointed out a data race I know about and raised another one about some global data initialization, so fairly positive from my POV. Then I instrumented libvirt daemon for 0.9.12 release, launched it as root from command line as a tsan argument: /tmp/tsan.sh --trace-children=no /usr/sbin/libvirtd and then played with virt-manager to exercise it a bit. At first I tried to use --log-file=tsan.log option to collect the logs but the logfile got corrupted so the following is just cut and paste from the output One of the ouput raised frequently is: ---------------------------------------------------------------------- ==5838== WARNING: Possible data race during write of size 4 at 0x34723B3BD0: {{{ ==5838== T1 (L{L17, L125}): ==5838== #0 ptmalloc_unlock_all2 /lib64/libc-2.14.90.so ==5838== #1 fork /lib64/libc-2.14.90.so ==5838== #2 virFork /usr/src/debug/libvirt-0.9.12/src/util/command.c:268 ==5838== #3 virExecWithHook.constprop.5 /usr/src/debug/libvirt-0.9.12/src/util/command.c:489 ==5838== #4 virCommandRunAsync /usr/src/debug/libvirt-0.9.12/src/util/command.c:2177 ==5838== #5 virCommandRun /usr/src/debug/libvirt-0.9.12/src/util/command.c:1970 ==5838== #6 qemuCapsExtractVersionInfo /usr/src/debug/libvirt-0.9.12/src/qemu/qemu_capabilities.c:1379 ==5838== #7 qemuCapsExtractVersion /usr/src/debug/libvirt-0.9.12/src/qemu/qemu_capabilities.c:1568 ==5838== #8 qemudGetVersion /usr/src/debug/libvirt-0.9.12/src/qemu/qemu_driver.c:1320 ==5838== #9 virConnectGetVersion /usr/src/debug/libvirt-0.9.12/src/libvirt.c:1640 ==5838== #10 remoteDispatchGetVersionHelper /usr/src/debug/libvirt-0.9.12/daemon/remote_dispatch.h:6862 ==5838== #11 virNetServerProgramDispatch /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416 ==5838== Concurrent read(s) happened at (OR AFTER) these points: ==5838== T11 (L{}): ==5838== #0 arena_thread_freeres /lib64/libc-2.14.90.so ==5838== #1 __libc_thread_freeres /lib64/libc-2.14.90.so ==5838== #2 start_thread /lib64/libpthread-2.14.90.so ==5838== Address 0x34723B3BD0 is 0 bytes inside data symbol "list_lock" ==5838== Locks involved in this report (reporting last lock sites): {L17, L125} ==5838== L17 (0x5339B00) ==5838== #0 pthread_mutex_lock /mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:935 ==5838== #1 virFork /usr/src/debug/libvirt-0.9.12/src/util/command.c:266 ==5838== #2 virExecWithHook.constprop.5 /usr/src/debug/libvirt-0.9.12/src/util/command.c:489 ==5838== #3 virCommandRunAsync /usr/src/debug/libvirt-0.9.12/src/util/command.c:2177 ==5838== #4 virCommandRun /usr/src/debug/libvirt-0.9.12/src/util/command.c:1970 ==5838== #5 qemuCapsExtractVersionInfo /usr/src/debug/libvirt-0.9.12/src/qemu/qemu_capabilities.c:1379 ==5838== #6 qemuCapsExtractVersion /usr/src/debug/libvirt-0.9.12/src/qemu/qemu_capabilities.c:1568 ==5838== #7 qemudGetVersion /usr/src/debug/libvirt-0.9.12/src/qemu/qemu_driver.c:1320 ==5838== #8 virConnectGetVersion /usr/src/debug/libvirt-0.9.12/src/libvirt.c:1640 ==5838== #9 remoteDispatchGetVersionHelper /usr/src/debug/libvirt-0.9.12/daemon/remote_dispatch.h:6862 ==5838== #10 virNetServerProgramDispatch /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416 ==5838== #11 virNetServerHandleJob /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161 ==5838== L125 (0x4C0A0A00) ==5838== #0 pthread_mutex_lock /mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:935 ==5838== #1 qemudGetVersion /usr/src/debug/libvirt-0.9.12/src/qemu/qemu_driver.c:1319 ==5838== #2 virConnectGetVersion /usr/src/debug/libvirt-0.9.12/src/libvirt.c:1640 ==5838== #3 remoteDispatchGetVersionHelper /usr/src/debug/libvirt-0.9.12/daemon/remote_dispatch.h:6862 ==5838== #4 virNetServerProgramDispatch /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416 ==5838== #5 virNetServerHandleJob /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161 ==5838== #6 virThreadPoolWorker /usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144 ==5838== #7 virThreadHelper /usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161 ==5838== Race verifier data: 0x347207CF98,0x347215D6BB ==5838== }}} I don't know what "Race verifier data" means but it seems that we do *pid = fork(); in the main thread and that another thread is accessing without synchronization. The probram properly report that 2 locks are held at that point, the virLogLock() and qemuDriverLock() and give full stack trace on where those locks were taken. That's convenient but I could not really spot where the second thread did the access, I suspect that's somehow related to libpthread considering the minimal stack trace of the second thread T11 ... Unfortunately, it seems that each time we fork we hit one or two issue like that (which could be false positive), and tend to fill logs relatively fast :-\ I however found out a couple of issues not related to libpthread: ==5733== WARNING: Possible data race during write of size 1 at 0x34723B6B14: {{{ ==5733== T5 (L{}): ==5733== #0 Replace_memcpy /mnt/data/build/slave/full_linux_build/build/tsan/ts_replace.h:113 ==5733== #1 __GI_memcpy /mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:2359 ==5733== #2 xdrmem_getbytes /lib64/libc-2.14.90.so ==5733== #3 xdr_string /lib64/libc-2.14.90.so ==5733== #4 xdr_remote_nonnull_string /usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:31 ==5733== #5 xdr_remote_nonnull_domain /usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:58 ==5733== #6 xdr_remote_domain_get_info_args /usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:1390 ==5733== #7 virNetMessageDecodePayload /usr/src/debug/libvirt-0.9.12/src/rpc/virnetmessage.c:382 ==5733== #8 virNetServerProgramDispatch /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:403 ==5733== #9 virNetServerHandleJob /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161 ==5733== #10 virThreadPoolWorker /usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144 ==5733== #11 virThreadHelper /usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161 ==5733== Concurrent write(s) happened at (OR AFTER) these points: ==5733== T2 (L{}): ==5733== #0 xdrmem_getbytes /lib64/libc-2.14.90.so ==5733== #1 xdr_string /lib64/libc-2.14.90.so ==5733== #2 xdr_remote_nonnull_string /usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:31 ==5733== #3 xdr_remote_nonnull_domain /usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:58 ==5733== #4 xdr_remote_domain_get_xml_desc_args /usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:1532 ==5733== #5 virNetMessageDecodePayload /usr/src/debug/libvirt-0.9.12/src/rpc/virnetmessage.c:382 ==5733== #6 virNetServerProgramDispatch /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:403 ==5733== #7 virNetServerHandleJob /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161 ==5733== #8 virThreadPoolWorker /usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144 ==5733== #9 virThreadHelper /usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161 ==5733== Address 0x34723B6B14 is 0 bytes inside data symbol "crud.6424" ==5733== Race verifier data: 0x4A091CE,0x347212579D ==5733== }}} that one could potentially be nasty, it seems that 2 threads decoding rpc payload in parallel hit an unsynchronized write access in the xdr routine for xdr_remote_nonnull_string() . I didn't had the debuginfo for glibc installed when I ran it so I don't have the exact informations at the top of the stacks The other one raised is actually something I had initially found with the first run on libxml2 so it's relatively consistent: ==5733== WARNING: Possible data race during read of size 4 at 0x34723B10B4: {{{ ==5733== T1 (L{L173}): ==5733== #0 random_r /lib64/libc-2.14.90.so ==5733== #1 random /lib64/libc-2.14.90.so ==5733== #2 rand /lib64/libc-2.14.90.so ==5733== #3 xmlDictCreate /usr/src/debug/libxml2-2.8.0/dict.c:491 ==5733== #4 xmlInitParserCtxt /usr/src/debug/libxml2-2.8.0/parserInternals.c:1620 ==5733== #5 xmlNewParserCtxt /usr/src/debug/libxml2-2.8.0/parserInternals.c:1874 ==5733== #6 xmlCreateURLParserCtxt /usr/src/debug/libxml2-2.8.0/parser.c:13546 ==5733== #7 xmlReadFile /usr/src/debug/libxml2-2.8.0/parser.c:14713 ==5733== #8 xmlRelaxNGParse /usr/src/debug/libxml2-2.8.0/relaxng.c:7480 ==5733== #9 (no symbols) /usr/lib64/libnetcf.so.1.4.0 ==5733== #10 ncf_init /usr/lib64/libnetcf.so.1.4.0 ==5733== #11 interfaceOpenInterface /usr/src/debug/libvirt-0.9.12/src/interface/netcf_driver.c:144 ==5733== Concurrent write(s) happened at (OR AFTER) these points: ==5733== T10 (L{L156}): ==5733== #0 random_r /lib64/libc-2.14.90.so ==5733== #1 random /lib64/libc-2.14.90.so ==5733== #2 rand /lib64/libc-2.14.90.so ==5733== #3 xmlHashCreate /usr/src/debug/libxml2-2.8.0/hash.c:193 ==5733== #4 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:4125 ==5733== #5 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6380 ==5733== #6 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6523 ==5733== #7 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6319 ==5733== #8 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6520 ==5733== #9 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6523 ==5733== Address 0x34723B10B4 is 20 bytes inside data symbol "randtbl" ==5733== Locks involved in this report (reporting last lock sites): {L156, L173} ==5733== L156 (0x402E790) ==5733== #0 pthread_mutex_lock /mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:935 ==5733== #1 remoteDispatchOpenHelper /usr/src/debug/libvirt-0.9.12/daemon/remote.c:683 ==5733== #2 virNetServerProgramDispatch /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416 ==5733== #3 virNetServerHandleJob /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161 ==5733== #4 virThreadPoolWorker /usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144 ==5733== #5 virThreadHelper /usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161 ==5733== L173 (0x402F420) ==5733== #0 pthread_mutex_lock /mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:935 ==5733== #1 remoteDispatchOpenHelper /usr/src/debug/libvirt-0.9.12/daemon/remote.c:683 ==5733== #2 virNetServerProgramDispatch /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416 ==5733== #3 virNetServerHandleJob /usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161 ==5733== #4 virThreadPoolWorker /usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144 ==5733== #5 virThreadHelper /usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161 ==5733== Race verifier data: 0x347203A572,0x347203A565 ==5733== }}} I didn't know that netcf was doing Relax-NG validation on-the-fly :) and hit another in libxml2 though libvirt/netcf/libxslt/libxml2 call: ==5733== WARNING: Possible data race during read of size 4 at 0x34723B10B4: {{{ ==5733== T1 (L{L173}): ==5733== #0 random_r /lib64/libc-2.14.90.so ==5733== #1 random /lib64/libc-2.14.90.so ==5733== #2 rand /lib64/libc-2.14.90.so ==5733== #3 xmlDictCreate /usr/src/debug/libxml2-2.8.0/dict.c:491 ==5733== #4 xmlInitParserCtxt /usr/src/debug/libxml2-2.8.0/parserInternals.c:1620 ==5733== #5 xmlNewParserCtxt /usr/src/debug/libxml2-2.8.0/parserInternals.c:1874 ==5733== #6 xmlCreateURLParserCtxt /usr/src/debug/libxml2-2.8.0/parser.c:13546 ==5733== #7 xmlReadFile /usr/src/debug/libxml2-2.8.0/parser.c:14713 ==5733== #8 xmlRelaxNGParse /usr/src/debug/libxml2-2.8.0/relaxng.c:7480 ==5733== #9 (no symbols) /usr/lib64/libnetcf.so.1.4.0 ==5733== #10 ncf_init /usr/lib64/libnetcf.so.1.4.0 ==5733== #11 interfaceOpenInterface /usr/src/debug/libvirt-0.9.12/src/interface/netcf_driver.c:144 ==5733== Concurrent write(s) happened at (OR AFTER) these points: ==5733== T10 (L{L156}): ==5733== #0 random_r /lib64/libc-2.14.90.so ==5733== #1 random /lib64/libc-2.14.90.so ==5733== #2 rand /lib64/libc-2.14.90.so ==5733== #3 xmlHashCreate /usr/src/debug/libxml2-2.8.0/hash.c:193 ==5733== #4 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:4125 ==5733== #5 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6380 ==5733== #6 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6523 ==5733== #7 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6319 ==5733== #8 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6520 ==5733== #9 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6523 ==5733== Address 0x34723B10B4 is 20 bytes inside data symbol "randtbl" ==5733== Locks involved in this report (reporting last lock sites): {L156, L173} Looks like another libxml2 bug calling rand, I should use rand_r() instead if available. I also ran virsh itself under tsan for a few functions, but that didn't raise any issue. So except for the parallel xdr_remote_nonnull_string() issue I don't think there is anything I can report at this point for libvirt code itself. The slowdown of tsan is acceptable, virt-manager is a clearly a bit less responsive when connecting of doing libvirt operations but it is really usable, and that's IMHO the main point of my testing, it's not hard (though it would be better if packaged for Fedora) and the results are interesting but a bit heavy since the same issue can pop-up all the time like for fork(). I guess it would be interesting to try out qemu-kvm with it though it may turn unbearably slow, but it's worth testing ! Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@xxxxxxxxxxxx | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/ -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list