On 05/02/2012 11:32 AM, Daniel P. Berrange wrote:
> On Wed, May 02, 2012 at 11:29:36AM -0400, Laine Stump wrote:
>> On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
>>> On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
>>>> This patch is one alternative to solve the problem detailed in:
>>>>
>>>>   https://bugzilla.redhat.com/show_bug.cgi?id=816465
>>>>
>>>> Some other unidentified library in use by libvirtd (in another thread)
>>>> is apparently temporarily binding to a NETLINK_ROUTE raw socket with
>>>> an address of "pid of libvirtd" during startup.
>>> Can you identify this library.
>>
>> I made a few attempts, but didn't have any luck and decided to post
>> these patches based on the other evidence I'd gathered. I agree that I
>> would much prefer understanding who is doing this, even if it doesn't
>> change the workaround method.
>>
>>
>>> It should be possible to do so using
>>> systemtap without all that much trouble.
>> My full experience with systemtap is using some of the examples from
>> your blog posting on the topic :-)
> I assume you mean this one
>
>   http://berrange.com/posts/2011/11/30/watching-the-libvirt-rpc-protocol-using-systemtap/

Yes, that's the one. I wasn't actually interested in watching the rpc
protocol, but the interaction between libvirtd and the qemu monitor,
which was very helpful.

>> I would love to figure this out, though. The complicating factor I can
>> see (aside from me needing to learn how to write a systemtap script) is
>> that in this case stap needs to be run on a daemonizing process, from
>> the very beginning. If you can give me any better advice than "go read
>> the systemtap website", please do.
> I can't help today, but ping me on IRC tomorrow and I'll help you
> get sorted with systemtap. You can start the stap scripts before
> even running libvirtd, so there's no issue with the daemonizing
> side of things.
With some help from mjw in #systemtap on freenode, I was able to figure
out how to use systemtap to print a backtrace of all calls to bind().
Although the failures ceased as soon as I turned on the tracing (of
course), it did at least give me a list of bind calls to research. It
turns out that this is the interesting one (or one example of it,
anyway):

[23876,init
 0x35b90e8277 : bind+0x7/0x30 [/lib64/libc-2.12.so]
 0x35b910e540 : __check_pf+0x80/0xf0 [/lib64/libc-2.12.so]
 0x35b90d1ab7 : getaddrinfo+0xe7/0x890 [/lib64/libc-2.12.so]
 0x7fa695f1e61d : virSocketAddrParse+0x4d/0x190 [/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f47f2a : virNetworkIPParseXML+0xaa/0x4c0 [/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f48f37 : virNetworkDefParseNode+0xbf7/0x19e0 [/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f49d77 : virNetworkDefParse+0x57/0x70 [/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f49e2c : virNetworkLoadConfig+0x8c/0x1b0 [/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f49fb3 : virNetworkLoadAllConfigs+0x63/0x100 [/usr/lib64/libvirt.so.0.9.10]
 0x4d5f97 : networkStartup+0x157/0x460 [/usr/sbin/libvirtd]
 0x7fa695f806d0 : virStateInitialize+0x60/0xd0 [/usr/lib64/libvirt.so.0.9.10]
 0x420ff1 : daemonRunStateInit+0x11/0x80 [/usr/sbin/libvirtd]
 0x7fa695f08749 : virThreadHelper+0x29/0x40 [/usr/lib64/libvirt.so.0.9.10]
 0x35b9c07851 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so]
 0x35b90e767d : __clone+0x6d/0x90 [/lib64/libc-2.12.so]
]

__check_pf() is in glibc (sysdeps/unix/sysv/linux/check_pf.c). It
calls socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE) directly (not through
libnl), sets the nladdr to 0's, then bind()s it. In the kernel,
netlink_bind() uses 0 as an indicator that it should auto-bind,
preferring the pid of the calling process (i.e. "pid of libvirtd") as
its nl_pid in the nladdr. This NETLINK socket is used for a short
period to get a list of interface addresses, and is then closed.
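The socket/bind/close sequence described above can be reproduced from
userspace. What follows is only a rough sketch, written with Python's
socket module rather than C, and it is Linux-only. The helper names
(autobind_route_socket, netlink_address, try_fixed_bind) are made up
for this illustration and are not part of glibc, libnl or libvirt. It
assumes the kernel lets an unprivileged process open a NETLINK_ROUTE
socket, and which address the kernel actually picks may differ from
system to system.

```python
import errno
import os
import socket


def autobind_route_socket():
    """Open a NETLINK_ROUTE raw socket and bind it with nl_pid = 0.

    A zero nl_pid asks the kernel to choose the address (auto-bind).
    """
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    sock.bind((0, 0))  # (nl_pid, nl_groups), both zero
    return sock


def netlink_address(sock):
    """Return the socket's nl_pid reinterpreted as a signed 32-bit value.

    Python reports nl_pid as an unsigned integer, so large values are
    folded back into the negative range here.
    """
    nl_pid = sock.getsockname()[0]
    return nl_pid - (1 << 32) if nl_pid >= (1 << 31) else nl_pid


def try_fixed_bind(nl_pid):
    """Bind a fresh socket to a fixed nl_pid; return 0 on success, else the errno."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    try:
        sock.bind((nl_pid, 0))
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


if __name__ == "__main__":
    probe = autobind_route_socket()
    addr = probe.getsockname()[0]
    print("pid:", os.getpid(), "auto-bound address:", netlink_address(probe))
    # While the auto-bound socket is open, its address is taken.
    print("fixed bind while probe is open:", try_fixed_bind(addr))
    probe.close()
    # Once it is closed, the same address can be bound again.
    print("fixed bind after probe is closed:", try_fixed_bind(addr))
```

While the auto-bound socket is open, a second bind() to the same
address should come back with EADDRINUSE; once the socket is closed,
the address is available again.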
Once main() has started up its other threads, these threads may call
virSocketAddrParse() (and thus __check_pf()) any number of times,
creating many socket/bind/close cycles of NETLINK sockets.

Meanwhile, in the main thread, virNetlinkEventServiceStart() is the
first function in libvirtd to call libnl's nl_handle_alloc(), which
mistakenly assumes that it has full control over netlink sockets, and
that it can assign the address of "pid of libvirtd" to this nlhandle.
Shortly after that, nl_connect() is called, which calls bind() with a
*fixed* address of "pid of libvirtd". If another thread happens to
currently be in a call to __check_pf(), we lose the lottery and bind()
fails. If not, we win the lottery, bind() succeeds, and future calls to
bind() by __check_pf() will auto-bind to a different address.

The two allocation schemes differ. libnl assigns subsequent sockets the
address "pid + (n << 22)", with a maximum of 1024 sockets per process
(i.e. it will always be positive). Auto-binds in the kernel assign the
first free address found between -2047 and -2,147,483,648 (i.e. it will
always be negative).

So, the conclusions to draw from this analysis are:

1) My "alternative 1" patch was only coincidentally succeeding, and
   would be about as useful as everyone removing their shoes at airport
   security checkpoints.

2) If libvirtd has multiple threads started up before any netlink
   sockets have been bound to "pid of libvirtd", there is a possibility
   that the first call to nl_connect() will fail (due to another thread
   being in getaddrinfo()/__check_pf()). This is just as true for the
   macvtap and netcf uses of libnl as for the virNetlinkEventService
   use.
3) Once the first call to nl_connect() is successfully completed
   (and/or if an extra (and otherwise unused) nlhandle is created with
   nl_handle_alloc() before creating any nlhandles that are
   subsequently nl_connect()ed), the likelihood of a subsequent
   nl_connect() failure is effectively 0, since the address space used
   by libnl is all positive 32 bit numbers, and the address space used
   by the auto-bind address in the kernel is (almost) all negative 32
   bit numbers.

4) libnl should, at the very least, be modified to not use exactly
   nl_pid = pid, since there is a very high likelihood that particular
   address will already be taken by a library function that is calling
   bind() directly, rather than through libnl. Really, its API
   shouldn't allow applications to retrieve the bind address used until
   after nl_connect() has already completed successfully;
   unfortunately, that would require an incompatible change in the API.

Now that I completely understand the problem, I actually think that
neither of these patches is quite correct: the first because it is
simply bogus, and the second because it only solves the problem for
virNetlinkEventService. It still leaves open the possibility that
macvtap or netcf usage of libnl could result in a failure (although
*only* if one of those uses happened to be called prior to
virNetlinkEventService).

To be 100% safe, I think what we need to do is put an extra call to
nl_handle_alloc() very early in main(), prior to calling
virNetServerNew(), which is when all the other worker threads are
created. I'll put together such a patch and send it to the list later
tonight.
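
To illustrate why an early, otherwise unused nl_handle_alloc() helps,
here is a toy model of the address assignment as described in this
mail. It is not libnl code: the class NlAddressModel and the function
addresses_after_reservation are invented for this sketch, and the only
things taken from the analysis are the "pid + (n << 22)" formula and
the limit of 1024 handles per process. The pid 23876 in the usage
example is simply the one that appears in the backtrace.

```python
MAX_HANDLES = 1024  # per-process limit, as stated in the analysis


class NlAddressModel:
    """Toy model of per-process netlink address assignment (not libnl code)."""

    def __init__(self, pid):
        self.pid = pid
        self.in_use = set()  # slot numbers n that are currently reserved

    def handle_alloc(self):
        """Reserve the lowest free slot n and return pid + (n << 22)."""
        for n in range(MAX_HANDLES):
            if n not in self.in_use:
                self.in_use.add(n)
                return self.pid + (n << 22)
        raise RuntimeError("no free netlink address slots")


def addresses_after_reservation(pid, count):
    """Allocate a throwaway handle first, then `count` real ones.

    The throwaway handle takes slot 0, whose address is exactly `pid`,
    and is never connected, so no later handle tries to bind to `pid`.
    """
    model = NlAddressModel(pid)
    placeholder = model.handle_alloc()
    real = [model.handle_alloc() for _ in range(count)]
    return placeholder, real


if __name__ == "__main__":
    placeholder, real = addresses_after_reservation(23876, 3)
    print("reserved and unused:", placeholder)
    print("addresses handed to later handles:", real)
```

In this model the placeholder absorbs the one address that
__check_pf()'s auto-bind is likely to be holding, so every handle that
is later connected asks for an address with n >= 1.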