On Wed, May 02, 2012 at 04:35:48PM -0400, Laine Stump wrote:
> On 05/02/2012 11:32 AM, Daniel P. Berrange wrote:
> > On Wed, May 02, 2012 at 11:29:36AM -0400, Laine Stump wrote:
> >> On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
> >>> On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
> >>>> This patch is one alternative to solve the problem detailed in:
> >>>>
> >>>>   https://bugzilla.redhat.com/show_bug.cgi?id=816465
> >>>>
> >>>> Some other unidentified library in use by libvirtd (in another thread)
> >>>> is apparently temporarily binding to a NETLINK_ROUTE raw socket with
> >>>> an address of "pid of libvirtd" during startup.
> >>> Can you identify this library.
> >>
> >> I made a few attempts, but didn't have any luck and decided to post
> >> these patches based on the other evidence I'd gathered. I agree that I
> >> would much prefer understanding who is doing this, even if it doesn't
> >> change the workaround method.
> >>
> >>
> >>> It should be possible to do so using
> >>> systemtap without all that much trouble.
> >> My full experience with systemtap is using some of the examples from
> >> your blog posting on the topic :-)
> > I assume you mean this one
> >
> >   http://berrange.com/posts/2011/11/30/watching-the-libvirt-rpc-protocol-using-systemtap/
>
> Yes, that's the one. I wasn't actually interested in watching the rpc
> protocol, but the interaction between libvirtd and the qemu monitor,
> which was very helpful.
>
>
> >> I would love to figure this out, though. The complicating factor I can
> >> see (aside from me needing to learn how to write a systemtap script) is
> >> that in this case stap needs to be run on a daemonizing process, from
> >> the very beginning. If you can give me any better advice than "go read
> >> the systemtap website", please do.
> > I can't help today, but ping me on IRC tomorrow and I'll help you
> > get sorted with systemtap.
> > You can start the stap scripts before
> > even running libvirtd, so there's no issue with the daemonizing
> > side of things.
>
>
> With some help from mjw in #systemtap on freenode, I was able to figure
> out how to use systemtap to print a backtrace of all calls to bind, and
> although the failures ceased as soon as I turned on the tracing (of
> course), it did at least give me a list of bind calls to research.
>
> It turns out that this is the interesting one (or one example of it,
> anyway):
>
> [23876,init
>  0x35b90e8277 : bind+0x7/0x30 [/lib64/libc-2.12.so]
>  0x35b910e540 : __check_pf+0x80/0xf0 [/lib64/libc-2.12.so]
>  0x35b90d1ab7 : getaddrinfo+0xe7/0x890 [/lib64/libc-2.12.so]
>  0x7fa695f1e61d : virSocketAddrParse+0x4d/0x190 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f47f2a : virNetworkIPParseXML+0xaa/0x4c0 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f48f37 : virNetworkDefParseNode+0xbf7/0x19e0 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f49d77 : virNetworkDefParse+0x57/0x70 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f49e2c : virNetworkLoadConfig+0x8c/0x1b0 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f49fb3 : virNetworkLoadAllConfigs+0x63/0x100 [/usr/lib64/libvirt.so.0.9.10]
>  0x4d5f97 : networkStartup+0x157/0x460 [/usr/sbin/libvirtd]
>  0x7fa695f806d0 : virStateInitialize+0x60/0xd0 [/usr/lib64/libvirt.so.0.9.10]
>  0x420ff1 : daemonRunStateInit+0x11/0x80 [/usr/sbin/libvirtd]
>  0x7fa695f08749 : virThreadHelper+0x29/0x40 [/usr/lib64/libvirt.so.0.9.10]
>  0x35b9c07851 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so]
>  0x35b90e767d : __clone+0x6d/0x90 [/lib64/libc-2.12.so]
> ]
>
> __check_pf() is in glibc - sysdeps/unix/sysv/linux/check_pf.c, and it
> does directly (not through libnl) call socket(PF_NETLINK, SOCK_RAW,
> NETLINK_ROUTE), set the nladdr to 0's, then bind() it. In the kernel,
> netlink_bind() uses 0 as an indicator that it should auto-bind,
> preferring the pid of the calling process (i.e. "pid of libvirtd") as
> its nl_pid in the nladdr.
> This NETLINK socket is used for a short period
> to get a list of interface addresses, and is then closed.
>
> Once main() has started up its other threads, these threads may call
> virSocketAddrParse (and thus __check_pf()) any number of times, creating
> many socket/bind/close cycles of NETLINK sockets. Meanwhile, in the main
> thread, virNetlinkEventServiceStart() is the first function in libvirtd
> to call libnl's nl_handle_alloc(), which mistakenly assumes that it has
> full control over netlink sockets, and that it can assign the address of
> "pid of libvirtd" to this nlhandle. Shortly after that, nl_connect() is
> called, which calls bind() with a *fixed* address of "pid of libvirtd".
> If another thread happens to currently be in a call to __check_pf(), we
> lose the lottery and bind() fails. If not, we win the lottery, bind()
> succeeds, and future calls to bind() by __check_pf() will auto-bind to a
> different address (unlike with libnl, which assigns subsequent sockets
> the address of "pid + (n << 22)" with a maximum of 1024 sockets per
> process (i.e. it will always be positive), auto-binds in the kernel will
> assign the first free address found between -2047 and -2,147,483,648
> (i.e. it will always be negative)).
>
> So, the conclusions to draw from this analysis are:
>
> 1) my "alternative 1" patch was only coincidentally succeeding, and
> would be about as useful as everyone removing their shoes at airport
> security checkpoints.
>
> 2) If libvirtd has multiple threads started up before any netlink
> sockets have been bound to "pid of libvirtd", there is a possibility
> that the first call to nl_connect will fail (due to another thread being
> in getaddrinfo/__check_pf()). This is just as true for the macvtap and
> netcf uses of libnl as for the virNetlinkEventService use.
>
> 3) Once the first call to nl_connect is successfully completed (and/or
> if an extra (and otherwise unused) nlhandle is created with
> nl_handle_alloc() before creating any nlhandles that are subsequently
> nl_connect()ed), the likelihood of a subsequent nl_connect() failure is
> effectively 0, since the address space used by libnl is all positive 32
> bit numbers, and the address space used by the auto-bind address in the
> kernel is (almost) all negative 32 bit numbers.
>
> 4) libnl should, at the very least, be modified to not use exactly
> nl_pid = pid, since there is a very high likelihood that particular
> address will already be taken by a library function that is calling bind
> directly, rather than through libnl. Really, its API shouldn't allow
> applications to retrieve the bind address used until after nl_connect()
> has already completed successfully; unfortunately, that would require an
> incompatible change in the API.
>
> Now that I completely understand the problem, I actually think that
> neither of these patches is quite correct; the first because it is
> simply bogus, and the second because it only solves the problem for
> virNetlinkEventService - it still leaves open the possibility that
> macvtap or netcf usage of libnl could result in a failure (although
> *only* if one of those uses happened to be called prior to
> virNetlinkEventService).
>
> To be 100% safe, I think what we need to do is put an extra call to
> nl_handle_alloc() very early in main, prior to calling
> virNetServerNew(), which is when all the other worker threads are
> created. I'll put together such a patch and send it to the list later
> tonight.

Wow, thanks for figuring this out. It is all far worse than I imagined :-(

Clearly libnl is broken here, but I guess it is dead upstream in favour
of libnl3. I wonder if that shares the same problem.

Agree that creating a netlink handle in libvirtd main() sounds like
a way to work around it.

Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|