Re: [PATCH alternative 1] util: fix libvirtd startup failure due to netlink error

"Daniel P. Berrange" <berrange@xxxxxxxxxx> · Thu, 3 May 2012 09:09:34 +0100

On Wed, May 02, 2012 at 04:35:48PM -0400, Laine Stump wrote:
> On 05/02/2012 11:32 AM, Daniel P. Berrange wrote:
> > On Wed, May 02, 2012 at 11:29:36AM -0400, Laine Stump wrote:
> >> On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
> >>> On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
> >>>> This patch is one alternative to solve the problem detailed in:
> >>>>
> >>>>   https://bugzilla.redhat.com/show_bug.cgi?id=816465
> >>>>
> >>>> Some other unidentified library in use by libvirtd (in another thread)
> >>>> is apparently temporarily binding to a NETLINK_ROUTE raw socket with
> >>>> an address of "pid of libvirtd" during startup. 
> >>> Can you identify this library?
> >>
> >> I made a few attempts, but didn't have any luck and decided to post
> >> these patches based on the other evidence I'd gathered. I agree that I
> >> would much prefer understanding who is doing this, even if it doesn't
> >> change the workaround method.
> >>
> >>
> >>>  It should be possible to do so using
> >>> systemtap without all that much trouble.
> >> My full experience with systemtap is using some of the examples from
> >> your blog posting on the topic :-)
> > I assume you mean this one
> >
> > http://berrange.com/posts/2011/11/30/watching-the-libvirt-rpc-protocol-using-systemtap/
> 
> Yes, that's the one. I wasn't actually interested in watching the rpc
> protocol, but the interaction between libvirtd and the qemu monitor,
> which was very helpful.
> 
> 
> >> I would love to figure this out, though. The complicating factor I can
> >> see (aside from me needing to learn how to write a systemtap script) is
> >> that in this case stap needs to be run on a daemonizing process, from
> >> the very beginning. If you can give me any better advice than "go read
> >> the systemtap website", please do.
> > I can't help today, but ping me on IRC tomorrow and I'll help you
> > get sorted with systemtap. You can start the stap scripts before
> > even running libvirtd, so there's no issue with the daemonizing
> > side of things.
> 
> 
> With some help from mjw in #systemtap on freenode, I was able to figure
> out how to use systemtap to print a backtrace of all calls to bind, and
> although the failures ceased as soon as I turned on the tracing (of
> course), it did at least give me a list of bind calls to research.
> 
> It turns out that this is the interesting one (or one example of it,
> anyway):
> 
> [23876,init
>  0x35b90e8277 : bind+0x7/0x30 [/lib64/libc-2.12.so]
>  0x35b910e540 : __check_pf+0x80/0xf0 [/lib64/libc-2.12.so]
>  0x35b90d1ab7 : getaddrinfo+0xe7/0x890 [/lib64/libc-2.12.so]
>  0x7fa695f1e61d : virSocketAddrParse+0x4d/0x190 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f47f2a : virNetworkIPParseXML+0xaa/0x4c0 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f48f37 : virNetworkDefParseNode+0xbf7/0x19e0 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f49d77 : virNetworkDefParse+0x57/0x70 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f49e2c : virNetworkLoadConfig+0x8c/0x1b0 [/usr/lib64/libvirt.so.0.9.10]
>  0x7fa695f49fb3 : virNetworkLoadAllConfigs+0x63/0x100 [/usr/lib64/libvirt.so.0.9.10]
>  0x4d5f97 : networkStartup+0x157/0x460 [/usr/sbin/libvirtd]
>  0x7fa695f806d0 : virStateInitialize+0x60/0xd0 [/usr/lib64/libvirt.so.0.9.10]
>  0x420ff1 : daemonRunStateInit+0x11/0x80 [/usr/sbin/libvirtd]
>  0x7fa695f08749 : virThreadHelper+0x29/0x40 [/usr/lib64/libvirt.so.0.9.10]
>  0x35b9c07851 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so]
>  0x35b90e767d : __clone+0x6d/0x90 [/lib64/libc-2.12.so]
> ]
> 
> __check_pf() is in glibc - sysdeps/unix/sysv/linux/check_pf.c, and it
> does directly (not through libnl) call socket(PF_NETLINK, SOCK_RAW,
> NETLINK_ROUTE), set the nladdr to 0's, then bind() it. In the kernel,
> netlink_bind() uses 0 as an indicator that it should auto-bind,
> preferring the pid of the calling process (i.e. "pid of libvirtd") as
> its nl_pid in the nladdr. This NETLINK socket is used for a short period
> to get a list of interface addresses, and is then closed.
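
For anyone who wants to see the auto-bind behaviour without reading
glibc, here is a minimal Python sketch of the same socket/bind/close
sequence. It is not the glibc code; it assumes Linux and Python's
AF_NETLINK support, and the function name autobind_netlink is made up
for illustration.

```python
import os
import socket


def autobind_netlink():
    """Open a NETLINK_ROUTE socket and bind it with nl_pid = 0.

    Returns (socket, nl_pid), where nl_pid is whatever the kernel chose.
    """
    s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    # Python represents a netlink address as (nl_pid, nl_groups).
    # nl_pid = 0 asks the kernel to auto-bind.
    s.bind((0, 0))
    nl_pid, _groups = s.getsockname()
    return s, nl_pid


if __name__ == "__main__":
    sock, nl_pid = autobind_netlink()
    print("getpid():", os.getpid())
    print("nl_pid chosen by the kernel:", nl_pid)
    sock.close()
```

If nothing else in the process holds a NETLINK_ROUTE socket, the two
numbers should normally match; a second socket opened while the first
is still open gets a different address.
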
> 
> Once main() has started up its other threads, these threads may call
> virSocketAddrParse (and thus __check_pf()) any number of times, creating
> many socket/bind/close cycles of NETLINK sockets. Meanwhile, in the main
> thread, virNetlinkEventServiceStart() is the first function in libvirtd
> to call libnl's nl_handle_alloc(), which mistakenly assumes that it has
> full control over netlink sockets, and that it can assign the address of
> "pid of libvirtd" to this nlhandle. Shortly after that, nl_connect() is
> called, which calls bind() with a *fixed* address of "pid of libvirtd".
> If another thread happens to currently be in a call to __check_pf(), we
> lose the lottery and bind() fails. If not, we win the lottery, bind()
> succeeds, and future calls to bind() by __check_pf() will auto-bind to a
> different address. (libnl assigns subsequent sockets the address
> "pid + (n << 22)", with a maximum of 1024 sockets per process, i.e. it
> will always be positive; auto-binds in the kernel assign the first free
> address found between -2047 and -2,147,483,648, i.e. it will always be
> negative.)
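
The lost lottery can be reproduced in a single thread. This is only a
sketch, assuming Linux and Python's AF_NETLINK support: the auto-bound
socket stands in for __check_pf() and the fixed-address bind stands in
for nl_connect(). The helper bind_netlink is a made-up name, not part
of glibc or libnl.

```python
import errno
import socket


def bind_netlink(nl_pid):
    """Bind a new NETLINK_ROUTE socket to nl_pid (0 means auto-bind).

    Returns (socket, None) on success or (None, errno) on failure.
    """
    s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    try:
        s.bind((nl_pid, 0))
    except OSError as e:
        s.close()
        return None, e.errno
    return s, None


if __name__ == "__main__":
    # Stand-in for __check_pf(): auto-bind, which normally takes "pid".
    first, _ = bind_netlink(0)
    taken = first.getsockname()[0]

    # Stand-in for nl_connect(): insist on the address already in use.
    second, err = bind_netlink(taken)
    print("fixed bind while address is held:", errno.errorcode.get(err, err))

    # Once the short-lived socket is closed, the same fixed bind works.
    first.close()
    third, err = bind_netlink(taken)
    print("fixed bind after close:", "ok" if third else errno.errorcode.get(err, err))
    if third:
        third.close()
```

In libvirtd the two binds happen in different threads, so whether the
fixed bind lands inside or outside that short window is a matter of
timing.
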
> 
> So, the conclusions to draw from this analysis are:
> 
> 1) my "alternative 1" patch was only coincidentally succeeding, and
> would be about as useful as everyone removing their shoes at airport
> security checkpoints.
> 
> 2) If libvirtd has multiple threads started up before any netlink
> sockets have been bound to "pid of libvirtd", there is a possibility
> that the first call to nl_connect will fail (due to another thread being
> in getaddrinfo/__check_pf()). This is just as true for the macvtap and
> netcf uses of libnl as for the virNetlinkEventService use.
> 
> 3) Once the first call to nl_connect is successfully completed (and/or
> if an extra (and otherwise unused) nlhandle is created with
> nl_handle_alloc() before creating any nlhandles that are subsequently
> nl_connect()ed), the likelihood of a subsequent nl_connect() failure is
> effectively 0, since the address space used by libnl is all positive 32
> bit numbers, and the address space used by the auto-bind address in the
> kernel is (almost) all negative 32 bit numbers.
> 
> 4) libnl should, at the very least, be modified to not use exactly
> nl_pid = pid, since there is a very high likelihood that particular
> address will already be taken by a library function that is calling bind
> directly, rather than through libnl. Really, its API shouldn't allow
> applications to retrieve the bind address used until after nl_connect()
> has already completed successfully; unfortunately, that would require an
> incompatible change in the API.
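
To make the positive/negative point in (3) concrete, here is a bit of
arithmetic in Python. The formula "pid + (n << 22)" and the kernel
range are taken from the description in this thread and have not been
checked against the libnl or kernel sources; the helper names are made
up.

```python
def as_s32(value):
    """Interpret a 32-bit netlink address as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def libnl_style_address(pid, n):
    """Address of the n-th handle in a process, per "pid + (n << 22)"."""
    return (pid + (n << 22)) & 0xFFFFFFFF


if __name__ == "__main__":
    pid = 23876  # the pid from the backtrace in this thread
    for n in range(4):
        addr = libnl_style_address(pid, n)
        print(n, addr, as_s32(addr))
    # An address from the kernel's auto-bind range reads as negative.
    print(as_s32(0xFFFFF801))
```

Going by this formula alone, the sum stays below 2^31 only while
n < 512 and pid < 2^22, so "always positive" holds for the first 512
handles in a process, which is far more than libvirtd is likely to
create.
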
> 
> Now that I completely understand the problem, I actually think that
> neither of these patches is quite correct; the first because it is
> simply bogus, and the second because it only solves the problem for
> virNetlinkEventService - it still leaves open the possibility that
> macvtap or netcf usage of libnl could result in a failure (although
> *only* if one of those uses happened to be called prior to
> virNetlinkEventService).
> 
> To be 100% safe, I think what we need to do is put an extra call to
> nl_handle_alloc() very early in main, prior to calling
> virNetServerNew(), which is when all the other worker threads are
> created. I'll put together such a patch and send it to the list later
> tonight.
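
A toy model of why an extra early handle helps, in Python rather than
against the real libnl. It assumes, per the analysis in this thread,
that each new handle takes the lowest unused n in "pid + (n << 22)".
HandleAllocator and its method are hypothetical and are not libnl API.

```python
class HandleAllocator:
    """Toy stand-in for libnl's per-process address assignment."""

    MAX_HANDLES = 1024

    def __init__(self, pid):
        self.pid = pid
        self.used = set()

    def alloc(self):
        """Return the address for a new handle: pid + (n << 22), lowest free n."""
        for n in range(self.MAX_HANDLES):
            if n not in self.used:
                self.used.add(n)
                return self.pid + (n << 22)
        raise RuntimeError("no free handle slots")


if __name__ == "__main__":
    allocator = HandleAllocator(pid=23876)

    # Very early in main(), before any worker threads exist:
    placeholder = allocator.alloc()  # takes n = 0, i.e. exactly "pid"

    # Later users (event service, macvtap, ...) get n >= 1.
    event_service = allocator.alloc()
    print(placeholder, event_service)
```

The placeholder is never connected, so it never calls bind() on "pid"
and cannot collide with a thread inside __check_pf(); it only stops
that address from being handed to a handle that will be connected.
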

Wow, thanks for figuring this out. It is all far worse than I imagined :-(

Clearly libnl is broken here, but I guess it is dead upstream in favour
of libnl3. I wonder if that shares the same problem.

Agree that creating a netlink handle in libvirtd main() sounds like a
way to work around it.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
