This patch is one alternative for solving the problem detailed in:

  https://bugzilla.redhat.com/show_bug.cgi?id=816465

Some other unidentified library in use by libvirtd (in another thread)
is apparently temporarily binding to a NETLINK_ROUTE raw socket with
an address of "pid of libvirtd" during startup. This is the same
address used by libnl for the first netlink socket it binds, and the
netlink socket allocated for virNetlinkEventServiceStart() happens to
be that first socket; the result is that nl_connect() fails about
15-20% of the time (but apparently only if there is a guest running at
the time libvirtd starts).

Testing has shown that when nl_connect() fails the first time,
retrying it after a 500msec sleep leads to success 100% of the time,
so this patch doubles that delay to 1000msec (which also has a 100%
success rate).

An alternate patch is to allocate an extra nl_handle that will never
be used, thus effectively "reserving" the "pid of libvirtd" address
for the mystery library. I will be sending that in a separate patch so
everyone has the chance to choose.

(Note that a similar-looking problem came up over a year ago with the
libnl usage by macvtap code. At that time Stefan Berger found bugs in
libnl itself. These new errors are encountered while using the patched
libnl; the main problem remaining in libnl is with the semantics of
the API, which assumes that libnl is the only entity on the system (or
at least in the current process) using netlink sockets, and can thus
make an assumption about what address to use for binding.)
---
 src/util/virnetlink.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c
index b2e9d51..b9dae86 100644
--- a/src/util/virnetlink.c
+++ b/src/util/virnetlink.c
@@ -355,9 +355,18 @@ virNetlinkEventServiceStart(void)
     }
 
     if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) {
-        virReportSystemError(errno,
-                             "%s", _("cannot connect to netlink socket"));
-        goto error_server;
+        /* the address that libnl wants to use for this connect ("pid
+         * of libvirtd") is sometimes temporarily in use by some other
+         * unidentified code. Retrying after a 500msec sleep has
+         * achieved 100% success rates, so we sleep for 1000msec and
+         * retry.
+         */
+        usleep(1000000);
+        if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) {
+            virReportSystemError(errno,
+                                 "%s", _("cannot connect to netlink socket"));
+            goto error_server;
+        }
     }
 
     fd = nl_socket_get_fd(srv->netlinknh);
-- 
1.7.10
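
For illustration only, the retry-once pattern used in the hunk above can be
sketched in isolation, without libnl. The names connect_fn,
connect_retry_once, sleep_msec, struct fake_conn and fake_connect are all
hypothetical and are not part of libvirt or libnl; the callback merely
stands in for a call such as nl_connect(), and nanosleep() is used here in
place of usleep() as an assumption of this sketch, not something the patch
does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#endif
#include <time.h>

/* Hypothetical callback type: returns 0 on success and a negative value
 * on failure, mirroring the "< 0 means failure" check in the patch. */
typedef int (*connect_fn)(void *ctx);

/* Sleep for roughly msec milliseconds (an early wakeup on a signal is
 * ignored; this is a sketch only). */
static void sleep_msec(long msec)
{
    struct timespec ts;
    ts.tv_sec = msec / 1000;
    ts.tv_nsec = (msec % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Attempt the connect; if the first attempt fails, wait delay_msec and
 * try exactly one more time. Returns the result of the last attempt. */
int connect_retry_once(connect_fn do_connect, void *ctx, long delay_msec)
{
    int rc = do_connect(ctx);

    if (rc < 0) {
        sleep_msec(delay_msec);
        rc = do_connect(ctx);
    }
    return rc;
}

/* Hypothetical stand-in for a connection whose address is temporarily
 * in use: the first failures_left attempts fail, later ones succeed. */
struct fake_conn {
    int failures_left;
    int calls;
};

int fake_connect(void *ctx)
{
    struct fake_conn *c = ctx;

    c->calls++;
    if (c->failures_left > 0) {
        c->failures_left--;
        return -1;
    }
    return 0;
}
```

With a fake connection that fails once, connect_retry_once() succeeds on the
second attempt; with one that fails twice, it gives up after two attempts,
which corresponds to the "goto error_server" path in the patch. In the
patch itself the delay is fixed at 1000msec and the retry is written
inline rather than through a helper.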