On Mon, Feb 14, 2011 at 02:34:28PM -0500, Stefan Berger wrote: > When trying to start / stop a domain with macvtap device (direct > type of interface) having a device description like this one here > > <interface type='direct'> > <source dev='static' mode='vepa'/> > </interface> > > then I see netlink related errors when a 'virsh edit' session is > happening at the same time. Leaving 'virsh edit' active, means there is a virConnectPtr object open in libvirtd. This in turn means there is a netcf instance active in libvirtd. This in turn means there is a netlink socket open by netcf in libvirtd. > So, to reproduce this error you should try the following (on a > kernel supporting macvtap): > > virsh edit <macvtap domain> -> do not terminate the edit sessions > virsh start <macvtap domain> -> works > virsh destroy <macvtap domain> -> leaves a macvtap device due to > nl_connect failing > virsh start <macvtap domain> -> does not start anymore > > That should make it fail. > > The work-around basically keeps on allocating new netlink library > handles until the nl_connect() succeeds. New netlink library handles > cause the next available port (intern to libnl; nl_pid) to be > allocated. For every ongoing virsh edit session, one new handle > seems to be required. So for 2 virsh edit session, the 3rd one > (usually) works. I do not know what in the system is causing this, > but my guess it that 'something' is blocking the same port (nl_pid) > -- it may not be in libvirt but in a dependent library that's not > using libnl (?). This problem sequence seems to imply something crazy like us only being allowed to have one netlink socket open per $PID ??? Is that really correct ? Looking at the kernel code they seem todo some kind of hash based on PID & return EBUSY in that..which suggests there may be some kind of restriction... > > Signed-off-by: Stefan Berger <stefanb@xxxxxxxxxxxxxxxxxx> > > Index: libvirt-acl/src/util/macvtap.c > =================================================================== > --- libvirt-acl.orig/src/util/macvtap.c > +++ libvirt-acl/src/util/macvtap.c > @@ -120,13 +120,43 @@ int nlComm(struct nl_msg *nl_msg, > fd_set readfds; > int fd; > int n; > - struct nl_handle *nlhandle = nl_handle_alloc(); > + struct nl_handle **nlhandles = NULL; > struct nlmsghdr *nlmsg = nlmsg_hdr(nl_msg); > + unsigned int idx = 0, num_elms = 1, i; > > - if (!nlhandle) > - return -1; > +realloc: > + if (VIR_REALLOC_N(nlhandles, num_elms * sizeof(struct nl_handle > *)) < 0) { > + virReportOOMError(); > + rc = -1; > + goto err_exit; > + } > + > + for (i = idx; i < num_elms ; i++) > + nlhandles[i] = NULL; > + > +next_handle: > + nlhandles[idx] = nl_handle_alloc(); > > - if (nl_connect(nlhandle, NETLINK_ROUTE) < 0) { > + if (nlhandles[idx] == NULL) { > + virReportOOMError(); > + rc = -1; > + goto err_exit; > + } > + > + if (nl_connect(nlhandles[idx], NETLINK_ROUTE) < 0) { > + VIR_DEBUG0("Could not create netlink socket - trying a new one\n"); > + /* get a new handle and keep the ones we have */ > + idx++; > + if (idx < num_elms) > + goto next_handle; > + /* need to reallocate */ > + num_elms += 10; > + if (idx < 500) > + goto realloc; > + > + macvtapError(VIR_ERR_INTERNAL_ERROR, "%s [%s]", > + _("Could not create netlink socket"), > + nl_geterror()); > rc = -1; > goto err_exit; > } > @@ -135,15 +165,16 @@ int nlComm(struct nl_msg *nl_msg, > > nlmsg->nlmsg_pid = getpid(); > > - nbytes = nl_send_auto_complete(nlhandle, nl_msg); > + nbytes = nl_send_auto_complete(nlhandles[idx], nl_msg); > if (nbytes < 0) { > virReportSystemError(errno, > - "%s", _("cannot send to netlink socket")); > + "%s [%s]", _("cannot send to netlink socket"), > + nl_geterror()); > rc = -1; > goto err_exit; > } > > - fd = nl_socket_get_fd(nlhandle); > + fd = nl_socket_get_fd(nlhandles[idx]); > > FD_ZERO(&readfds); > FD_SET(fd, &readfds); > @@ -160,7 +191,7 @@ int nlComm(struct nl_msg *nl_msg, > goto err_exit; > } > > - *respbuflen = nl_recv(nlhandle, &nladdr, respbuf, NULL); > + *respbuflen = nl_recv(nlhandles[idx], &nladdr, respbuf, NULL); > if (*respbuflen <= 0) > rc = -1; > > @@ -171,7 +202,12 @@ err_exit: > *respbuflen = 0; > } > > - nl_handle_destroy(nlhandle); > + if (nlhandles) > + for (idx = 0; idx < num_elms; idx++) > + nl_handle_destroy(nlhandles[idx]); > + > + VIR_FREE(nlhandles); > + > return rc; > } This approach feels like a nasty hack to me and potentially still leaves us with a problem in netcf which is also using netlink sockets. I think we need to get a clearer picture of what the root cause is before going for this kind of patch Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list