On Tue, Jan 29, 2019 at 09:44:05AM +0100, Håkon Bugge wrote: > When an IB port has been brought back to Active state, after being > down, ibacm gets an event about it. It will then (re) enumerate the > devices, and does so by executing an ioctl with SIOCGIFCONF. This > particular ioctl will only return interfaces that are "running". > > There may be a delay after the IB port becomes Active until its > address has been provisioned, and becomes "running". If ibacm attempts > to associate IPoIB interfaces to the port during this interval, it > will not see the interface because it is not "running". > > Later, when ibacm is asked for a Path Record (PR) using the IP address > of the resurrected IPoIB interface, it will not be able to find the > associated end-point (EP), and the following is printed in the log: > > acm_svr_resolve_path: notice - unknown local end point address > > The bug can be provoked by the following script. We have a single HCA > with two ports, the IPoIB interfaces are named stib{0,1}, the IP > address of the first interface is 192.168.200.200, and the remote IP > address is 192.168.200.202. The LID of the IB switch is 1 and the > switch port number connected to port 1 of the HCA is 22. > > <script> > #!/bin/bash > > ibportstate -P 2 1 22 disable > > # move the IP address > ip addr del 192.168.200.200/24 dev stib0 > ip addr add 192.168.200.200/24 dev stib1 > > ibportstate -P 2 1 22 enable > > # Wait until port becomes active again > while [[ $(ibstat|grep State:|grep -c Active) != 2 ]]; do > echo -n "."; sleep 1 > done > echo > > # give ibacm time to re-enumerate the interfaces > sleep 1 > > # move the IP address back > ip addr del 192.168.200.200/24 dev stib1 > ip addr add 192.168.200.200/24 dev stib0 > > # take the other port down, so we are sure we use stib0 > ip link set down dev stib1 > > # Start a utility requesting a PR > qperf 192.168.200.202 -cm1 rc_bw > > # check for failure > grep "unknown local end point" ibacm.log > > # restore the state > ip link set up dev stib1 > </script> > > This fix depends on the commit that re-factors the use of ioctl in > acm_if_iter(), and instead uses netlink. Now, by reducing the > requirements of the state of the interface, the EP is added, and > afterwards, when an address is assigned, it is associated with the EP. > > This commit is a new implementation of > https://patchwork.kernel.org/patch/10748357, which was NAKed. > > Signed-off-by: Håkon Bugge <haakon.bugge@xxxxxxxxxx> Reviewed-by: Ira Weiny <ira.weiny@xxxxxxxxx> > > v1 -> v2: > * Removed Gerrit's Change-Id tag (Håkon) > --- > ibacm/src/acm_util.c | 14 +++----------- > 1 file changed, 3 insertions(+), 11 deletions(-) > > diff --git a/ibacm/src/acm_util.c b/ibacm/src/acm_util.c > index fecb6c89..8807579d 100644 > --- a/ibacm/src/acm_util.c > +++ b/ibacm/src/acm_util.c > @@ -180,19 +180,13 @@ static void acm_if_iter(struct nl_object *obj, void *_ctx_and_cb) > uint16_t pkey; > int addr_len; > char *label; > - int flags; > - int ret; > int af; > > link = rtnl_link_get(link_cache, rtnl_addr_get_ifindex(addr)); > - flags = rtnl_link_get_flags(link); > > if (rtnl_link_get_arptype(link) != ARPHRD_INFINIBAND) > return; > > - if (!(flags & IFF_RUNNING)) > - return; > - > if (!(a = rtnl_addr_get_local(addr))) > return; > > @@ -206,20 +200,18 @@ static void acm_if_iter(struct nl_object *obj, void *_ctx_and_cb) > return; > > label = rtnl_addr_get_label(addr); > - if (!label) > - return; > > link_addr = rtnl_link_get_addr(link); > + /* gid has a 4 byte offset into the link address */ > memcpy(sgid.raw, nl_addr_get_binary_addr(link_addr) + 4, sizeof(sgid)); > > - ret = acm_if_get_pkey(rtnl_link_get_name(link), &pkey); > - if (ret) > + if (acm_if_get_pkey(rtnl_link_get_name(link), &pkey)) > return; > > acm_log(2, "name: %5s label: %9s index: %2d flags: %s addr: %s pkey: 0x%04x guid: 0x%lx\n", > rtnl_link_get_name(link), label, > rtnl_addr_get_ifindex(addr), > - rtnl_link_flags2str(flags, flags_str, sizeof(flags_str)), > + rtnl_link_flags2str(rtnl_link_get_flags(link), flags_str, sizeof(flags_str)), > nl_addr2str(a, ip_str, sizeof(ip_str)), pkey, > be64toh(sgid.global.interface_id)); > > -- > 2.20.1 >