On Thu, Apr 11, 2019 at 07:22:35AM -0700, Dennis Dalessandro wrote: > When IPoIB receives an SM LID change event, it reacts by flushing its > path record cache and rejoining multicast groups. This is the same > behavior it performs when it receives a reregistration event. This > behavior is unnecessary as an SM may have database backup or > synchronization mechanisms which permit the SM location or LID to change > without loss of multicast membership and without impact to path records. > > Both opensm and the OPA FM issue reregistration events if a new SM is > started (or restarted with a new config) or an SM event occurs which > results in loss of multicast membership records by the SM (such as > opensm failover) or the SM encounters new nodes with Active ports (such > as after joining 2 fabrics by connecting switches via ISLs). Hence this > event can be depended on as the trigger for IPoIB cache and multicast > flushing. > > It appears that some drivers, such as qib, and hfi1 issue the > IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not. > Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed > that Mellanox EDR HCAs do not generate SM change events and that opensm > does generate reregistration. > > An SM LID change event is generated by the mentioned drivers to reflect > that sm_lid and/or sm_sl in the local port info has changed. The intent > of this event is to permit applications and ULPs which have a local copy > of this information (or an address handle using it) to update their > information. > > The intent is that the reregistration event (caused by the SM via a bit > in Set(PortInfo)) be used to inform nodes that they need to rejoin > multicast groups, resubscribe for notices and potentially update path > records. > > When an SM migrates or fails over, a SM LID change event can occur. In > response IPoIB discards path records and multicast membership and loses > connectivity until these records are restored via SA requests. In very > large fabrics, it may take minutes for the SM to be ready and for the SA > responses to be supplied. This can result in undesirable and > unnecessary IPoIB connectivity impacts. It also can result in an > unnecessary storm of SA queries from all nodes in a cluster potentially > followed by yet another storm if the SM issues the reregistration > request. > > The fact the Mellanox HCAs do not even generate this event, is further > evidence that on modern IB fabrics there will be no ill side effects > from the proposed changes below to reduce the reaction by 3 kernel > components to this event. So these changes should be benign for Mellanox > IB fabrics and will benefit OPA fabrics while also making ib_core and > ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event > APIs. > > Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib. > IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do > anything on SM LID change. IPoIB makes use of other ib_core components > to issue SA requests for it and those components correctly track SM LID > and SM LID changes. > > Also in ib_core multicast handling, remove the test for > IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the > error state, which will trigger rejoins. This code is used by IPoIB as > well as the connection manager and other clients of multicast groups. > This kernel module centralizes group membership status and joins since a > node can only join a given group once but multiple ULPs or applications > may want to join the same group. It makes use of the sa_query.c > component in ib_core, which correctly trackes SM LID and SL. This > component does not track SM LID nor SL itself and hence need not react > to their changes. > > Similarly in the ib_core cache code remove the handling for the > IB_EVENT_SM_CHANGE. In this function. The ib_cache_update function > which is ultimately called is updating local copies of the pkey table, > gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As > such it does not need to be called on an SM LID change. It technically > also does not need to be called on a reregistration. The LID_CHANGE, > PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR, > PORT_ACTICE) should be sufficient triggers. > > It is worth noting that the alternative of simply having the hfi1 and > qib drivers not generate the SM LID change event was explored. While > this would duplicate what Mellanox drivers do now, it is not the correct > behavior and removes the ability for an SM to migrate without requiring > reregistration. Since both opensm and OPA SM have mechanisms to backup > or synchronize registration information, it is desirable to let them > perform SM migrations (with LID or SL changes) without requiring > reregistration when they deem it appropriate. > > Suggested-by: Todd Rimmer <todd.rimmer@xxxxxxxxx> > Tested-by: Michael Brooks <michael.brooks@xxxxxxxxx> > Reviewed-by: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx> > Reviewed-by: Todd Rimmer <todd.rimmer@xxxxxxxxx> > Signed-off-by: Dennis Dalessandro <dennis.dalessandro@xxxxxxxxx> > --- > drivers/infiniband/core/cache.c | 1 - > drivers/infiniband/core/multicast.c | 1 - > drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 3 +-- > 3 files changed, 1 insertions(+), 4 deletions(-) Applied to for-next, thanks Jason