Re: [PATCH for-next] IB/core, ipoib: Do not overreact to SM LID change event

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Apr 11, 2019 at 07:22:35AM -0700, Dennis Dalessandro wrote:
> When IPoIB receives an SM LID change event, it reacts by flushing its
> path record cache and rejoining multicast groups. This is the same
> behavior it performs when it receives a reregistration event. This
> behavior is unnecessary as an SM may have database backup or
> synchronization mechanisms which permit the SM location or LID to change
> without loss of multicast membership and without impact to path records.
> 
> Both opensm and the OPA FM issue reregistration events if a new SM is
> started (or restarted with a new config) or an SM event occurs which
> results in loss of multicast membership records by the SM (such as
> opensm failover) or the SM encounters new nodes with Active ports (such
> as after joining 2 fabrics by connecting switches via ISLs). Hence this
> event can be depended on as the trigger for IPoIB cache and multicast
> flushing.
> 
> It appears that some drivers, such as qib, and hfi1 issue the
> IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not.
> Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed
> that Mellanox EDR HCAs do not generate SM change events and that opensm
> does generate reregistration.
> 
> An SM LID change event is generated by the mentioned drivers to reflect
> that sm_lid and/or sm_sl in the local port info has changed. The intent
> of this event is to permit applications and ULPs which have a local copy
> of this information (or an address handle using it) to update their
> information.
> 
> The intent is that the reregistration event (caused by the SM via a bit
> in Set(PortInfo)) be used to inform nodes that they need to rejoin
> multicast groups, resubscribe for notices and potentially update path
> records.
> 
> When an SM migrates or fails over, a SM LID change event can occur. In
> response IPoIB discards path records and multicast membership and loses
> connectivity until these records are restored via SA requests. In very
> large fabrics, it may take minutes for the SM to be ready and for the SA
> responses to be supplied.  This can result in undesirable and
> unnecessary IPoIB connectivity impacts. It also can result in an
> unnecessary storm of SA queries from all nodes in a cluster potentially
> followed by yet another storm if the SM issues the reregistration
> request.
> 
> The fact the Mellanox HCAs do not even generate this event, is further
> evidence that on modern IB fabrics there will be no ill side effects
> from the proposed changes below to reduce the reaction by 3 kernel
> components to this event. So these changes should be benign for Mellanox
> IB fabrics and will benefit OPA fabrics while also making ib_core and
> ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event
> APIs.
> 
> Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib.
> IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do
> anything on SM LID change. IPoIB makes use of other ib_core components
> to issue SA requests for it and those components correctly track SM LID
> and SM LID changes.
> 
> Also in ib_core multicast handling,  remove the test for
> IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the
> error state, which will trigger rejoins. This code is used by IPoIB as
> well as the connection manager and other clients of multicast groups.
> This kernel module centralizes group membership status and joins since a
> node can only join a given group once but multiple ULPs or applications
> may want to join the same group. It makes use of the sa_query.c
> component in ib_core, which correctly trackes SM LID and SL. This
> component does not track SM LID nor SL itself and hence need not react
> to their changes.
> 
> Similarly in the ib_core cache code remove the handling for the
> IB_EVENT_SM_CHANGE.  In this function. The ib_cache_update function
> which is ultimately called is updating local copies of the pkey table,
> gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As
> such it does not need to be called on an SM LID change. It technically
> also does not need to be called on a reregistration. The LID_CHANGE,
> PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR,
> PORT_ACTICE) should be sufficient triggers.
> 
> It is worth noting that the alternative of simply having the hfi1 and
> qib drivers not generate the SM LID change event was explored. While
> this would duplicate what Mellanox drivers do now, it is not the correct
> behavior and removes the ability for an SM to migrate without requiring
> reregistration. Since both opensm and OPA SM have mechanisms to backup
> or synchronize registration information, it is desirable to let them
> perform SM migrations (with LID or SL changes) without requiring
> reregistration when they deem it appropriate.
> 
> Suggested-by: Todd Rimmer <todd.rimmer@xxxxxxxxx>
> Tested-by: Michael Brooks <michael.brooks@xxxxxxxxx>
> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx>
> Reviewed-by: Todd Rimmer <todd.rimmer@xxxxxxxxx>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@xxxxxxxxx>
> ---
>  drivers/infiniband/core/cache.c            |    1 -
>  drivers/infiniband/core/multicast.c        |    1 -
>  drivers/infiniband/ulp/ipoib/ipoib_verbs.c |    3 +--
>  3 files changed, 1 insertions(+), 4 deletions(-)

Applied to for-next, thanks

Jason



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux