On Thu, Feb 20, 2025 at 6:05 PM Jacob Moroni <jmoroni@xxxxxxxxxx> wrote:
>
> In workloads where there are many processes establishing
> connections using RDMA CM in parallel (large scale MPI),
> there can be heavy contention for mad_agent_lock in
> cm_alloc_msg.
>
> This contention can occur while inside of a spin_lock_irq
> region, leading to interrupts being disabled for extended
> durations on many cores. Furthermore, it leads to the
> serialization of rdma_create_ah calls, which has negative
> performance impacts for NICs which are capable of processing
> multiple address handle creations in parallel.
>
> The end result is the machine becoming unresponsive, hung
> task warnings, netdev TX timeouts, etc.
>
> Since the lock appears to be only for protection from
> cm_remove_one, it can be changed to a rwlock to resolve
> these issues.
>
> Reproducer:
>
> Server:
> for i in $(seq 1 512); do
> ucmatose -c 32 -p $((i + 5000)) &
> done
>
> Client:
> for i in $(seq 1 512); do
> ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
> done
>
> Fixes: 76039ac9095f5ee5 ("IB/cm: Protect cm_dev, cm_ports and
> mad_agent with kref and lock")

Fixes: tag should be on a single line.

> Signed-off-by: Jacob Moroni <jmoroni@xxxxxxxxxx>
> ---

It seems your patch is mangled. Can you use "git send-email" to resend it?