Re: Is there a working cache for path record and lids etc for librdmacm?

On Tue, 17 Nov 2020, Jens Domke wrote:

> I have used ibacm successfully years ago (think somewhere in the
> 2013-2015 timeframe) but abandoned the approach because some
> measurements indicated that using OpenMPI with rdmacm had a big
> runtime overhead compared to using OpenMPI+oob (Mellanox was
> informed but I'm unsure how much has changed until now)

Mellanox does not support ibacm, but OK. Thanks. Good to know of someone
who has actually used it.

> > Is there something that can locally cache the results of the SM queries to
> > avoid additional requests?
>
> Not that I know of, but others might know better. Maybe try contacting
> Sean Hefty (driver behind ibacm) directly if he missed your email here
> on the list.


I have talked to Ira Weiny, who was the last one to make major changes to
the source, but he does not know of any alternative solution.

> > We have tried IBACM but the address resolution does not work on it. It is
> > unable to complete a request for any address resolution and leaves kernel
> > threads that never terminate instead.
>
> Setting up ibacm was/is painful, maybe you could verify that it works on
> a test bed with lowlevel rdmacm tools to debug with ping-pong, etc.

That was done and the bug was confirmed. There is bitrot in the MAD
communication layer.

> Furthermore, another thing I learned the hard way was that a cold cache
> can overwhelm opensm as well. So, if you deploy ibacm, you have to make
> sure that not too many requests go to the local ibacm on too many nodes
> simultaneously right after starting ibacm service, otherwise having all
> nodes sending numerous requests to opensm could timeout -> could be the
> reason for your stalled kernel threads.

Right. But our cluster only has around 200 nodes at most, so that should be fine.
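
For a larger fabric, the staggering suggested above could be sketched
roughly as follows. This is only an illustration, not something ibacm
provides: the function name startup_delay and the 60-second window are
made up for the example. Each node derives a stable delay from its own
hostname and waits that long before the service is started, so the
cold-cache queries to opensm are spread over the window instead of
arriving all at once.

```python
import hashlib


def startup_delay(hostname: str, window_s: float = 60.0) -> float:
    """Map a hostname to a stable delay in [0, window_s).

    Hypothetical helper: the name and the default window are assumptions
    made for this sketch, not part of ibacm or librdmacm.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Keep the top 53 bits so the fraction is an exact float in [0, 1).
    fraction = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return fraction * window_s


# Possible use on each node, before the service is started:
#   import socket, time
#   time.sleep(startup_delay(socket.gethostname()))
```

Hashing the hostname rather than drawing a random number keeps the delay
the same across reboots, which makes the startup order reproducible when
debugging.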

> (another explanation is obviously a bug in ibacm and/or incompatibility
> to newer versions of librdmacm or opensm or other IB libs)
>
> Sorry, that I cannot provide more specific and direct help, but maybe my
> pointers can help you solve the issue.

Thanks.



