> I am trying to add rdmacm support to Platform MPI. I noticed that
> connection setup performance on our test cluster was very poor: for 12
> processes on 12 hosts, creating the n^2 connections takes about 12
> seconds. I also discovered that if I create some TCP sockets and use
> them to ensure that only one process at a time is calling
> rdmacm_connect to a given target, performance changes dramatically and
> I can then connect the 12 processes very quickly (I didn't measure
> exactly, but it is similar to our old rdma code). The order in which I
> connect processes avoids flooding a single target with many
> rdmacm_connects at once, but it is difficult to avoid two processes
> calling rdmacm_connect to the same target at roughly the same time
> except by using my extra TCP socket connections. I haven't looked at
> the MPICH code yet to see whether it has the same issue, but I will try
> that next.
>
> Our test cluster is a bit old:
>
> 09:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0
> 5GT/s - IB QDR / 10GigE] (rev b0)
>
> Is this a known problem? Are you aware of any issues that would shed
> some light on this?

This is the first I've heard of slow connect times. Are you sure that the
time is coming from rdma_connect, versus route or address resolution?
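
If it helps to narrow that down, something along these lines could time the
three stages separately for a single connection. This is only a rough sketch,
not Platform MPI or existing test code: the port number and timeouts are
arbitrary examples, and it assumes a peer is already listening and accepting
on that address/port. It uses a NULL event channel so the rdma_cm calls run
synchronously, and builds with -lrdmacm -libverbs. Running several copies in
parallel against the same target should show whether it is rdma_connect or
the resolution steps that slow down under contention.

/* Rough, hypothetical timing harness (not Platform MPI code): time each
 * rdma_cm stage separately for one connection.  Assumes a peer is already
 * listening and accepting on the given address/port (the default port
 * below is just an example). */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <netdb.h>
#include <rdma/rdma_cma.h>

static double ms(struct timeval *a, struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) * 1000.0 +
           (b->tv_usec - a->tv_usec) / 1000.0;
}

int main(int argc, char **argv)
{
    struct rdma_cm_id *id;
    struct ibv_qp_init_attr qp_attr;
    struct rdma_conn_param param;
    struct addrinfo *res;
    struct timeval t0, t1, t2, t3;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <server> [port]\n", argv[0]);
        return 1;
    }
    if (getaddrinfo(argv[1], argc > 2 ? argv[2] : "7471", NULL, &res))
        return 1;

    /* NULL event channel puts the id into synchronous mode, so each
     * call below blocks until its stage has completed. */
    if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
        return 1;

    gettimeofday(&t0, NULL);
    if (rdma_resolve_addr(id, NULL, res->ai_addr, 2000)) {
        perror("rdma_resolve_addr"); return 1;
    }
    gettimeofday(&t1, NULL);
    if (rdma_resolve_route(id, 2000)) {
        perror("rdma_resolve_route"); return 1;
    }
    gettimeofday(&t2, NULL);

    /* Minimal QP so the connect request carries a valid QP number. */
    memset(&qp_attr, 0, sizeof qp_attr);
    qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 1;
    qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;
    qp_attr.qp_type = IBV_QPT_RC;
    if (rdma_create_qp(id, NULL, &qp_attr)) {
        perror("rdma_create_qp"); return 1;
    }

    memset(&param, 0, sizeof param);
    param.retry_count = 7;
    param.rnr_retry_count = 7;
    if (rdma_connect(id, &param)) {
        perror("rdma_connect"); return 1;
    }
    gettimeofday(&t3, NULL);

    printf("resolve_addr : %.1f ms\n", ms(&t0, &t1));
    printf("resolve_route: %.1f ms\n", ms(&t1, &t2));
    printf("connect      : %.1f ms\n", ms(&t2, &t3));

    rdma_disconnect(id);
    rdma_destroy_qp(id);
    rdma_destroy_id(id);
    freeaddrinfo(res);
    return 0;
}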