> I am trying to add rdmacm support to Platform MPI. I noticed that
> connection setup performance on our test cluster was very poor: for 12
> processes on 12 hosts, creating the n^2 connections takes about 12
> seconds. I also discovered that if I create some TCP sockets and use
> them to ensure that only one process at a time is calling
> rdmacm_connect to a given target, performance changes dramatically and
> I can then connect the 12 processes very quickly (I didn't measure
> exactly, but it is similar to our old rdma code). The order in which I
> connect processes avoids flooding a single target with many
> rdmacm_connects at once, but it is difficult to avoid two processes
> calling rdmacm_connect to the same target at roughly the same time
> except by using my extra TCP socket connections. I haven't looked at
> the MPICH code yet to see whether it has the same issue, but I will try
> that next.
>
> Our test cluster is a bit old:
>
> 09:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0
> 5GT/s - IB QDR / 10GigE] (rev b0)
>
> Is this a known problem? Are you aware of any issues that would shed
> some light on this?

This is the first I've heard of slow connect times. Are you sure that the
time is coming from rdma_connect, versus route or address resolution?
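
If it helps to narrow that down, something along these lines could time the
three stages separately for a single connection. This is only a rough sketch,
not Platform MPI or existing test code: the port number and timeouts are
arbitrary examples, and it assumes a peer is already listening and accepting
on that address/port. It uses a NULL event channel so the rdma_cm calls run
synchronously, and builds with -lrdmacm -libverbs. Running several copies in
parallel against the same target should show whether it is rdma_connect or
the resolution steps that slow down under contention.

/* Rough, hypothetical timing harness (not Platform MPI code): time each
 * rdma_cm stage separately for one connection.  Assumes a peer is already
 * listening and accepting on the given address/port (the default port
 * below is just an example). */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <netdb.h>
#include <rdma/rdma_cma.h>

static double ms(struct timeval *a, struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) * 1000.0 +
           (b->tv_usec - a->tv_usec) / 1000.0;
}

int main(int argc, char **argv)
{
    struct rdma_cm_id *id;
    struct ibv_qp_init_attr qp_attr;
    struct rdma_conn_param param;
    struct addrinfo *res;
    struct timeval t0, t1, t2, t3;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <server> [port]\n", argv[0]);
        return 1;
    }
    if (getaddrinfo(argv[1], argc > 2 ? argv[2] : "7471", NULL, &res))
        return 1;

    /* NULL event channel puts the id into synchronous mode, so each
     * call below blocks until its stage has completed. */
    if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
        return 1;

    gettimeofday(&t0, NULL);
    if (rdma_resolve_addr(id, NULL, res->ai_addr, 2000)) {
        perror("rdma_resolve_addr"); return 1;
    }
    gettimeofday(&t1, NULL);
    if (rdma_resolve_route(id, 2000)) {
        perror("rdma_resolve_route"); return 1;
    }
    gettimeofday(&t2, NULL);

    /* Minimal QP so the connect request carries a valid QP number. */
    memset(&qp_attr, 0, sizeof qp_attr);
    qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 1;
    qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;
    qp_attr.qp_type = IBV_QPT_RC;
    if (rdma_create_qp(id, NULL, &qp_attr)) {
        perror("rdma_create_qp"); return 1;
    }

    memset(&param, 0, sizeof param);
    param.retry_count = 7;
    param.rnr_retry_count = 7;
    if (rdma_connect(id, &param)) {
        perror("rdma_connect"); return 1;
    }
    gettimeofday(&t3, NULL);

    printf("resolve_addr : %.1f ms\n", ms(&t0, &t1));
    printf("resolve_route: %.1f ms\n", ms(&t1, &t2));
    printf("connect      : %.1f ms\n", ms(&t2, &t3));

    rdma_disconnect(id);
    rdma_destroy_qp(id);
    rdma_destroy_id(id);
    freeaddrinfo(res);
    return 0;
}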