> Hi folks,
>
> I have 2 Linux machines with CX-5 cards (Mellanox MCX515A-CCAT (one
> port)) and krping doesn't work in one direction but works in the other.
> rping works in both directions. ib_send_bw works in both directions and
> displays 39Gb one way and 36Gb the other way on a 40Gb setup.
>
> krping is upstream commit 4df520c888d80e5370d0f58b2eeac8355e3f2286.
>
> Server is started with:
> [kolga@localhost krping]$ sudo echo "server,port=9999,addr=172.20.35.191,count=10,verbose" > /proc/krping
> And it displays in /var/log/messages:
> Jan 4 14:23:29 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
> Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 4 14:23:29 localhost kernel: 00000000 93003204 10000122 0005bfd2
> Jan 4 14:23:29 localhost kernel: krping: cq completion failed with wr_id 0 status 4 opcode 128 vender_err 32
> Jan 4 14:23:29 localhost kernel: krping: cq completion in ERROR state
> Jan 4 14:23:29 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
>
> Client is run with:
> [kolga@sti-rx200-231-d1 ~]$ sudo echo "client,addr=172.20.35.191,port=9999,verbose,count=10" > /proc/krping
> And in /var/log/messages:
> Jan 4 14:19:27 localhost kernel: krping: DISCONNECT EVENT...
> Jan 4 14:19:27 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
> Jan 4 14:19:28 localhost kernel: krping: cq completion in ERROR state
>
> On the network trace I see (over RRoCE):
> CM: ConnectRequest
> CM: ConnectReply
> CM: ReadyToUse
> RC Send Only QP
> RC Ack
> RC RDMA Read Request
> RC RDMA Read Response Only
> CM: DisconnectRequest
> CM: DisconnectReply
>
> I have previously submitted it to Mellanox but they told me to
> resubmit to the linux-rdma list. They also said that engineering did look
> at the CQE error and the meaning of it was:
> PD (protection domain) violation - error in fetch data in rxs in pd
> (send opcodes/ read respond / atomic ack).

Hey Olga,

Are the two machines running the same kernel version and distro software, and do they have the same hardware (CPU, motherboard, memory, etc.)? If not, what is different about them?

Is it the krping server that sees the CQ error?

Do other RDMA devices work on these systems?

Thanks,

Steve.
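
For reference, the numeric status in the krping completion message quoted above can be mapped to its kernel name. The following is a minimal sketch in Python, assuming the ordering of enum ib_wc_status in the kernel's include/rdma/ib_verbs.h; the helper name decode_krping_cqe_line and the regular expression are hypothetical and only illustrate the lookup. Under that assumption, status 4 is IB_WC_LOC_PROT_ERR (local protection error), which appears consistent with the "PD violation" reading reported by Mellanox.

```python
import re

# Names in the order of the kernel's enum ib_wc_status (assumed from
# include/rdma/ib_verbs.h); the list index is the numeric status value.
IB_WC_STATUS = [
    "IB_WC_SUCCESS",
    "IB_WC_LOC_LEN_ERR",
    "IB_WC_LOC_QP_OP_ERR",
    "IB_WC_LOC_EEC_OP_ERR",
    "IB_WC_LOC_PROT_ERR",
    "IB_WC_WR_FLUSH_ERR",
    "IB_WC_MW_BIND_ERR",
    "IB_WC_BAD_RESP_ERR",
    "IB_WC_LOC_ACCESS_ERR",
    "IB_WC_REM_INV_REQ_ERR",
    "IB_WC_REM_ACCESS_ERR",
    "IB_WC_REM_OP_ERR",
    "IB_WC_RETRY_EXC_ERR",
    "IB_WC_RNR_RETRY_EXC_ERR",
]


def decode_krping_cqe_line(line):
    """Hypothetical helper: pull status/opcode out of a krping log line."""
    m = re.search(r"status (\d+) opcode (\d+) vender_err (\w+)", line)
    if m is None:
        return None
    status = int(m.group(1))
    if 0 <= status < len(IB_WC_STATUS):
        name = IB_WC_STATUS[status]
    else:
        name = "UNKNOWN"
    return {
        "status": status,
        "status_name": name,
        "opcode": int(m.group(2)),
        "vendor_err": m.group(3),  # kept as text; its numeric base is not assumed
    }


line = ("krping: cq completion failed with wr_id 0 status 4 "
        "opcode 128 vender_err 32")
result = decode_krping_cqe_line(line)
print(result["status_name"])
```

The sketch only decodes the generic verbs status; the vendor-specific syndrome in the mlx5 CQE dump is left untouched, since interpreting it depends on the device's CQE layout.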