Re: krping problem on 4.15-rc4

On Wed, Jan 10, 2018 at 3:10 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> wrote:
>> Hi folks,
>>
>> I have 2 Linux machines with CX-5 cards (Mellanox MCX515A-CCAT (one
>> port)) and krping doesn't work in one direction but works in the other.
>> rping works in both directions. ib_send_bw works in both directions and
>> displays 39Gb one way and 36Gb the other way on a 40Gb setup.
>>
>> krping is upstream commit 4df520c888d80e5370d0f58b2eeac8355e3f2286.
>>
>> Server is started with: [kolga@localhost krping]$ sudo echo
>> "server,port=9999,addr=172.20.35.191,count=10,verbose" > /proc/krping
>> And it displays in /var/log/messages:
>> Jan 4 14:23:29 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
>> Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 4 14:23:29 localhost kernel: 00000000 93003204 10000122 0005bfd2
>> Jan 4 14:23:29 localhost kernel: krping: cq completion failed with
>> wr_id 0 status 4 opcode 128 vender_err 32
>> Jan 4 14:23:29 localhost kernel: krping: cq completion in ERROR state
>> Jan 4 14:23:29 localhost kernel: krping: wait for RDMA_READ_COMPLETE state
>> 10
>>
>> Client is run with: [kolga@sti-rx200-231-d1 ~]$ sudo echo
>> "client,addr=172.20.35.191,port=9999,verbose,count=10" > /proc/krping
>> And in /var/log/messages:
>> Jan 4 14:19:27 localhost kernel: krping: DISCONNECT EVENT...
>> Jan 4 14:19:27 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
>> Jan 4 14:19:28 localhost kernel: krping: cq completion in ERROR state
>>
>> On the network trace I see (over RRoCE):
>> CM: ConnectRequest
>> CM: ConnectReply
>> CM: ReadyToUse
>> RC Send Only QP
>> RC Ack
>> RC RDMA Read Request
>> RC RDMA Read Response Only
>> CM: DisconnectRequest
>> CM: DisconnectReply
>>
>> I previously submitted this to Mellanox, but they told me to
>> resubmit it to the linux-rdma list. They also said that engineering
>> looked at the CQE error and that its meaning was:
>> PD (protection domain) violation - error in fetch data in rxs in pd
>> (send opcodes/ read respond / atomic ack).
>
> Hey Olga,
>
> Are the machines the same kernel version / distro sw / and hw - cpu/motherboard/memory/etc?  If not, what is different about them?  Is it the krping server that sees the CQ error?  Do other rdma devices work on these systems?

Hi Steve,

The machines run the same software: same kernel version (4.15-rc4) and
same distro (RHEL 7.4). The hardware is also the same (PRIMERGY RX200 S7),
except that one machine has 8G less memory than the other (64G vs 56G).
The krping error was on the server. Each machine has only one CX-5 and
no other RDMA devices.
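
For reference, the error CQE dump from the server log quoted above can
be split into fields by hand. Below is a minimal sketch in Python. It
assumes the 64 dumped bytes follow the struct mlx5_err_cqe layout from
the kernel's include/linux/mlx5/device.h and that the dump prints the
words in big-endian byte order; both are assumptions, not something
confirmed in this thread. The helper name decode_err_cqe is
hypothetical.

```python
import struct

# The four "dump error cqe" lines from the server log above, as 16 hex words.
DUMP = (
    "00000000 00000000 00000000 00000000 "
    "00000000 00000000 00000000 00000000 "
    "00000000 00000000 00000000 00000000 "
    "00000000 93003204 10000122 0005bfd2"
)


def decode_err_cqe(dump):
    """Split a 64-byte error CQE dump into named fields (hypothetical helper)."""
    raw = bytes.fromhex("".join(dump.split()))
    if len(raw) != 64:
        raise ValueError("expected 64 bytes, got %d" % len(raw))
    # Assumed layout (struct mlx5_err_cqe), all fields big-endian:
    #   rsvd0[32], srqn (u32), rsvd1[18], vendor_err_synd (u8), syndrome (u8),
    #   s_wqe_opcode_qpn (u32), wqe_counter (u16), signature (u8), op_own (u8)
    (srqn, vendor_err_synd, syndrome, s_wqe_opcode_qpn,
     wqe_counter, signature, op_own) = struct.unpack(">32xI18xBBIHBB", raw)
    return {
        "srqn": srqn,
        "vendor_err_synd": vendor_err_synd,
        "syndrome": syndrome,
        # Top byte is the opcode of the failing WQE, low 24 bits are the QP number.
        "wqe_opcode": s_wqe_opcode_qpn >> 24,
        "qpn": s_wqe_opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "signature": signature,
        # Top nibble of op_own is the CQE opcode.
        "cqe_opcode": op_own >> 4,
    }


if __name__ == "__main__":
    for name, value in decode_err_cqe(DUMP).items():
        print("%-16s 0x%x" % (name, value))
```

Under that assumed layout the dump gives syndrome 0x04 and
vendor_err_synd 0x32, which lines up with the "status 4 ... vender_err
32" line that krping printed, and a failing WQE opcode of 0x10 on QP
0x122. Assuming 0x10 is the mlx5 RDMA read opcode, that would match the
server log's wait for RDMA_READ_COMPLETE.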

>
> Thanks,
>
> Steve.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


