Re: krping problem on 4.15-rc4

On Fri, Jan 19, 2018 at 8:57 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> On Fri, Jan 19, 2018 at 7:21 AM, Majd Dibbiny <majd@xxxxxxxxxxxx> wrote:
>>
>>> On Jan 19, 2018, at 1:09 PM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>>>
>>>> On Thu, Jan 18, 2018 at 11:13:08AM -0500, Olga Kornievskaia wrote:
>>>>> On Wed, Jan 17, 2018 at 5:03 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>> On Wed, Jan 17, 2018 at 4:03 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>>>>>>> On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
>>>>>>>> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>> On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>> Ok.  The memory probably doesn't matter.  Maybe run krping client and
>>>>>>>>>>
>>>>>>>>>> server on the same host (to use hw-loopback), and see if it works on both,
>>>>>>>>>> one, or neither systems when they are both the client and server.
>>>>>>>>>>
>>>>>>>>>> Loopback on the original "server" machine produces the same failure.
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
>>>>>>>>>> cqe
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
>>>>>>>>>> wr_id 0 status 4 opcode 0 vender_err 32
>>>>>>>>>
>>>>>>>>> Can someone from Mellanox comment more on the above CQE error?  What exactly is it telling us?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What does this mean?
>>>>>>>>>
>>>>>>>>> Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
>>>>>>>>
>>>>>>>> IOMMU (Intel's VT-d) is enabled in BIOS (on both machines).
>>>>>>>>
>>>>>>>>> Perhaps that is exposing a dma mapping problem with krping?
>>>>>>>
>>>>>>> I have replaced the CX-5 card with another one and I no longer see the
>>>>>>> krping problem.  I think that suggests it's a card issue...
>>>>>>
>>>>>> Check the firmware on the bad card.  Lots of issues disappear if you
>>>>>> have older firmware and update to the latest.
>>>>>
>>>>> That's a valid point. A check of firmware versions is needed. At the
>>>>> time of the problem, I believe both machines had the same
>>>>> firmware version. After card replacement, the replacement card
>>>>> displays newer firmware.
>>>>
>>>> I have upgraded the firmware on both machines involved to the latest
>>>> available firmware for the card and now I'm in the situation where
>>>> krping does not work on either machine --- when either of them is a
>>>> server it fails with the same information in the var log messages:
>>>
>>> Doesn't it mean that the issue is in FW?
>> Did you do cold reboot after FW upgrade?
>
> No, I have not done so. The firmware update instructions were to either
> run mlxfwreset or reboot (which I assumed would be warm). I will try a
> cold reboot.
>

I have cold rebooted the machines and still have the same problem with krping.

>>>> Jan 18 11:05:54 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 93003204 10000122 0005bfd2
>>>> Jan 18 11:05:54 localhost kernel: krping: cq completion failed with
>>>> wr_id 0 status 4 opcode 128 vender_err 32
>>>> Jan 18 11:05:54 localhost kernel: krping: cq completion in ERROR state
>>>> Jan 18 11:05:54 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
>>>>
>>>> client side logs:
>>>> Jan 18 11:14:30 localhost kernel: krping: DISCONNECT EVENT...
>>>> Jan 18 11:14:30 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
>>>> Jan 18 11:14:30 localhost kernel: krping: cq completion in ERROR state
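
For reference, the last line of the "dump error cqe" output above can be
decoded by hand. The sketch below is not part of krping or the mlx5 driver.
It assumes the dump prints the 64-byte CQE as big-endian 32-bit words and
that the tail follows struct mlx5_err_cqe as I remember it from
include/linux/mlx5/device.h (vendor_err_synd, syndrome, s_wqe_opcode_qpn,
wqe_counter, signature, op_own). The function name decode_err_cqe_tail is
made up for illustration.

```python
# Hypothetical helper: decode the last 16 bytes of an mlx5 error CQE dump.
# ASSUMPTION: field layout follows struct mlx5_err_cqe
# (include/linux/mlx5/device.h) and each printed word is be32_to_cpu'd.

# First entries of enum ib_wc_status (include/rdma/ib_verbs.h).
IB_WC_STATUS = {
    0: "IB_WC_SUCCESS",
    1: "IB_WC_LOC_LEN_ERR",
    2: "IB_WC_LOC_QP_OP_ERR",
    3: "IB_WC_LOC_EEC_OP_ERR",
    4: "IB_WC_LOC_PROT_ERR",
}


def decode_err_cqe_tail(line):
    """Decode the fourth dump line, e.g. '00000000 93003204 10000122 0005bfd2'."""
    words = [int(w, 16) for w in line.split()]
    if len(words) != 4:
        raise ValueError("expected four 32-bit hex words")
    _, w13, w14, w15 = words
    return {
        "vendor_err_synd": (w13 >> 8) & 0xFF,  # byte 54 of the CQE
        "syndrome": w13 & 0xFF,                # byte 55
        "wqe_opcode": w14 >> 24,               # top byte of s_wqe_opcode_qpn
        "qpn": w14 & 0xFFFFFF,                 # low 24 bits
        "wqe_counter": w15 >> 16,              # bytes 60-61
        "cqe_opcode": (w15 & 0xFF) >> 4,       # high nibble of op_own
    }


if __name__ == "__main__":
    cqe = decode_err_cqe_tail("00000000 93003204 10000122 0005bfd2")
    for name, value in cqe.items():
        print(f"{name}: {value:#x}")
    print("krping status 4 is", IB_WC_STATUS[4])
```

With the Jan 18 dump this yields syndrome 0x4 and vendor_err_synd 0x32,
which lines up with the "status 4 ... vender_err 32" that krping prints
(status 4 is IB_WC_LOC_PROT_ERR, a local protection error). If I have the
opcode values right, wqe_opcode 0x10 would be an RDMA READ work request,
which would fit the server stalling in "wait for RDMA_READ_COMPLETE", but
treat that part as an assumption.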