On Fri, Jan 19, 2018 at 8:57 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> On Fri, Jan 19, 2018 at 7:21 AM, Majd Dibbiny <majd@xxxxxxxxxxxx> wrote:
>>
>>> On Jan 19, 2018, at 1:09 PM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>>>
>>>> On Thu, Jan 18, 2018 at 11:13:08AM -0500, Olga Kornievskaia wrote:
>>>>> On Wed, Jan 17, 2018 at 5:03 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>> On Wed, Jan 17, 2018 at 4:03 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>>>>>>> On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
>>>>>>>> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>> On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>> Ok. The memory probably doesn't matter. Maybe run krping client and
>>>>>>>>>>> server on the same host (to use hw-loopback), and see if it works on both,
>>>>>>>>>>> one, or neither systems when they are both the client and server.
>>>>>>>>>>
>>>>>>>>>> Loopback on the original "server" machine produces the same failure.
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: krping: cq completion failed with wr_id 0 status 4 opcode 0 vender_err 32
>>>>>>>>>
>>>>>>>>> Can someone from Mellanox comment more on the above CQE error? What exactly is it telling us?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What does this mean?
>>>>>>>>>
>>>>>>>>> Not sure. But it does seem to be tied to that specific machine. Question: Is an IOMMU enabled on that system?
>>>>>>>>
>>>>>>>> IOMMU (Intel's VT-d) is enabled in BIOS (on both machines).
>>>>>>>>
>>>>>>>>> Perhaps that is exposing a dma mapping problem with krping?
>>>>>>>
>>>>>>> I have replaced the CX-5 card with another one and I no longer see the
>>>>>>> krping problem. I think this indicates that it's a card issue...
>>>>>>
>>>>>> Check the firmware on the bad card. Lots of issues disappear if you
>>>>>> have older firmware and update to the latest.
>>>>>
>>>>> That's a valid point. A check of firmware versions is needed. At the
>>>>> time of the problem, I believe I had two machines that each had the same
>>>>> firmware version. After card replacement, the replacement card
>>>>> displays newer firmware.
>>>>
>>>> I have upgraded the firmware on both machines involved to the latest
>>>> available firmware for the card and now I'm in the situation where
>>>> krping does not work on either machine --- when either of them is a
>>>> server it fails with the same information in the var log messages:
>>>
>>> Doesn't it mean that the issue is in FW?
>> Did you do a cold reboot after the FW upgrade?
>
> No, I have not done so. Firmware update instructions were to either
> mlxfwreset or reboot (which I assumed would be warm). I will try a
> cold reboot.
>
I have cold rebooted the machines and still have the same problem with krping.
>>>> Jan 18 11:05:54 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 93003204 10000122 0005bfd2
>>>> Jan 18 11:05:54 localhost kernel: krping: cq completion failed with wr_id 0 status 4 opcode 128 vender_err 32
>>>> Jan 18 11:05:54 localhost kernel: krping: cq completion in ERROR state
>>>> Jan 18 11:05:54 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
>>>>
>>>> client side logs:
>>>> Jan 18 11:14:30 localhost kernel: krping: DISCONNECT EVENT...
>>>> Jan 18 11:14:30 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
>>>> Jan 18 11:14:30 localhost kernel: krping: cq completion in ERROR state
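
On the question of what the error CQE is telling us: below is a rough Python sketch that splits the last row of the dump into fields. It is not taken from the driver. The field layout is an assumption based on my reading of struct mlx5_err_cqe in the kernel's include/linux/mlx5/device.h (vendor_err_synd, syndrome, s_wqe_opcode_qpn, wqe_counter, signature, op_own in the final 12 bytes), and it assumes the dump prints each 32-bit word in big-endian order. The function name decode_err_cqe_tail and the dictionary keys are made up for illustration.

```python
import struct


def decode_err_cqe_tail(last_row):
    """Decode the last 16 bytes of a 64-byte mlx5 error CQE dump row.

    last_row is the fourth line of the dump, e.g.
    "00000000 93003204 10000122 0005bfd2".
    Layout is ASSUMED from struct mlx5_err_cqe; verify against your kernel.
    """
    # Rebuild the raw bytes from the four big-endian hex words.
    raw = b"".join(struct.pack(">I", int(word, 16)) for word in last_row.split())
    if len(raw) != 16:
        raise ValueError("expected four 32-bit hex words")

    # 4 + 2 reserved bytes, then the error fields (assumed offsets 54..63).
    (_rsvd_a, _rsvd_b, vendor_err_synd, syndrome,
     s_wqe_opcode_qpn, wqe_counter, signature, op_own) = struct.unpack(
        ">IHBBIHBB", raw)

    return {
        "vendor_err_synd": vendor_err_synd,      # vendor specific, no meaning assumed
        "syndrome": syndrome,                    # appears to match "status" in the krping line
        "wqe_opcode": s_wqe_opcode_qpn >> 24,    # opcode of the failed send WQE
        "qpn": s_wqe_opcode_qpn & 0xFFFFFF,      # QP number the WQE was posted on
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,               # CQE type (requester/responder error)
        "owner": op_own & 0x1,
    }


if __name__ == "__main__":
    fields = decode_err_cqe_tail("00000000 93003204 10000122 0005bfd2")
    for name, value in fields.items():
        print(f"{name:16s} 0x{value:x}")
```

If the layout assumption holds, the Jan 18 row gives syndrome 0x04 and vendor_err_synd 0x32, which lines up with "status 4" and "vender_err 32" in the krping message (krping seems to print the vendor error in hex). As far as I recall, syndrome 0x04 is the local protection error, the WQE opcode 0x10 is an RDMA read, and CQE opcode 0xd is a requester-side error, but those three mappings should be checked against the headers before relying on them.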