----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>
> To: leonro@xxxxxxxxxxxx, loberman@xxxxxxxxxx
> Cc: maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx, sagi@xxxxxxxxxxx
> Sent: Tuesday, April 25, 2017 11:39:12 PM
> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
>
> On Tue, 2017-04-25 at 16:37 -0400, Laurence Oberman wrote:
> > Hello Bart, Leon, Max and Israel.
> >
> > I cloned off Barts tree.
> >
> > git clone https://github.com/bvanassche/linux
> > cd linux
> > git checkout block-scsi-for-next
> >
> > I checked all patches were in for this test.
> >
> > a83e404 IB/srp: Reenable IB_MR_TYPE_SG_GAPS
> > dfa5a2b mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
> > f759c80 mlx5: Fix mlx5_ib_map_mr_sg mr lengt
> >
> > Built and tested the kernel.
> >
> > However this issue is not resolved :(
> >
> > [ 2707.931909] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > CQE ffff8817edca86b0
> > [ 2708.089806] mlx5_0:dump_cqe:262:(pid 20129): dump error cqe
> > [ 2708.121342] 00000000 00000000 00000000 00000000
> > [ 2708.147104] 00000000 00000000 00000000 00000000
> > [ 2708.172633] 00000000 00000000 00000000 00000000
> > [ 2708.198702] 00000000 0f007806 2500002a 14a527d0
> > [ 2732.434127] scsi host1: ib_srp: reconnect succeeded
> > [ 2733.048023] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > CQE ffff8817ed0a9c30
>
> Hello Laurence,
>
> Thank you for having run this test. But are you aware that if a flush error
> is reported at the initiator side that does not necessarily mean that there
> is a bug at the initiator side? If e.g. the target system would initiate a
> disconnect that would also trigger this kind of flush errors. What kind of
> SRP target system was used in this test? Were the clocks of initiator and
> target system synchronized? Are the logs of the target system available?
> If so, can you have a look whether anything interesting can be found in the
> target log around the time the initiator reported the flush error?
>
> Thanks,
>
> Bart.

Hi Bart,

It's the same target that is stable for all other tests.
This is the same issue I originally reported, after which we reverted SG+GAPS.
Remember that when I reverted that, we were stable again.

This happens on the initiator first:

[root@localhost ~]#
[ 512.375904] mlx5_0:dump_cqe:262:(pid 4653): dump error cqe
[ 512.376648] scsi host2: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817c596f770
[ 512.454276] 00000000 00000000 00000000 00000000
[ 512.478734] 00000000 00000000 00000000 00000000
[ 512.504170] 00000000 00000000 00000000 00000000
[ 512.529457] 00000000 0f007806 2500002a 0548e2d0
[ 532.128455] scsi host2: ib_srp: reconnect succeeded
[ 532.232126] scsi host2: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf2bb3bf0
[ 532.780107] mlx5_0:dump_cqe:262:(pid 511): dump error cqe
[ 532.811863] 00000000 00000000 00000000 00000000
[ 532.837984] 00000000 00000000 00000000 00000000
[ 532.863955] 00000000 00000000 00000000 00000000
[ 532.889885] 00000000 0f007806 25000032 00683bd0

Only afterwards do I see the target complain:

[root@fedstorage ~]#
[ 537.105985] ib_srpt Received CM TimeWait exit for ch 0x4e6e72000390fe7c7cfe900300726ed2-48.
[ 537.152767] ib_srpt Received CM TimeWait exit for ch 0x4e6e72000390fe7c7cfe900300726ed2-47.
[ 537.200585] ib_srpt Received CM TimeWait exit for ch 0x4e6e72000390fe7c7cfe900300726ed2-46.
[ 537.247864] ib_srpt Received CM TimeWait exit for ch 0x4e6e72000390fe7c7cfe900300726ed2-45.
[ 537.296822] ib_srpt Received CM TimeWait exit for ch 0x4e6e72000390fe7c7cfe900300726ed2-44.
[ 537.345001] ib_srpt Received CM TimeWait exit for ch 0x4e6e72000390fe7c7cfe900300726ed2-43.
[ 537.394146] ib_srpt Received CM TimeWait exit for ch 0x4e6e72000390fe7c7cfe900300726ed2-42.
[ 537.442148] ib_srpt Received CM TimeWait exit for ch 0x4e6e72000390fe7c7cfe900300726ed2-41.
[ 537.490011] ib_srpt sending response for ioctx 0xffff8800951ed800 failed with status 5
[ 539.774018] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x4e6e72000390fe7c:0x7cfe900300726ed2, t_port_id 0x7cfe900300726e4e:0x7cfe900300726e4e and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0x7cfe900300726e4e)
[ 539.887987] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x4e6e72000390fe7c:0x7cfe900300726ed2, t_port_id 0x7cfe900300726e4e:0x7cfe900300726e4e and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0x7cfe900300726e4e)
[ 540.001241] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x4e6e72000390fe7c:0x7cfe900300726ed2, t_port_id 0x7cfe900300726e4e:0x7cfe900300726e4e and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0x7cfe900300726e4e)
[ 540.111455] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x4e6e72000390fe7c:0x7cfe900300726ed2, t_port_id 0x7cfe900300726e4e:0x7cfe900300726e4e and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0x7cfe900300726e4e)
[ 540.224780] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x4e6e72000390fe7c:0x7cfe900300726ed2, t_port_id 0x7cfe900300726e4e:0x7cfe900300726e4e and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0x7cfe900300726e4e)
[ 540.340522] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x4e6e72000390fe7c:0x7cfe900300726ed2, t_port_id 0x7cfe900300726e4e:0x7cfe900300726e4e and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0x7cfe900300726e4e)
[ 540.453736] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x4e6e72000390fe7c:0x7cfe900300726ed2, t_port_id 0x7cfe900300726e4e:0x7cfe900300726e4e and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0x7cfe900300726e4e)
[ 540.567043] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x4e6e72000390fe7c:0x7cfe900300726ed2, t_port_id 0x7cfe900300726e4e:0x7cfe900300726e4e and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0x7cfe900300726e4e)

Thanks
Laurence