Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP

----- Original Message -----
> From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> To: "Leon Romanovsky" <leon@xxxxxxxxxx>
> Cc: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>, hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx,
> linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> Sent: Monday, February 13, 2017 11:12:55 AM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> 
> 
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > To: "Leon Romanovsky" <leon@xxxxxxxxxx>
> > Cc: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>, hch@xxxxxx,
> > maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx,
> > linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > Sent: Monday, February 13, 2017 9:24:01 AM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > QP
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Leon Romanovsky" <leon@xxxxxxxxxx>
> > > To: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > > Cc: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>, hch@xxxxxx,
> > > maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx,
> > > linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > > Sent: Monday, February 13, 2017 9:17:24 AM
> > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > > QP
> > > 
> > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > > > > To: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>
> > > > > Cc: leon@xxxxxxxxxx, hch@xxxxxx, maxg@xxxxxxxxxxxx,
> > > > > israelr@xxxxxxxxxxxx,
> > > > > linux-rdma@xxxxxxxxxxxxxxx,
> > > > > dledford@xxxxxxxxxx
> > > > > Sent: Sunday, February 12, 2017 10:14:53 PM
> > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > destroying
> > > > > a
> > > > > QP
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > > > > > To: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>
> > > > > > Cc: leon@xxxxxxxxxx, hch@xxxxxx, maxg@xxxxxxxxxxxx,
> > > > > > israelr@xxxxxxxxxxxx,
> > > > > > linux-rdma@xxxxxxxxxxxxxxx,
> > > > > > dledford@xxxxxxxxxx
> > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM
> > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > destroying
> > > > > > a
> > > > > > QP
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>
> > > > > > > To: leon@xxxxxxxxxx, loberman@xxxxxxxxxx
> > > > > > > Cc: hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx,
> > > > > > > linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > > destroying a
> > > > > > > QP
> > > > > > >
> > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > > > > [  861.143141] WARNING: CPU: 27 PID: 1103 at drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
> > > > > > > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > > > > > >
> > > > > > > Hello Laurence,
> > > > > > >
> > > > > > > That warning has been removed by patch 7/8 of this series. Please double
> > > > > > > check whether all eight patches have been applied properly.
> > > > > > >
> > > > > > > Bart.
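
For context, the "IB_POLL_DIRECT poll_ctx not supported for drain" warning quoted above is emitted by a guard at the top of __ib_drain_sq() in drivers/infiniband/core/verbs.c. A minimal sketch of that guard, paraphrased from v4.10-era code (exact details may differ):

static void __ib_drain_sq(struct ib_qp *qp)
{
	struct ib_cq *cq = qp->send_cq;

	/*
	 * Draining posts a marker send WR and blocks until its
	 * completion is reaped. That only works when the core polls
	 * the CQ itself (IB_POLL_SOFTIRQ or IB_POLL_WORKQUEUE); with
	 * IB_POLL_DIRECT the ULP does the polling, so the core cannot
	 * wait and warns instead. WARN_ONCE() returns the condition,
	 * so the drain is skipped as well.
	 */
	if (WARN_ONCE(cq->poll_ctx == IB_POLL_DIRECT,
		      "IB_POLL_DIRECT poll_ctx not supported for drain\n"))
		return;

	/* ... move the QP to IB_QPS_ERR, post the marker WR, and wait
	 * for its flush completion ... */
}

Since ib_srp allocates its send CQ with IB_POLL_DIRECT, draining an SRP queue pair trips this check, which is why patch 7/8 adds IB_POLL_DIRECT support to the drain helpers before patch 8/8 starts draining the send queue.
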
> > > > > >
> > > > > > Hello,
> > > > > > Just a heads up: I am working with Bart on this patch series, and we are
> > > > > > seeing stability issues in my tests on my MLX5 EDR-100 test bed.
> > > > > > Thanks
> > > > > > Laurence
> > > > > >
> > > > >
> > > > > I went back to Linus's latest tree for a baseline, and it fails the same
> > > > > way. That tree has none of the latest 8 patches applied, so we will
> > > > > have to figure out what broke this.
> > > > >
> > > > > Don't forget that I tested all of this recently with Bart's dma patch
> > > > > series, and it's solid.
> > > > >
> > > > > I will come back to this tomorrow and see what recently made it into
> > > > > Linus's tree by checking back with Doug.
> > > > >
> > > > > [  183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0
> > > > > [  183.853047] 00000000 00000000 00000000 00000000
> > > > > [  183.878425] 00000000 00000000 00000000 00000000
> > > > > [  183.903243] 00000000 00000000 00000000 00000000
> > > > > [  183.928518] 00000000 0f007806 2500002a ad9fafd1
> > > > > [  198.538593] scsi host1: ib_srp: reconnect succeeded
> > > > > [  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > > > > [  198.603037] 00000000 00000000 00000000 00000000
> > > > > [  198.628884] 00000000 00000000 00000000 00000000
> > > > > [  198.653961] 00000000 00000000 00000000 00000000
> > > > > [  198.680021] 00000000 0f007806 25000032 00105dd0
> > > > > [  198.705985] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138
> > > > > [  213.532848] scsi host1: ib_srp: reconnect succeeded
> > > > > [  213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  227.579684] scsi host1: ib_srp: reconnect succeeded
> > > > > [  227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  242.633925] scsi host1: ib_srp: reconnect succeeded
> > > > > [  242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  257.127715] scsi host1: ib_srp: reconnect succeeded
> > > > > [  257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  272.225762] scsi host1: ib_srp: reconnect succeeded
> > > > > [  272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  286.350226] scsi host1: ib_srp: reconnect succeeded
> > > > > [  286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  301.109365] scsi host1: ib_srp: reconnect succeeded
> > > > > [  301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  315.910860] scsi host1: ib_srp: reconnect succeeded
> > > > > [  315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  330.551052] scsi host1: ib_srp: reconnect succeeded
> > > > > [  330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  344.998448] scsi host1: ib_srp: reconnect succeeded
> > > > > [  345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  359.866731] scsi host1: ib_srp: reconnect succeeded
> > > > > [  359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > ..
> > > > > ..
> > > > > [  373.113045] scsi host1: ib_srp: reconnect succeeded
> > > > > [  373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > > > > [  388.589517] scsi host1: ib_srp: reconnect succeeded
> > > > > [  388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  403.086893] scsi host1: ib_srp: reconnect succeeded
> > > > > [  403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > > [  403.140403] 00000000 00000000 00000000 00000000
> > > > > [  403.140403] 00
> > > > >
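
As an aside, the repeating "failed RECV status WR flushed (5) for CQE ..." lines above are printed by the SRP completion error handler. A rough sketch of that handler, paraphrased from srp_handle_qp_err() in drivers/infiniband/ulp/srp/ib_srp.c around v4.10 (details may differ):

static void srp_handle_qp_err(struct ib_cq *cq, struct ib_wc *wc,
			      const char *opname)
{
	struct srp_rdma_ch *ch = cq->cq_context;
	struct srp_target_port *target = ch->target;

	/*
	 * Log the failed completion only while the channel is still
	 * considered connected; once the QP is marked in error, the
	 * storm of flushed completions is expected and not logged
	 * again until the next reconnect.
	 */
	if (ch->connected && !target->qp_in_error)
		shost_printk(KERN_ERR, target->scsi_host,
			     PFX "failed %s status %s (%d) for CQE %p\n",
			     opname, ib_wc_status_msg(wc->status),
			     wc->status, wc->wr_cqe);
	target->qp_in_error = true;
}

Status 5 is IB_WC_WR_FLUSH_ERR, i.e. work requests flushed after the QP entered the error state; the status 6 "memory management operation error" on the FAST REG WR in the log looks like the first real failure, with the flushes following from it.
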
> > > > Hello,
> > > >
> > > > Let me summarize where we are and how we got here.
> > > >
> > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with
> > > > Bart's dma patches. All tests passed.
> > > >
> > > > I pulled Linus's tree and applied all 8 patches of the above series, and
> > > > it failed in the "failed FAST REG status memory management" area.
> > > >
> > > > I then applied only 7 of the 8 patches to Linus's tree, because Bart and I
> > > > thought patch 6 of the series may have been the catalyst.
> > > >
> > > > That also failed.
> > > >
> > > > Building from Bart's tree, which is based on 4.10.0-rc7, failed again.
> > > >
> > > > That made me decide to baseline Linus's tree at 4.10.0-rc7, and it fails too.
> > > >
> > > > So something affecting mlx5 and ib_srp has crept into 4.10.0-rc7.
> > > 
> > > From the infiniband side:
> > > ➜  linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- drivers/inifiniband |wc
> > >       0       0       0
> > > 
> > > From the eth side, nothing suspicious either:
> > > ➜  linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- drivers/net/ethernet/mellanox/mlx5
> > > d15118af2683 net/mlx5e: Check ets capability before ets query FW command
> > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
> > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
> > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices
> > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
> > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space fails
> > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
> > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
> > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
> > > ad05df399f33 net/mlx5e: Remove unused variable
> > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels
> > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning
> > > 
> > > >
> > > > Thanks
> > > > Laurence
> > > 
> > 
> > Hi Leon,
> > Yep, I also looked for outliers here that might look suspicious and did not
> > see any.
> > 
> > I guess I will have to start bisecting. I will start with rc5; if that
> > fails, I will bisect between rc4 and rc5, since we know rc4 was fine.
> > 
> > I re-ran the tests on rc4 last night and it was stable.
> > 
> > Thanks
> > Laurence
> > 
> 
> OK, so 4.10.0-rc5 is fine and 4.10.0-rc6 fails, so I will start bisecting,
> unless one of you thinks you know what may be causing this in rc6.
> This will take time, so I will come back to the list once I have it isolated.
> 
> Thanks
> Laurence
> 
The bisect covers 200+ changes, which works out to about 8 kernel builds to test (git bisect halves the range each step, and 2^8 = 256); I have started the first one.

Thanks
Laurence