----- Original Message -----
> From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> To: "Leon Romanovsky" <leon@xxxxxxxxxx>
> Cc: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>, hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> Sent: Monday, February 13, 2017 11:47:31 AM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
>
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > To: "Leon Romanovsky" <leon@xxxxxxxxxx>
> > Cc: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>, hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > Sent: Monday, February 13, 2017 11:12:55 AM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> >
> > ----- Original Message -----
> > > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > > To: "Leon Romanovsky" <leon@xxxxxxxxxx>
> > > Cc: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>, hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > > Sent: Monday, February 13, 2017 9:24:01 AM
> > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> > >
> > > ----- Original Message -----
> > > > From: "Leon Romanovsky" <leon@xxxxxxxxxx>
> > > > To: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > > > Cc: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>, hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > > > Sent: Monday, February 13, 2017 9:17:24 AM
> > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> > > >
> > > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > > > > > To: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>
> > > > > > Cc: leon@xxxxxxxxxx, hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > > > > > Sent: Sunday, February 12, 2017 10:14:53 PM
> > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > > > > > > To: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>
> > > > > > > Cc: leon@xxxxxxxxxx, hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM
> > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Bart Van Assche" <Bart.VanAssche@xxxxxxxxxxx>
> > > > > > > > To: leon@xxxxxxxxxx, loberman@xxxxxxxxxx
> > > > > > > > Cc: hch@xxxxxx, maxg@xxxxxxxxxxxx, israelr@xxxxxxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, dledford@xxxxxxxxxx
> > > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> > > > > > > >
> > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
> > > > > > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > > > > > > >
> > > > > > > > Hello Laurence,
> > > > > > > >
> > > > > > > > That warning has been removed by patch 7/8 of this series. Please double check
> > > > > > > > whether all eight patches have been applied properly.
> > > > > > > >
> > > > > > > > Bart.
> > > > > > >
> > > > > > > Hello,
> > > > > > > Just a heads up: I am working with Bart on this patch series.
> > > > > > > We have stability issues with my tests in my MLX5 EDR-100 test bed.
> > > > > > > Thanks
> > > > > > > Laurence
> > > > > >
> > > > > > I went back to Linus's latest tree for a baseline and we fail the same way.
> > > > > > This has none of the latest 8 patches applied, so we will have to figure out what broke this.
> > > > > >
> > > > > > Don't forget that I tested all of this recently with Bart's DMA patch series and it was solid.
> > > > > >
> > > > > > Will come back to this tomorrow and see what recently made it into Linus's tree by checking back with Doug.
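For anyone trying to make sense of the IB_POLL_DIRECT warning quoted above: as far as I understand it, __ib_drain_sq() works by posting a final work request and waiting for its completion, and that only works when ib_core itself reaps the CQ (IB_POLL_SOFTIRQ or IB_POLL_WORKQUEUE). With IB_POLL_DIRECT the ULP polls the CQ itself, so the core has nothing to wait on and simply warns and skips the drain. A rough, self-contained userspace model of that guard follows; the names mirror ib_core, but this is illustrative, not the actual verbs.c code.

#include <stdio.h>

enum ib_poll_context {
	IB_POLL_DIRECT,    /* the ULP polls the CQ itself */
	IB_POLL_SOFTIRQ,   /* ib_core polls the CQ from softirq */
	IB_POLL_WORKQUEUE, /* ib_core polls the CQ from a workqueue */
};

struct ib_cq { enum ib_poll_context poll_ctx; };
struct ib_qp { struct ib_cq *send_cq; };

/* Draining relies on ib_core reaping the drain completion itself. */
static void drain_sq_model(struct ib_qp *qp)
{
	if (qp->send_cq->poll_ctx == IB_POLL_DIRECT) {
		fprintf(stderr, "WARNING: IB_POLL_DIRECT poll_ctx not supported for drain\n");
		return; /* nobody would ever see the drain completion */
	}
	printf("post drain WR, wait for its completion, then tear down the QP\n");
}

int main(void)
{
	struct ib_cq send_cq = { .poll_ctx = IB_POLL_DIRECT };
	struct ib_qp qp = { .send_cq = &send_cq };

	drain_sq_model(&qp); /* prints the warning seen in the trace above */
	return 0;
}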
> > > > > >
> > > > > > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0
> > > > > > [ 183.853047] 00000000 00000000 00000000 00000000
> > > > > > [ 183.878425] 00000000 00000000 00000000 00000000
> > > > > > [ 183.903243] 00000000 00000000 00000000 00000000
> > > > > > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1
> > > > > > [ 198.538593] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > > > > > [ 198.603037] 00000000 00000000 00000000 00000000
> > > > > > [ 198.628884] 00000000 00000000 00000000 00000000
> > > > > > [ 198.653961] 00000000 00000000 00000000 00000000
> > > > > > [ 198.680021] 00000000 0f007806 25000032 00105dd0
> > > > > > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138
> > > > > > [ 213.532848] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 227.579684] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 242.633925] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 257.127715] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 272.225762] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 286.350226] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 301.109365] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 315.910860] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 330.551052] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 344.998448] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 359.866731] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > ..
> > > > > > ..
> > > > > > [ 373.113045] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > > > > > [ 388.589517] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 403.086893] scsi host1: ib_srp: reconnect succeeded
> > > > > > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > > > > > [ 403.140402] 00000000 00000000 00000000 00000000
> > > > > > [ 403.140402] 00000000 00000000 00000000 00000000
> > > > > > [ 403.140403] 00000000 00000000 00000000 00000000
> > > > > > [ 403.140403] 00
> > > > >
> > > > > Hello
> > > > >
> > > > > Let me summarize where we are and how we got here.
> > > > >
> > > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with Bart's DMA patches.
> > > > > All tests passed.
> > > > >
> > > > > I pulled Linus's tree and applied all 8 patches of the above series, and we failed in the "failed FAST REG status memory management" area.
> > > > >
> > > > > I then applied only 7 of the 8 patches to Linus's tree, because Bart and I thought patch 6 of the series may have been the catalyst.
> > > > >
> > > > > This also failed.
> > > > >
> > > > > Building from Bart's tree, which is based on 4.10.0-rc7, failed again.
> > > > >
> > > > > This made me decide to baseline Linus's tree at 4.10.0-rc7, and it fails as well.
> > > > >
> > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and ib_srp.
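A side note on decoding the messages quoted above: the number in parentheses is the IB work-completion status. Going by the verbs definitions as I read them, 5 is IB_WC_WR_FLUSH_ERR ("WR flushed": the QP went into the error state and the posted receive was flushed without executing), and 6 is printed as "memory management operation error" (IB_WC_MW_BIND_ERR), which ib_srp reports when a FAST REG registration work request fails. A tiny stand-alone decoder for just those two values (illustrative, not the kernel's table):

#include <stdio.h>

/* Subset of the IB work-completion statuses, as I read the verbs headers. */
enum wc_status {
	WC_SUCCESS      = 0,
	WC_WR_FLUSH_ERR = 5, /* "WR flushed": QP in error, WR never executed */
	WC_MW_BIND_ERR  = 6, /* "memory management operation error": registration op failed */
};

static const char *wc_status_str(int status)
{
	switch (status) {
	case WC_SUCCESS:      return "success";
	case WC_WR_FLUSH_ERR: return "WR flushed";
	case WC_MW_BIND_ERR:  return "memory management operation error";
	default:              return "other";
	}
}

int main(void)
{
	const int seen[] = { 5, 6 }; /* the two statuses ib_srp is logging above */

	for (unsigned i = 0; i < sizeof(seen) / sizeof(seen[0]); i++)
		printf("status %d -> %s\n", seen[i], wc_status_str(seen[i]));
	return 0;
}

The timestamps above show that cycle repeating roughly every 15 seconds after each reconnect.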
> > > > From the infiniband side:
> > > > ➜ linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- drivers/inifiniband |wc
> > > > 0 0 0
> > > >
> > > > From the eth side, nothing suspicious either:
> > > > ➜ linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- drivers/net/ethernet/mellanox/mlx5
> > > > d15118af2683 net/mlx5e: Check ets capability before ets query FW command
> > > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
> > > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
> > > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices
> > > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
> > > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space fails
> > > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
> > > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
> > > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
> > > > ad05df399f33 net/mlx5e: Remove unused variable
> > > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels
> > > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning
> > > >
> > > > > Thanks
> > > > > Laurence
> > >
> > > Hi Leon,
> > > Yep, I also looked for outliers here that might look suspicious and did not see any.
> > >
> > > I guess I will have to start bisecting.
> > > I will start with rc5; if that fails, I will bisect between rc4 and rc5, as we know rc4 was fine.
> > >
> > > I did re-run the tests on rc4 last night and it was stable.
> > >
> > > Thanks
> > > Laurence
> >
> > OK, so 4.10.0-rc5 is fine and 4.10.0-rc6 fails, so I will start bisecting.
> > Unless one of you thinks you know what may be causing this in rc6.
> > This will take time, so I will come back to the list once I have it isolated.
> >
> > Thanks
> > Laurence
>
> The bisect has 8 possible kernel builds (200+ changes); I have started the first one.
>
> Thanks
> Laurence

Hello

Bisecting got me to the commit below. I had reviewed it at some point while looking for an explanation; at the time I did not understand the need for the change, but after it was explained I accepted it.

I reverted it and we are good again, but reading the code I do not see how it affects us. It makes no sense that this can be the issue. Nevertheless, we will need to revert this, please.

I will now apply the 8 patches from Bart to Linus's tree with this reverted and test again.
Bisect run

git bisect start
git bisect bad 566cf877a1fcb6d6dc0126b076aad062054c2637
git bisect good 7a308bb3016f57e5be11a677d15b821536419d36
git bisect good
git bisect good
git bisect bad
git bisect bad
git bisect bad
git bisect bad
git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[0a475ef4226e305bdcffe12b401ca1eab06c4913] IB/srp: fix invalid indirect_sg_entries parameter value

[loberman@ibclient linux-torvalds]$ git show 0a475ef4226e305bdcffe12b401ca1eab06c4913
commit 0a475ef4226e305bdcffe12b401ca1eab06c4913
Author: Israel Rukshin <israelr@xxxxxxxxxxxx>
Date:   Wed Jan 4 15:59:37 2017 +0200

    IB/srp: fix invalid indirect_sg_entries parameter value

    After setting indirect_sg_entries module_param to huge value (e.g 500,000),
    srp_alloc_req_data() fails to allocate indirect descriptors for the request
    ring (kmalloc fails). This commit enforces the maximum value of
    indirect_sg_entries to be SG_MAX_SEGMENTS as signified in module param
    description.

    Fixes: 65e8617fba17 (scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS)
    Fixes: c07d424d6118 (IB/srp: add support for indirect tables that don't fit in SRP_CMD)
    Cc: stable@xxxxxxxxxxxxxxx # 4.7+
    Signed-off-by: Israel Rukshin <israelr@xxxxxxxxxxxx>
    Signed-off-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
    Reviewed-by: Laurence Oberman <loberman@xxxxxxxxxx>
    Reviewed-by: Bart Van Assche <bart.vanassche@xxxxxxxxxxx>
    Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0f67cf9..79bf484 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3699,6 +3699,12 @@ static int __init srp_init_module(void)
 		indirect_sg_entries = cmd_sg_entries;
 	}
 
+	if (indirect_sg_entries > SG_MAX_SEGMENTS) {
+		pr_warn("Clamping indirect_sg_entries to %u\n",
+			SG_MAX_SEGMENTS);
+		indirect_sg_entries = SG_MAX_SEGMENTS;
+	}
+
 	srp_remove_wq = create_workqueue("srp_remove");
 	if (!srp_remove_wq) {
 		ret = -ENOMEM;
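To put rough numbers on the commit message above (my own back-of-the-envelope, with assumptions: the indirect table entry is the 16-byte {va, key, len} SRP direct-buffer descriptor, SG_MAX_SEGMENTS is 2048, and kmalloc() tops out at roughly 4 MiB of physically contiguous memory on a common x86_64 config): indirect_sg_entries=500,000 means an ~8 MB table per request, which kmalloc() cannot provide, while the clamped value needs about 32 KB. A small stand-alone sketch of the arithmetic:

/*
 * Back-of-the-envelope for the commit above. The descriptor size,
 * SG_MAX_SEGMENTS value and kmalloc limit are assumptions, not taken
 * from this thread.
 */
#include <stdio.h>

#define DESC_SIZE            16UL        /* sizeof(struct srp_direct_buf), assumed */
#define SG_MAX_SEGMENTS      2048UL      /* clamp value used by the patch */
#define KMALLOC_MAX_ASSUMED  (4UL << 20) /* ~4 MiB contiguous, assumed */

static void report(const char *label, unsigned long nents)
{
	unsigned long bytes = nents * DESC_SIZE; /* per-request indirect table */

	printf("%-28s %8lu entries -> %10lu bytes per request (%s)\n",
	       label, nents, bytes,
	       bytes > KMALLOC_MAX_ASSUMED ? "kmalloc would fail" : "fits");
}

int main(void)
{
	report("indirect_sg_entries=500000", 500000UL);        /* the failing case */
	report("clamped to SG_MAX_SEGMENTS", SG_MAX_SEGMENTS); /* after the patch */
	return 0;
}

Note that the clamp only changes anything when indirect_sg_entries (or cmd_sg_entries feeding it) is set above SG_MAX_SEGMENTS, so on a default configuration this hunk should be a no-op, which is why it is so surprising that reverting it makes a difference here.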