> -----Original Message-----
> From: Bob Pearson <rpearsonhpe@xxxxxxxxx>
> Sent: Wednesday, 23 August 2023 18:19
> To: Bart Van Assche <bvanassche@xxxxxxx>; Shinichiro Kawasaki
> <shinichiro.kawasaki@xxxxxxx>
> Cc: linux-rdma@xxxxxxxxxxxxxxx; linux-scsi@xxxxxxxxxxxxxxx
> Subject: [EXTERNAL] Re: [bug report] blktests srp/002 hang
>
> On 8/22/23 10:20, Bart Van Assche wrote:
> > On 8/22/23 03:18, Shinichiro Kawasaki wrote:
> >> CC+: Bart,
> >>
> >> On Aug 21, 2023 / 20:46, Bob Pearson wrote:
> >> [...]
> >>> Shinichiro,
> >>
> >> Hello Bob, thanks for the response.
> >>
> >>>
> >>> I have been aware for a long time that there is a problem with
> >>> blktests/srp. I see hangs in 002 and 011 fairly often.
> >>
> >> I repeated the test case srp/011, and observed it hangs. This hang at
> >> srp/011 also can be recreated in stable manner. I reverted the commit
> >> 9b4b7c1f9f54 then observed the srp/011 hang disappeared. So, I guess
> >> these two hangs have same root cause.
> >>
> >>> I have not been able to figure out the root cause but suspect that
> >>> there is a timing issue in the srp drivers which cannot handle the
> >>> slowness of the software RoCE implementation. If you can give me any
> >>> clues about what you are seeing I am happy to help try to figure
> >>> this out.
> >>
> >> Thanks for sharing your thoughts. I myself do not have srp driver
> >> knowledge, and not sure what clue I should provide. If you have any
> >> idea of the action I can take, please let me know.
> >
> > Hi Shinichiro and Bob,
> >
> > When I initially developed the SRP tests these were working reliably in
> > combination with the rdma_rxe driver. Since 2017 I frequently see issues
> > when running the SRP tests on top of the rdma_rxe driver, issues that I
> > do not see if I run the SRP tests on top of the soft-iWARP driver (siw).
> > How about changing the default for the SRP tests from rdma_rxe to siw
> > and to let the RDMA community resolve the rdma_rxe issues?
> >
> > Thanks,
> >
> > Bart.
> >
>
> Bart,
>
> I have also seen the same hangs in siw. Not as frequently but the same
> symptoms. About every month or so I take another run at trying to find
> and fix this bug but I have not succeeded yet. I haven't seen anything
> that looks like bad behavior from the rxe side but that doesn't prove
> anything. I also saw these hangs on my system before the WQ patch went
> in if my memory serves. Our main application for this driver at HPE is
> Lustre which is a little different than SRP but uses the same general
> approach with fast MRs. Currently we are finding the driver to be quite
> stable even under very heavy stress.
>
> I would be happy to collaborate with someone (you?) who knows the SRP
> side well to resolve this hang. I think that is the quickest way to fix
> this. I have no idea what SRP is waiting for.
>
> Best regards,
>
> Bob

Hi Bart,

I spent some time testing the srp/002 blktest with siw, still trying to
get it to hang. Looking closer into the logs: while RDMA CM connection
setup works most of the time, I also see some connection rejects being
generated by the passive ULP side during setup:

[16848.757937] scsi host11: ib_srp: REJ received
[16848.757939] scsi host11: REJ reason 0xffffff98

This does not affect the overall success of the current test run; other
connect attempts succeed, etc.

Is that connection rejection intended behavior of the test?

Thanks!
Bernard.
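
For reference, the REJ reason above can be read as a signed value. The
sketch below assumes the number is a negative errno that was printed with
an unsigned 32-bit hex format; that is an interpretation of the log line,
not something the log itself confirms. The helper names and the small
errno table are illustrative only, and the two errno values are the Linux
asm-generic ones.

```python
import struct

# Linux asm-generic errno values (assumption: the log is from such a system).
LINUX_ERRNO = {104: "ECONNRESET", 111: "ECONNREFUSED"}


def decode_rej_reason(raw: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", raw & 0xFFFFFFFF))[0]


def describe(raw: int) -> tuple:
    """Return (signed value, errno name or 'unknown') for a logged reason."""
    signed = decode_rej_reason(raw)
    return signed, LINUX_ERRNO.get(-signed, "unknown")


if __name__ == "__main__":
    print(describe(0xFFFFFF98))  # (-104, 'ECONNRESET')
```

Under that assumption the reason is -104, i.e. -ECONNRESET, which would
mean the status is a generic errno rather than an SRP-specific reject
code.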