On 8/22/23 10:20, Bart Van Assche wrote:
> On 8/22/23 03:18, Shinichiro Kawasaki wrote:
>> CC+: Bart,
>>
>> On Aug 21, 2023 / 20:46, Bob Pearson wrote:
>> [...]
>>> Shinichiro,
>>
>> Hello Bob, thanks for the response.
>>
>>>
>>> I have been aware for a long time that there is a problem with blktests/srp. I see hangs in
>>> 002 and 011 fairly often.
>>
>> I repeated the test case srp/011, and observed it hangs. This hang at srp/011
>> also can be recreated in stable manner. I reverted the commit 9b4b7c1f9f54
>> then observed the srp/011 hang disappeared. So, I guess these two hangs have
>> same root cause.
>>
>>> I have not been able to figure out the root cause but suspect that
>>> there is a timing issue in the srp drivers which cannot handle the slowness of the software
>>> RoCE implemtation. If you can give me any clues about what you are seeing I am happy to help
>>> try to figure this out.
>>
>> Thanks for sharing your thoughts. I myself do not have srp driver knowledge, and
>> not sure what clue I should provide. If you have any idea of the action I can
>> take, please let me know.
>
> Hi Shinichiro and Bob,
>
> When I initially developed the SRP tests these were working reliably in
> combination with the rdma_rxe driver. Since 2017 I frequently see issues when
> running the SRP tests on top of the rdma_rxe driver, issues that I do not see
> if I run the SRP tests on top of the soft-iWARP driver (siw). How about
> changing the default for the SRP tests from rdma_rxe to siw and to let the
> RDMA community resolve the rdma_rxe issues?
>
> Thanks,
>
> Bart.
>

Bart,

I have also seen the same hangs in siw. Not as frequently but the same symptoms. About every month or so I take another run at trying to find and fix this bug but I have not succeeded yet. I haven't seen anything that looks like bad behavior from the rxe side but that doesn't prove anything. I also saw these hangs on my system before the WQ patch went in if my memory serves.
Our main application for this driver at HPE is Lustre, which is a little different than SRP but uses the same general approach with fast MRs. Currently we are finding the driver to be quite stable even under very heavy stress.

I would be happy to collaborate with someone (you?) who knows the SRP side well to resolve this hang. I think that is the quickest way to fix this. I have no idea what SRP is waiting for.

Best regards,

Bob