Re: [bug report] blktests srp/002 hang

On 8/22/23 10:20, Bart Van Assche wrote:
> On 8/22/23 03:18, Shinichiro Kawasaki wrote:
>> CC+: Bart,
>>
>> On Aug 21, 2023 / 20:46, Bob Pearson wrote:
>> [...]
>>> Shinichiro,
>>
>> Hello Bob, thanks for the response.
>>
>>>
>>> I have been aware for a long time that there is a problem with blktests/srp. I see hangs in
>>> 002 and 011 fairly often.
>>
>> I repeated the test case srp/011 and observed that it hangs. This hang at srp/011
>> can also be recreated reliably. I reverted the commit 9b4b7c1f9f54
>> and then observed that the srp/011 hang disappeared. So, I guess these two hangs have
>> the same root cause.
>>
>>> I have not been able to figure out the root cause but suspect that
>>> there is a timing issue in the srp drivers which cannot handle the slowness of the software
>>> RoCE implementation. If you can give me any clues about what you are seeing I am happy to help
>>> try to figure this out.
>>
>> Thanks for sharing your thoughts. I myself do not have srp driver knowledge, and
>> am not sure what clues I should provide. If you have any idea of the action I can
>> take, please let me know.
> 
> Hi Shinichiro and Bob,
> 
> When I initially developed the SRP tests these were working reliably in
> combination with the rdma_rxe driver. Since 2017 I frequently see issues when
> running the SRP tests on top of the rdma_rxe driver, issues that I do not see
> if I run the SRP tests on top of the soft-iWARP driver (siw). How about
> changing the default for the SRP tests from rdma_rxe to siw and to let the
> RDMA community resolve the rdma_rxe issues?
> 
> Thanks,
> 
> Bart.
> 

Bart,

I have also seen the same hangs with siw, less frequently but with the same symptoms.
About every month or so I take another run at trying to find and fix this bug, but
I have not succeeded yet. I haven't seen anything that looks like bad behavior from
the rxe side, but that doesn't prove anything. If my memory serves, I also saw these
hangs on my system before the WQ patch went in. Our main application for this
driver at HPE is Lustre, which is a little different from SRP but uses the same
general approach with fast MRs. Currently we are finding the driver to be quite stable
even under very heavy stress.

I would be happy to collaborate with someone (you?) who knows the SRP side well to resolve
this hang. I think that is the quickest way to fix this. I have no idea what SRP is waiting for.

Best regards,

Bob 


