Re: [bug report] blktests srp/002 hang

On 9/19/23 23:22, Zhu Yanjun wrote:
> 
> On 2023/9/20 2:11, Bob Pearson wrote:
>> On 9/19/23 03:07, Zhu Yanjun wrote:
>>> On 2023/9/19 12:14, Shinichiro Kawasaki wrote:
>>>> On Sep 16, 2023 / 13:59, Zhu Yanjun wrote:
>>>> [...]
>>>>> On Debian, with the latest multipathd, or with the commit 9b4b7c1f9f54
>>>>> ("RDMA/rxe: Add workqueue support for rxe tasks") reverted, this problem
>>>>> disappears.
>>>> Zhu, thank you for the actions.
>>>>
>>>>> On Fedora 38, if the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support
>>>>> for rxe tasks") is reverted, does this problem still appear?
>>>>> I do not have such a test environment. The commit is in the attachment.
>>>>> Could anyone test it? Please let us know the test result. Thanks.
>>>> I tried the latest kernel tag v6.6-rc2 with my Fedora 38 test systems. With the
>>>> v6.6-rc2 kernel, I still see the hang. I repeated the blktests test case srp/002
>>>> about 30 times, and the hang was recreated. Then I reverted the commit
>>>> 9b4b7c1f9f54 from v6.6-rc2, and the hang disappeared. I repeated the blktests
>>>> test case 100 times, and did not see the hang.
>>>>
>>>> I confirmed these results under two multipathd conditions: 1) with the latest Fedora
>>>> device-mapper-multipath package v0.9.4, and 2) the latest multipath-tools v0.9.6
>>>> that I built from source code.
>>>>
>>>> So, when the commit gets reverted, the hang disappears as I reported for
>>>> v6.5-rcX kernels.
>>> Thanks, Shinichiro Kawasaki. Your help is appreciated.
>>>
>>> This problem is related to the following:
>>>
>>> 1). Linux distributions: Ubuntu, Debian and Fedora;
>>>
>>> 2). multipathd;
>>>
>>> 3). the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
>>>
>>> On Ubuntu, with or without the commit, this problem does not occur.
>>>
>>> On Debian, without this commit, this problem does not occur. With this commit, this problem occurs.
>>>
>>> On Fedora, without this commit, this problem does not occur. With this commit, this problem occurs.
>>>
>>> The commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks") is from Bob Pearson.
>>>
>>> Hi, Bob, do you have any comments about this problem? It seems that this commit is not compatible with blktests.
>>>
>>> Hi, Jason and Leon, please comment on this problem.
>>>
>>> Thanks a lot.
>>>
>>> Zhu Yanjun
>> My belief is that the issue is related to timing, not the logical operation of the code.
>> Work queues are just kernel processes and can be scheduled (if not holding spinlocks)
>> while soft IRQs lock up the CPU until they exit. This can cause longer delays in responding
>> to ULPs. The work queue tasks for each QP are strictly single threaded which is managed by
>> the work queue framework the same as tasklets.
> 
> Thanks, Bob. As I understand you, work queue tasks can be scheduled out, and this can cause longer
> delays in responding to ULPs, which causes the ULPs to hang. A tasklet, by contrast, locks up the CPU
> until it exits, so it responds to ULPs in time.
> 
> There are 3 possible solutions:
> 
> 1). Make the workqueue respond to ULPs in time, so that this hang is avoided. But work queue tasks are
> scheduled by the kernel, so it is difficult to avoid the longer delay.
> 
> 2). Make both tasklet and workqueue work in RXE, with one of them as the default. The user can choose
> tasklet or workqueue via a kernel module parameter or a sysctl variable. This will cost a lot of time
> and effort to implement.
> 
> 3). Revert the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks"). Shinichiro Kawasaki
> confirmed that this fixes the regression. The patch is in the attachment.
> 
> 
> 
> Hi, Bob, please comment.
> 
> Hi, Jason && Leon, please also comment on this.
> 
> Thanks a lot.
> 
>>
>> Earlier I also saw the exact same hang behavior with the siw driver, but not
>> recently. Also I have seen sensitivity to logging changes in the hang behavior. These are
> 
> This is a regression in RXE which is caused by the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
> 
> We should fix it.
> 
> Zhu Yanjun
> 
>> indications that timing may be the cause of the issue.
>>
>> Bob

The verbs APIs do not make real time commitments. If a ULP fails because of response times, the
problem is in the ULP, not in the verbs provider.

Bob
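
For anyone who wants to repeat the reproduction quoted above (running srp/002 some tens of times until it hangs), a minimal sketch of a repeat-until-failure helper might look like the following. It assumes it is run as root from a blktests checkout, where `./check srp/002` is the usual invocation. The helper name `repeat_until_failure`, the run count, and the 600-second timeout are illustrative assumptions, not something taken from the thread.

```python
import subprocess


def repeat_until_failure(cmd, runs, timeout_s):
    """Run cmd up to `runs` times.

    Returns (iteration, reason) for the first run that fails or exceeds
    timeout_s, or None if every run completed successfully.
    """
    for i in range(1, runs + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the run takes longer than timeout_s seconds.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return (i, "timeout")
        if result.returncode != 0:
            return (i, "failed")
    return None


if __name__ == "__main__":
    # Assumption: the current directory is a blktests checkout.
    outcome = repeat_until_failure(["./check", "srp/002"], runs=100, timeout_s=600)
    print(outcome)
```

A timeout is only a rough stand-in for detecting the hang: if the test process is stuck in the kernel and cannot be killed, the helper itself may block, so checking the kernel log is still needed.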


