On 9/19/23 23:22, Zhu Yanjun wrote:
>
> On 2023/9/20 2:11, Bob Pearson wrote:
>> On 9/19/23 03:07, Zhu Yanjun wrote:
>>> On 2023/9/19 12:14, Shinichiro Kawasaki wrote:
>>>> On Sep 16, 2023 / 13:59, Zhu Yanjun wrote:
>>>> [...]
>>>>> On Debian, with the latest multipathd or with the commit 9b4b7c1f9f54
>>>>> ("RDMA/rxe: Add workqueue support for rxe tasks") reverted, this problem
>>>>> will disappear.
>>>> Zhu, thank you for the actions.
>>>>
>>>>> On Fedora 38, if the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support
>>>>> for rxe tasks") is reverted, will this problem still appear?
>>>>> I do not have such a test environment. The commit is in the attachment,
>>>>> can anyone test it? Please let us know the test result. Thanks.
>>>> I tried the latest kernel tag v6.6-rc2 with my Fedora 38 test systems. With the
>>>> v6.6-rc2 kernel, I still see the hang. I repeated the blktests test case srp/002
>>>> 30 times or so, then the hang was recreated. Then I reverted the commit
>>>> 9b4b7c1f9f54 from v6.6-rc2, and the hang disappeared. I repeated the blktests
>>>> test case 100 times, and did not see the hang.
>>>>
>>>> I confirmed these results under two multipathd conditions: 1) with Fedora latest
>>>> device-mapper-multipath package v0.9.4, and 2) the latest multipath-tools v0.9.6
>>>> that I built from source code.
>>>>
>>>> So, when the commit gets reverted, the hang disappears as I reported for
>>>> v6.5-rcX kernels.
>>> Thanks, Shinichiro Kawasaki. Your help is appreciated.
>>>
>>> This problem is related to the following:
>>>
>>> 1). Linux distributions: Ubuntu, Debian and Fedora;
>>>
>>> 2). multipathd;
>>>
>>> 3). the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
>>>
>>> On Ubuntu, with or without the commit, this problem does not occur.
>>>
>>> On Debian, without this commit, this problem does not occur. With this commit, this problem will occur.
>>>
>>> On Fedora, without this commit, this problem does not occur.
>>> With this commit, this problem will occur.
>>>
>>> The commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks") is from Bob Pearson.
>>>
>>> Hi, Bob, do you have any comments about this problem? It seems that this commit is not compatible with blktests.
>>>
>>> Hi, Jason and Leon, please comment on this problem.
>>>
>>> Thanks a lot.
>>>
>>> Zhu Yanjun
>> My belief is that the issue is related to timing not the logical operation of the code.
>> Work queues are just kernel processes and can be scheduled (if not holding spinlocks)
>> while soft IRQs lock up the CPU until they exit. This can cause longer delays in responding
>> to ULPs. The work queue tasks for each QP are strictly single threaded which is managed by
>> the work queue framework the same as tasklets.
>
> Thanks, Bob. As you say, the workqueue can be scheduled, and this can cause longer delays in responding to ULPs.
> This will cause ULPs to hang. But the tasklet will lock up the CPU until it exits, so the tasklet will respond to
> ULPs in time.
>
> For this, there are 3 solutions:
>
> 1). Try to make the workqueue respond to ULPs in time, so that this hang is avoided. But in the kernel,
> a workqueue is meant to be scheduled, so it is difficult to avoid this longer delay.
>
> 2). Make both tasklet and workqueue work in RXE. We can make one of them the default. The user
> can choose tasklet or workqueue via a kernel module parameter or sysctl variable. This will cost a lot of time
> and effort to implement.
>
> 3). Revert the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks"). Shinichiro Kawasaki
> confirmed that this can fix this regression. The patch is in the attachment.
>
> Hi, Bob, please comment.
>
> Hi, Jason && Leon, please also comment on this.
>
> Thanks a lot.
>
>>
>> Earlier in time I have also seen the exact same hang behavior with the siw driver but not
>> recently.
>> Also I have seen sensitivity to logging changes in the hang behavior. These are
>
> This is a regression in RXE which is caused by the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
>
> We should fix it.
>
> Zhu Yanjun
>
>> indications that timing may be the cause of the issue.
>>
>> Bob

The verbs APIs do not make real time commitments. If a ULP fails because of response times
it is a problem in the ULP, not in the verbs provider.

Bob