On 10/12/23 06:49, Zhu Yanjun wrote:
> On 2023/10/12 7:12, Jason Gunthorpe wrote:
>> On Wed, Oct 11, 2023 at 01:14:16PM -0700, Bart Van Assche wrote:
>>> On 10/11/23 08:51, Jason Gunthorpe wrote:
>>>> If we revert it then rxe will probably just stop development
>>>> entirely. Daisuke's ODP work will be blocked and if Bob was able to
>>>> fix it he would have done so already. Which mean's Bobs ongoing work
>>>> is lost too.
>>>
>>> If Daisuke's work depends on the RXE changes then Daisuke may decide
>>> to help with the RXE changes.
>>>
>>> Introducing regressions while refactoring code is not acceptable.
>>
>> Generally, but I don't view rxe as a production part of the kernel so
>> I prefer to give time to resolve it.
>>
>>> I don't have enough spare time to help with the RXE driver.
>
> commit 11ab7cc7ee32d6c3e16ac74c34c4bbdbf8f99292
> Author: Bart Van Assche <bvanassche@xxxxxxx>
> Date:   Tue Aug 22 09:57:07 2023 -0700
>
>     Change the default RDMA driver from rdma_rxe to siw
>
>     Since the siw driver is more stable than the rdma_rxe driver, change the
>     default into siw. See e.g.
>
> https://lore.kernel.org/all/c3d1a966-b9b0-d015-38ec-86270b5045fc@xxxxxxx/.
>
>     Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
>     Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
>
>
>>
>> Nor I
>>
>> Jason
>

All,

I have spent the past several weeks trying to resolve this issue. The
one thing I can say for sure is that the failures, or their rates, are
very sensitive to small timing changes. I totally agree with Jason that
the bug has always been there and that most of the suggested changes are
just masking or unmasking it. I have been running with all the kernel
lock checking I can enable and have not seen any warnings, so I doubt
the error is a deadlock. My suspicion remains that the root cause of the
hang is the loss of a completion, or a timeout firing before a late
completion, leading to the death of the transport state machine.
There are surely other bugs in the driver, and they may show up in
parallel with this hang. I see the hang consistently when running
srp/002, anywhere from 1-2% to 30-40% of the time depending on the
various changes I have tried, but I have not been able to reproduce the
KASAN bug yet. Because the hang is easy to reproduce, I have focused on
that.

Bob