Re: [PATCH 1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks"

On 10/12/23 06:49, Zhu Yanjun wrote:
> On 2023/10/12 7:12, Jason Gunthorpe wrote:
>> On Wed, Oct 11, 2023 at 01:14:16PM -0700, Bart Van Assche wrote:
>>> On 10/11/23 08:51, Jason Gunthorpe wrote:
>>>> If we revert it then rxe will probably just stop development
>>>> entirely. Daisuke's ODP work will be blocked and if Bob was able to
>>>> fix it he would have done so already. Which means Bob's ongoing work
>>>> is lost too.
>>>
>>> If Daisuke's work depends on the RXE changes then Daisuke may decide
>>> to help with the RXE changes.
>>>
>>> Introducing regressions while refactoring code is not acceptable.
>>
>> Generally, but I don't view rxe as a production part of the kernel so
>> I prefer to give time to resolve it.
>>
>>> I don't have enough spare time to help with the RXE driver.
> 
> commit 11ab7cc7ee32d6c3e16ac74c34c4bbdbf8f99292
> Author: Bart Van Assche <bvanassche@xxxxxxx>
> Date:   Tue Aug 22 09:57:07 2023 -0700
> 
>     Change the default RDMA driver from rdma_rxe to siw
> 
>     Since the siw driver is more stable than the rdma_rxe driver, change the
>     default into siw. See e.g.
> 
> https://lore.kernel.org/all/c3d1a966-b9b0-d015-38ec-86270b5045fc@xxxxxxx/.
> 
>     Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
>     Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
> 
> 
>>
>> Nor I
>>
>> Jason
> 
All,

I have spent the past several weeks trying to resolve this issue. The one thing I can say for sure
is that the failures and their rates are very sensitive to small timing changes. I totally agree with
Jason that the bug has always been there and that most of the suggested changes are just masking or
unmasking it. I have been running with all the kernel lock checking I can enable and have not seen any
warnings, so I doubt the error is a deadlock. My suspicion remains that the root cause of the hang is
a lost completion, or a timeout that fires before a late completion, leading to the death of the
transport state machine. There are surely other bugs in the driver, and they may show up in parallel
with this hang. Depending on the various changes I have tried, I see the hang anywhere from 1-2% to
30-40% of the time when running srp/002, but I have not been able to reproduce the KASAN bug yet.
Because the hang is easy to reproduce, I have focused on that.

Bob


