On 10/4/23 12:44, Bart Van Assche wrote:
> On 9/30/23 23:30, Leon Romanovsky wrote:
>> On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote:
>>> On 9/26/23 15:24, Bart Van Assche wrote:
>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
>>>> index 1501120d4f52..6cd5d5a7a316 100644
>>>> --- a/drivers/infiniband/sw/rxe/rxe_task.c
>>>> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
>>>> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
>>>>
>>>>  int rxe_alloc_wq(void)
>>>>  {
>>>> -	rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
>>>> +	rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>>>>  	if (!rxe_wq)
>>>>  		return -ENOMEM;
>>>>
>>>> Thanks,
>>>>
>>>> Bart.
>>
>> <...>
>>
>>> Nevertheless this is a good hint since it seems to imply that there is a race between the requester and
>>> completer which is certainly possible.
>>
>> Bob, Bart
>>
>> Can you please send this change as a formal patch?
>> As we prefer workqueue with bad performance implementation over tasklets.
>
> Hi Bob,
>
> Do you perhaps have a preference for who posts the formal patch?
>
> Thanks,
>
> Bart.
>

Bart,

Not really. I have spent the past two weeks chasing this bug and don't
have much to report. I have never been able to reproduce your KASAN bug.
Like Zhu, I have found that the hang is always there, but its frequency
varies a lot depending on what else changes. For example, various
printk's can increase or decrease the frequency.

I spent this morning looking at flame graphs captured during the hang,
which lasts about 60 seconds before it times out and check tears down
the test. The flame graph is attached to this note. There seems to be a
lot of recursion in what I assume is some attempt at error recovery. The
recursion is probably in user space, because the symbols are not
available to perf. I am worried that there may be a stack overflow,
which could cause bad behavior.

Bob
Attachment:
perf-kernel.svg
Description: image/svg