On Sat, Nov 16, 2024 at 01:33:13PM +0530, Selvin Xavier wrote: > On Thu, Nov 14, 2024 at 5:15 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote: > > > > On Thu, Nov 14, 2024 at 03:37:30PM +0530, Selvin Xavier wrote: > > > On Thu, Nov 14, 2024 at 3:34 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote: > > > > > > > > On Tue, Nov 12, 2024 at 03:49:56PM +0200, Mohammad Heib wrote: > > > > > If bnxt FW behaves unexpectedly because of FW bug or unexpected behavior it > > > > > can send completions for old cookies that have already been handled by the > > > > > bnxt driver. If that old cookie was associated with an old calling context > > > > > the driver will try to access that caller memory again because the driver > > > > > never clean the is_waiter_alive flag after the caller successfully complete > > > > > waiting, and this access will cause the following kernel panic: > > > > > > > > > > Call Trace: > > > > > <IRQ> > > > > > ? __die+0x20/0x70 > > > > > ? page_fault_oops+0x75/0x170 > > > > > ? exc_page_fault+0xaa/0x140 > > > > > ? asm_exc_page_fault+0x22/0x30 > > > > > ? bnxt_qplib_process_qp_event.isra.0+0x20c/0x3a0 [bnxt_re] > > > > > ? srso_return_thunk+0x5/0x5f > > > > > ? __wake_up_common+0x78/0xa0 > > > > > ? srso_return_thunk+0x5/0x5f > > > > > bnxt_qplib_service_creq+0x18d/0x250 [bnxt_re] > > > > > tasklet_action_common+0xac/0x210 > > > > > handle_softirqs+0xd3/0x2b0 > > > > > __irq_exit_rcu+0x9b/0xc0 > > > > > common_interrupt+0x7f/0xa0 > > > > > </IRQ> > > > > > <TASK> > > > > > > > > > > To avoid the above unexpected behavior clear the is_waiter_alive flag > > > > > every time the caller finishes waiting for a completion. > Mohammad, > We were trying to see the possibility. FW shouldn't be giving an old > cookie. One possibility > could be if FW crashes and we are in the recovery routine. > Adding this check is okay, but may be hiding some other error. > Is it possible to share your test scripts to repro this problem? Also, > can you share > the vmcore-demsg also > > Thanks > Selvin > I have sent you all the needed data in a separate email. Thanks, > > > > > > > > > > > Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout") > > > > > Signed-off-by: Mohammad Heib <mheib@xxxxxxxxxx> > > > > > --- > > > > > drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++-------- > > > > > 1 file changed, 8 insertions(+), 8 deletions(-) > > > > > > > > Selvin? > > > Someone is confirming the fix. Will ack in a day. Thanks > > > > Thanks