Hi All,
We will work with Red Hat on the final go-ahead.
For now this patch is on hold and not urgent.
Leon,
Hold this discussion for now.
Kashyap
On Fri, 22 Nov 2024, 18:54 Mohammad Heib, <mheib@xxxxxxxxxx> wrote:
On Sat, Nov 16, 2024 at 01:33:13PM +0530, Selvin Xavier wrote:
> On Thu, Nov 14, 2024 at 5:15 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >
> > On Thu, Nov 14, 2024 at 03:37:30PM +0530, Selvin Xavier wrote:
> > > On Thu, Nov 14, 2024 at 3:34 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Nov 12, 2024 at 03:49:56PM +0200, Mohammad Heib wrote:
> > > > > > If bnxt FW behaves unexpectedly because of a FW bug, it can send
> > > > > > completions for old cookies that have already been handled by the
> > > > > > bnxt driver. If that old cookie was associated with an old calling context,
> > > > > > the driver will try to access that caller's memory again, because the driver
> > > > > > never clears the is_waiter_alive flag after the caller successfully completes
> > > > > > waiting. This access causes the following kernel panic:
> > > > >
> > > > > Call Trace:
> > > > > <IRQ>
> > > > > ? __die+0x20/0x70
> > > > > ? page_fault_oops+0x75/0x170
> > > > > ? exc_page_fault+0xaa/0x140
> > > > > ? asm_exc_page_fault+0x22/0x30
> > > > > ? bnxt_qplib_process_qp_event.isra.0+0x20c/0x3a0 [bnxt_re]
> > > > > ? srso_return_thunk+0x5/0x5f
> > > > > ? __wake_up_common+0x78/0xa0
> > > > > ? srso_return_thunk+0x5/0x5f
> > > > > bnxt_qplib_service_creq+0x18d/0x250 [bnxt_re]
> > > > > tasklet_action_common+0xac/0x210
> > > > > handle_softirqs+0xd3/0x2b0
> > > > > __irq_exit_rcu+0x9b/0xc0
> > > > > common_interrupt+0x7f/0xa0
> > > > > </IRQ>
> > > > > <TASK>
> > > > >
> > > > > > To avoid the above unexpected behavior, clear the is_waiter_alive flag
> > > > > > every time the caller finishes waiting for a completion.
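A minimal user-space sketch of the idea described in the commit message above, not the actual bnxt_re code. Only the is_waiter_alive flag and the notion of a cookie come from the thread; the slot layout, the function names (begin_wait, end_wait, handle_completion) and the queue depth are hypothetical and chosen for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOT_COUNT 4              /* hypothetical queue depth */

struct resp { int status; };      /* hypothetical response payload */

/* Hypothetical per-cookie slot; only is_waiter_alive is named in the thread. */
struct cmd_slot {
    bool is_waiter_alive;         /* true only while a caller is blocked on this cookie */
    struct resp *resp;            /* points into the waiting caller's memory */
};

static struct cmd_slot slots[SLOT_COUNT];

/* Caller side: register a response buffer before waiting. */
void begin_wait(unsigned int cookie, struct resp *buf)
{
    struct cmd_slot *s = &slots[cookie % SLOT_COUNT];

    s->resp = buf;
    s->is_waiter_alive = true;
}

/* Caller side: clear the flag every time waiting ends, so a late or
 * duplicate completion can no longer reach the caller's memory. */
void end_wait(unsigned int cookie)
{
    struct cmd_slot *s = &slots[cookie % SLOT_COUNT];

    s->is_waiter_alive = false;
    s->resp = NULL;
}

/* Completion side: returns true if the response was delivered,
 * false if the completion was dropped as stale. */
bool handle_completion(unsigned int cookie, const struct resp *event)
{
    struct cmd_slot *s = &slots[cookie % SLOT_COUNT];

    if (!s->is_waiter_alive || s->resp == NULL)
        return false;             /* no live waiter: do not touch old memory */

    memcpy(s->resp, event, sizeof(*event));
    return true;
}
```

In this sketch a second completion for the same cookie after end_wait() is ignored instead of being copied into a buffer that may no longer exist. As noted later in the thread, such a guard may also hide the underlying reason the firmware resent the cookie.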
> Mohammad,
> We were trying to work out how this could happen. FW shouldn't be giving an old
> cookie. One possibility could be that FW crashes and we are in the recovery
> routine. Adding this check is okay, but it may be hiding some other error.
> Is it possible to share your test scripts to repro this problem? Also, can you
> share the vmcore-dmesg?
>
> Thanks
> Selvin
>
I have sent you all the needed data in a separate email.
Thanks,
>
> > > > >
> > > > > Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout")
> > > > > Signed-off-by: Mohammad Heib <mheib@xxxxxxxxxx>
> > > > > ---
> > > > > drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++--------
> > > > > 1 file changed, 8 insertions(+), 8 deletions(-)
> > > >
> > > > Selvin?
> > > Someone is confirming the fix. Will ack in a day. Thanks
> >
> > Thanks