On Sun, Nov 24, 2019 at 11:24:35PM -0500, Zhu Yanjun wrote:
> When the interface related with IB device is set to down/up over and
> over again, the following call trace will pop out.
> "
> Call Trace:
>  [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
>  [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
>  [<ffffffff810a1ec0>] worker_thread+0x120/0x480
>  [<ffffffff810a709e>] kthread+0xce/0xf0
>  [<ffffffff816e9962>] ret_from_fork+0x42/0x70
>
> RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
> "
> From vmcore, we can find the following:
> "
> crash7lates> struct ib_mad_list_head ffff881fb3713400
> struct ib_mad_list_head {
>   list = {
>     next = 0xffff881fb3713800,
>     prev = 0xffff881fe01395c0
>   },
>   mad_queue = 0x0
> }
> "
>
> Before the call trace, a lot of ib_cancel_mad is sent to the sender.
> So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
> "kernel NULL pointer" error.
>
> From the new customer report, when there is something wrong with IB HW/FW,
> the above call trace will appear. It seems that bad IB HW/FW will cause
> this problem.
>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
> V1->V2: Add new bug symptoms.
>  drivers/infiniband/core/mad.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 9947d16..43f596c 100644
> +++ b/drivers/infiniband/core/mad.c
> @@ -2279,6 +2279,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>  		return;
>  	}
>
> +	if (unlikely(!mad_list->mad_queue)) {
> +		/*
> +		 * When the interface related with IB device is set to down/up,
> +		 * a lot of ib_cancel_mad packets are sent to the sender. In
> +		 * sender, the mad packets are cancelled. The receiver will
> +		 * find mad_queue NULL. If the receiver does not test mad_queue,
> +		 * the receiver will crash with "kernel NULL pointer" error.
> +		 */
> +		return;
> +	}

I feel like this patch was sent already?

It is not possible for mad_queue to be NULL here without another bug,
so this can't be the right fix.

This is because:

	mad_priv->header.mad_list.mad_queue = recv_queue;
	mad_priv->header.mad_list.cqe.done = ib_mad_recv_done;
	recv_wr.wr_cqe = &mad_priv->header.mad_list.cqe;

And then we do

	struct ib_mad_list_head *mad_list =
		container_of(wc->wr_cqe, struct ib_mad_list_head, cqe);

So there is no point where the mad_list could be legitimately NULL'd
before getting here. Something else must be happening; you must figure
out and describe how the NULL is happening.

Jason
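
For illustration only, the argument above can be modelled in userspace. This is a minimal sketch, not the kernel code: the struct names (cqe, mad_queue, mad_list_head), the helper post_and_complete(), and the local container_of() macro are simplified, hypothetical stand-ins. It only shows that a pointer stored next to the embedded cqe before posting is the same pointer container_of() recovers in the completion handler, unless something else overwrites it in between.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct cqe {
	void (*done)(struct cqe *cqe);
};

struct mad_queue {
	int count;
};

struct mad_list_head {
	struct cqe cqe;               /* embedded completion entry */
	struct mad_queue *mad_queue;  /* set before the request is posted */
};

/* Records what the completion handler observed. */
static struct mad_queue *seen_queue;

/* Completion handler: recover the enclosing struct from the cqe pointer. */
static void recv_done(struct cqe *cqe)
{
	struct mad_list_head *mad_list;

	assert(cqe != NULL);
	mad_list = container_of(cqe, struct mad_list_head, cqe);
	seen_queue = mad_list->mad_queue;
}

/*
 * Mirror the posting sequence from the mail: set mad_queue, set the
 * done callback, hand out only the cqe pointer, then "complete" it.
 */
static struct mad_queue *post_and_complete(struct mad_queue *recv_queue)
{
	struct mad_list_head head;
	struct cqe *wr_cqe;

	head.mad_queue = recv_queue;
	head.cqe.done = recv_done;
	wr_cqe = &head.cqe;

	wr_cqe->done(wr_cqe);
	return seen_queue;
}
```

In this model post_and_complete(&q) returns &q for any queue q, so a NULL mad_queue in the handler would have to come from a write that happens after posting, which is the other bug that still needs to be identified.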