When the interface associated with an IB device is repeatedly set
down and up, the following call trace appears:
"
Call Trace:
 [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
 [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
 [<ffffffff810a1ec0>] worker_thread+0x120/0x480
 [<ffffffff810a709e>] kthread+0xce/0xf0
 [<ffffffff816e9962>] ret_from_fork+0x42/0x70
RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
"
From the vmcore, we can see the following:
"
crash7lates> struct ib_mad_list_head ffff881fb3713400
struct ib_mad_list_head {
  list = {
    next = 0xffff881fb3713800,
    prev = 0xffff881fe01395c0
  },
  mad_queue = 0x0
}
"
Before the call trace appears, a lot of ib_cancel_mad requests are sent
to the sender. As the dump shows, mad_queue in struct ib_mad_list_head
is NULL at that point, so it is necessary to check mad_queue before
dereferencing it to avoid the "kernel NULL pointer" error.

According to a new customer report, the above call trace also appears
when something is wrong with the IB HW/FW. It seems that faulty IB
HW/FW can cause this problem as well.

Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
---
V1->V2: Add new bug symptoms.
---
 drivers/infiniband/core/mad.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 9947d16..43f596c 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2279,6 +2279,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}
 
+	if (unlikely(!mad_list->mad_queue)) {
+		/*
+		 * When the interface related with IB device is set to down/up,
+		 * a lot of ib_cancel_mad packets are sent to the sender. In
+		 * sender, the mad packets are cancelled. The receiver will
+		 * find mad_queue NULL. If the receiver does not test mad_queue,
+		 * the receiver will crash with "kernel NULL pointer" error.
+		 */
+		return;
+	}
+
 	qp_info = mad_list->mad_queue->qp_info;
 	dequeue_mad(mad_list);
 
--
2.7.4
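
For readers who want to see the guard in isolation, here is a minimal
userspace sketch of the check the hunk above adds. It is not kernel code.
All the *_model types and recv_done_model() are hypothetical stand-ins
invented for illustration. Clearing mad_queue after processing is only an
assumption used to reproduce the "mad_queue = 0x0" state seen in the
vmcore. It is not a claim about what dequeue_mad() does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (illustration only). */
struct qp_info_model {
	int port;
};

struct mad_queue_model {
	struct qp_info_model *qp_info;
	int count;			/* entries currently queued */
};

struct mad_list_model {
	struct mad_queue_model *mad_queue;	/* NULL models the vmcore state */
};

/*
 * Model of the receive-completion path with the added guard.
 * Returns 0 if the completion was processed, -1 if it was skipped
 * because mad_queue was NULL.
 */
static int recv_done_model(struct mad_list_model *mad_list)
{
	struct qp_info_model *qp_info;

	/* The guard: bail out instead of dereferencing a NULL mad_queue. */
	if (!mad_list->mad_queue)
		return -1;

	qp_info = mad_list->mad_queue->qp_info;
	(void)qp_info;			/* the real handler would use it */

	/* Assumption for this model only: detach the entry after use. */
	mad_list->mad_queue->count--;
	mad_list->mad_queue = NULL;
	return 0;
}
```

In this model, a second completion for the same entry returns -1 instead
of faulting, which is the behavior the patch aims for in the receive path.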