On Wed, Jan 23, 2019 at 11:34:00AM +0800, Yanjun Zhu wrote:
>
> On 2019/1/23 0:15, Jason Gunthorpe wrote:
> > On Tue, Jan 22, 2019 at 02:18:21AM -0500, Zhu Yanjun wrote:
> > > When the interface related with IB device is set to down/up over and
> > > over again, the following call trace will pop out.
> > > "
> > > Call Trace:
> > >  [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
> > >  [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
> > >  [<ffffffff810a1ec0>] worker_thread+0x120/0x480
> > >  [<ffffffff810a709e>] kthread+0xce/0xf0
> > >  [<ffffffff816e9962>] ret_from_fork+0x42/0x70
> > >
> > > RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
> > > "
> > > From vmcore, we can find the following:
> > > "
> > > crash7lates> struct ib_mad_list_head ffff881fb3713400
> > > struct ib_mad_list_head {
> > >   list = {
> > >     next = 0xffff881fb3713800,
> > >     prev = 0xffff881fe01395c0
> > >   },
> > >   mad_queue = 0x0
> > > }
> > > "
> > >
> > > Before the call trace, a lot of ib_cancel_mad is sent to the sender.
> > > So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
> > > "kernel NULL pointer" error.
> > >
> > > Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
> > >  drivers/infiniband/core/mad.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> > > index 7870823bac47..ab5a7d1152ca 100644
> > > +++ b/drivers/infiniband/core/mad.c
> > > @@ -2250,6 +2250,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
> > >  		return;
> > >  	}
> > > +	if (unlikely(!mad_list->mad_queue)) {
> > > +		/*
> > > +		 * When the interface related with IB device is set to down/up,
> > > +		 * a lot of ib_cancel_mad packets are sent to the sender. In
> > > +		 * sender, the mad packets are cancelled. The receiver will
> > > +		 * find mad_queue NULL. If the receiver does not test mad_queue,
> > > +		 * the receiver will crash with "kernel NULL pointer" error.
> > > +		 */
> > How does it become null here?
> When a lot of ib_cancel_mad packets are sent, from the source code,
> ib_cancel_mad->ib_modify_mad, in ib_modify_mad,
>
> "
> mad_send_wr->status = IB_WC_WR_FLUSH_ERR
> "
> Then these ib_cancel_mad packets are sent.
>
> The receiver receives IB_WC_WR_FLUSH_ERR, it will send it to IB device to
> handle it.
>
>
> So your problem "how mad_queue becomes NULL" should occur in IB device.
>
> IB firmware or HW makes mad_queue become NULL.

It certainly doesn't

Please find out why it is NULL and report back.

Jason
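
For readers following the thread, the guard under discussion can be modelled in user space roughly as below. This is only a sketch under stated assumptions: the `model_*` types and `model_recv_done()` are hypothetical stand-ins, not the kernel's `struct ib_mad_list_head` or `ib_mad_recv_done()`, and the early return is an assumed behaviour, since the quoted hunk is cut off before the body of the `if`. It shows that such a check avoids the dereference but says nothing about how the back-pointer became NULL, which is the open question above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct model_mad_queue {
	int count;	/* outstanding work requests on this queue */
};

struct model_mad_list_head {
	struct model_mad_queue *mad_queue;	/* back-pointer; 0x0 in the vmcore */
};

/*
 * Model of a receive-completion handler with a NULL guard.
 * Returns 0 when the completion is processed, -1 when it is dropped
 * because the back-pointer is NULL (assumed behaviour, see lead-in).
 */
static int model_recv_done(struct model_mad_list_head *mad_list)
{
	if (mad_list->mad_queue == NULL)
		return -1;	/* avoids the dereference, does not explain the NULL */

	mad_list->mad_queue->count--;	/* would fault here without the guard */
	return 0;
}
```

In this model a list head whose `mad_queue` is NULL is silently dropped, so whatever cleared or never set the pointer stays undiagnosed.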