> On 14 Feb 2019, at 01:03, Yanjun Zhu <yanjun.zhu@xxxxxxxxxx> wrote:
>
>
> On 2019/1/23 0:15, Jason Gunthorpe wrote:
>> On Tue, Jan 22, 2019 at 02:18:21AM -0500, Zhu Yanjun wrote:
>>> When the interface related with IB device is set to down/up over and
>>> over again, the following call trace will pop out.
>>> "
>>> Call Trace:
>>>  [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
>>>  [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
>>>  [<ffffffff810a1ec0>] worker_thread+0x120/0x480
>>>  [<ffffffff810a709e>] kthread+0xce/0xf0
>>>  [<ffffffff816e9962>] ret_from_fork+0x42/0x70
>>>
>>> RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
>>> "
>>> From vmcore, we can find the following:
>>> "
>>> crash7lates> struct ib_mad_list_head ffff881fb3713400
>>> struct ib_mad_list_head {
>>>   list = {
>>>     next = 0xffff881fb3713800,
>>>     prev = 0xffff881fe01395c0
>>>   },
>>>   mad_queue = 0x0
>>> }
>>> "
>>>
>>> Before the call trace, a lot of ib_cancel_mad is sent to the sender.
>>> So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
>>> "kernel NULL pointer" error.
>>>
>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
>>>  drivers/infiniband/core/mad.c | 11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
>>> index 7870823bac47..ab5a7d1152ca 100644
>>> +++ b/drivers/infiniband/core/mad.c
>>> @@ -2250,6 +2250,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>>>          return;
>>>      }
>>> +    if (unlikely(!mad_list->mad_queue)) {
>>> +        /*
>>> +         * When the interface related with IB device is set to down/up,
>>> +         * a lot of ib_cancel_mad packets are sent to the sender. In
>>> +         * sender, the mad packets are cancelled. The receiver will
>>> +         * find mad_queue NULL. If the receiver does not test mad_queue,
>>> +         * the receiver will crash with "kernel NULL pointer" error.
>>> +         */
>> How does it become null here?
>
> Hi, Jason
>
> After upgrading IB switch to version 2.2.9-3, this problem disappears. It seems that IB switch results in this problem.

Hi Yanjun,

I would like to rephrase: after changing the fw in the IB switch (where the SA and OpenSM run), we are no longer exposed to the bug.

This very much seems to be a kernel bug. A kernel shall not crash - and that shall hold true independent of external conditions, such as fw versions in switches.

So, I have two requests for you:

1. Please see if the bug can be reproduced with an upstream kernel.

2. (Assuming yes to the above), your commit doesn't fix the problem, it only makes it less likely to trigger. If mad_list->mad_queue can become NULL asynchronously to ib_mad_recv_done() being called, it can become NULL just after you tested it to be non-NULL, right?


Thxs, Håkon

>
> Thanks,
>
> Zhu Yanjun
>
>>
>> Jason
>>
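
To make the concern in point 2 concrete, here is a minimal userspace sketch of the check-then-use pattern. Everything in it is an assumption made for illustration: struct queue_model, struct list_head_model, the lock field and the three functions are hypothetical stand-ins, not the ib_mad structures or any locking that mad.c actually has. It only shows why a bare NULL test is not sufficient if another context can clear the pointer, and what it would take for the test to be meaningful.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; NOT the kernel's ib_mad structures. */
struct queue_model {
    int completions;
};

struct list_head_model {
    pthread_mutex_t lock;       /* assumed lock shared with the cancel path */
    struct queue_model *queue;  /* plays the role of mad_list->mad_queue */
};

/* Cancel path (assumed): clears the pointer while holding the lock. */
static void cancel_model(struct list_head_model *h)
{
    assert(h != NULL);
    pthread_mutex_lock(&h->lock);
    h->queue = NULL;
    pthread_mutex_unlock(&h->lock);
}

/*
 * Racy variant: the NULL test and the dereference are two separate,
 * unsynchronized steps. If cancel_model() runs in between, the
 * dereference still hits NULL. The test only narrows the window.
 */
static int recv_done_racy(struct list_head_model *h)
{
    assert(h != NULL);
    if (!h->queue)
        return -1;
    /* <-- a concurrent cancel_model() here defeats the test above */
    h->queue->completions++;
    return 0;
}

/*
 * Serialized variant: test and use happen under the same lock the
 * cancel path takes, so the pointer cannot change in between.
 */
static int recv_done_locked(struct list_head_model *h)
{
    int ret = -1;

    assert(h != NULL);
    pthread_mutex_lock(&h->lock);
    if (h->queue) {
        h->queue->completions++;
        ret = 0;
    }
    pthread_mutex_unlock(&h->lock);
    return ret;
}
```

Run single-threaded, both variants behave the same; they differ only when cancel_model() runs concurrently. This is not a proposed fix. Whether any such serialization is appropriate depends on the answer to Jason's question of how mad_queue becomes NULL in the first place.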