Re: [PATCH 1/1] RDMA/core: avoid kernel NULL pointer error

Håkon Bugge <haakon.bugge@xxxxxxxxxx> · Tue, 26 Feb 2019 13:32:09 +0100

> On 14 Feb 2019, at 01:03, Yanjun Zhu <yanjun.zhu@xxxxxxxxxx> wrote:
> 
> 
> On 2019/1/23 0:15, Jason Gunthorpe wrote:
>> On Tue, Jan 22, 2019 at 02:18:21AM -0500, Zhu Yanjun wrote:
>>> When the interface related with IB device is set to down/up over and
>>> over again, the following call trace will pop out.
>>> "
>>>  Call Trace:
>>>   [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
>>>   [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
>>>   [<ffffffff810a1ec0>] worker_thread+0x120/0x480
>>>   [<ffffffff810a709e>] kthread+0xce/0xf0
>>>   [<ffffffff816e9962>] ret_from_fork+0x42/0x70
>>> 
>>>  RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
>>> "
>>> From vmcore, we can find the following:
>>> "
>>> crash7lates> struct ib_mad_list_head ffff881fb3713400
>>> struct ib_mad_list_head {
>>>   list = {
>>>     next = 0xffff881fb3713800,
>>>     prev = 0xffff881fe01395c0
>>>   },
>>>   mad_queue = 0x0
>>> }
>>> "
>>> 
>>> Before the call trace, a lot of ib_cancel_mad is sent to the sender.
>>> So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
>>> "kernel NULL pointer" error.
>>> 
>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
>>>  drivers/infiniband/core/mad.c | 11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>> 
>>> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
>>> index 7870823bac47..ab5a7d1152ca 100644
>>> +++ b/drivers/infiniband/core/mad.c
>>> @@ -2250,6 +2250,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>>>  		return;
>>>  	}
>>>  
>>> +	if (unlikely(!mad_list->mad_queue)) {
>>> +		/*
>>> +		 * When the interface related with IB device is set to down/up,
>>> +		 * a lot of ib_cancel_mad packets are sent to the sender. In
>>> +		 * sender, the mad packets are cancelled.  The receiver will
>>> +		 * find mad_queue NULL. If the receiver does not test mad_queue,
>>> +		 * the receiver will crash with "kernel NULL pointer" error.
>>> +		 */
>> How does it become null here?
> 
> Hi, Jason
> 
> After upgrading the IB switch to version 2.2.9-3, this problem disappears. It seems that the IB switch caused this problem.

Hi Yanjun,

I would like to rephrase: After changing the firmware in the IB switch (where the SA and OpenSM run), we are no longer exposed to the bug.

This very much seems to be a bug. A kernel must not crash, and that must hold regardless of external conditions such as firmware versions in switches.

So, I have two requests for you:

1. Please see if the bug can be reproduced with an upstream kernel.

2. Assuming yes to the above, your commit doesn't fix the problem; it only narrows the window in which it can occur. If mad_list->mad_queue can become NULL asynchronously to ib_mad_recv_done() being called, it can become NULL just after you tested it to be non-NULL, right?
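
To illustrate the point in item 2, here is a minimal userspace sketch, not kernel code and not a proposed fix for mad.c. It assumes the pointer is cleared by some concurrent canceller, which this thread has not established. All names (demo_queue, demo_list_head, demo_cancel, demo_recv_done) are hypothetical stand-ins. A bare NULL test leaves a gap between the check and the dereference; the check only means something if the check and the use sit under the same lock the clearing side takes:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's definitions. */
struct demo_queue {
	int count;			/* completions handled so far */
};

struct demo_list_head {
	pthread_mutex_t lock;		/* guards mad_queue */
	struct demo_queue *mad_queue;	/* may be cleared by a canceller */
};

/* Returns 0 on success, or the pthread_mutex_init() error code. */
static int demo_init(struct demo_list_head *h, struct demo_queue *q)
{
	assert(h != NULL && q != NULL);
	h->mad_queue = q;
	return pthread_mutex_init(&h->lock, NULL);
}

/* Assumed canceller: detaches the queue while holding the lock. */
static void demo_cancel(struct demo_list_head *h)
{
	pthread_mutex_lock(&h->lock);
	h->mad_queue = NULL;
	pthread_mutex_unlock(&h->lock);
}

/*
 * Completion path: the NULL check and the dereference happen inside
 * one critical section, so demo_cancel() cannot clear the pointer
 * between them. Returns 0 if handled, -1 if the queue was gone.
 */
static int demo_recv_done(struct demo_list_head *h)
{
	int ret = -1;

	pthread_mutex_lock(&h->lock);
	if (h->mad_queue) {
		h->mad_queue->count++;
		ret = 0;
	}
	pthread_mutex_unlock(&h->lock);
	return ret;
}
```

Calling demo_recv_done() before demo_cancel() returns 0 and bumps the counter; calling it afterwards returns -1 and touches nothing. Note that the lock only orders the check against the clearing; it says nothing about the lifetime of the queue object itself, and which lock (if any) is appropriate in the real code depends on how mad_queue actually becomes NULL.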

Thanks, Håkon

> 
> Thanks,
> 
> Zhu Yanjun
> 
>> 
>> Jason
>>