Re: [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error

On Sun, Nov 24, 2019 at 11:24:35PM -0500, Zhu Yanjun wrote:
> When the interface related with IB device is set to down/up over and
> over again, the following call trace will pop out.
> "
>  Call Trace:
>   [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
>   [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
>   [<ffffffff810a1ec0>] worker_thread+0x120/0x480
>   [<ffffffff810a709e>] kthread+0xce/0xf0
>   [<ffffffff816e9962>] ret_from_fork+0x42/0x70
> 
>  RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
> "
> From vmcore, we can find the following:
> "
> crash7lates> struct ib_mad_list_head ffff881fb3713400
> struct ib_mad_list_head {
>   list = {
>     next = 0xffff881fb3713800,
>     prev = 0xffff881fe01395c0
>   },
>   mad_queue = 0x0
> }
> "
> 
> Before the call trace, a lot of ib_cancel_mad is sent to the sender.
> So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
> "kernel NULL pointer" error.
> 
> From the new customer report, when there is something wrong with IB HW/FW,
> the above call trace will appear. It seems that bad IB HW/FW will cause
> this problem.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
> V1->V2: Add new bug symptoms.
>  drivers/infiniband/core/mad.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 9947d16..43f596c 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -2279,6 +2279,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>  		return;
>  	}
>  
> +	if (unlikely(!mad_list->mad_queue)) {
> +		/*
> +		 * When the interface related with IB device is set to down/up,
> +		 * a lot of ib_cancel_mad packets are sent to the sender. In
> +		 * sender, the mad packets are cancelled.  The receiver will
> +		 * find mad_queue NULL. If the receiver does not test mad_queue,
> +		 * the receiver will crash with "kernel NULL pointer" error.
> +		 */
> +		return;
> +	}

I feel like this patch was sent already? 

It is not possible for mad_queue to be NULL here without another bug,
so this can't be the right fix.

This is because:

		mad_priv->header.mad_list.mad_queue = recv_queue;
		mad_priv->header.mad_list.cqe.done = ib_mad_recv_done;
		recv_wr.wr_cqe = &mad_priv->header.mad_list.cqe;

And then we do

	struct ib_mad_list_head *mad_list =
		container_of(wc->wr_cqe, struct ib_mad_list_head, cqe);

So there is no point where the mad_list could be legitimately NULL'd
before getting here. Something else must be happening; you must figure
out and describe how the NULL is happening.
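
To make the argument concrete, here is a minimal userspace sketch of the
same pattern. The demo_* names are hypothetical stand-ins, not the real
ib_mad structures, and container_of is re-implemented with offsetof for
illustration. It only shows that a pointer recovered through container_of
refers to the same object whose mad_queue was set before posting:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() (illustrative). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, hypothetical stand-ins for the real structures. */
struct demo_cqe {
	void (*done)(struct demo_cqe *cqe);
};

struct demo_queue {
	int completions;
};

struct demo_mad_list_head {
	struct demo_queue *mad_queue;
	struct demo_cqe cqe;
};

/* Completion side: recover the list head from the cqe pointer. */
static void demo_recv_done(struct demo_cqe *cqe)
{
	struct demo_mad_list_head *mad_list =
		container_of(cqe, struct demo_mad_list_head, cqe);

	/*
	 * mad_queue was set before the cqe was handed out, so a NULL
	 * here means something else cleared or corrupted it.
	 */
	assert(mad_list->mad_queue != NULL);
	mad_list->mad_queue->completions++;
}

/* Post side: mad_queue and cqe.done are filled in before posting. */
static struct demo_cqe *demo_post_recv(struct demo_mad_list_head *mad_list,
				       struct demo_queue *recv_queue)
{
	mad_list->mad_queue = recv_queue;
	mad_list->cqe.done = demo_recv_done;
	return &mad_list->cqe;
}
```

In this sketch the only way for the assert to fire is for some other code
to overwrite mad_queue (or the memory holding it) between the post and the
completion, which is the kind of bug that needs to be tracked down.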

Jason


