Re: [PATCHv2 1/1] RDMA/core: avoid kernel NULL pointer error

This problem is probably caused by the IB HW/FW. A similar problem appears
when the IB device is set down/up several times or when the IB HW/FW is faulty.

Developers who run into a similar problem in the future can give this patch
a try.

Zhu Yanjun

On Mon, Nov 25, 2019 at 12:14 PM Zhu Yanjun <yanjun.zhu@xxxxxxxxxx> wrote:
>
> When the interface associated with an IB device is repeatedly set
> down/up, the following call trace appears.
> "
>  Call Trace:
>   [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
>   [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
>   [<ffffffff810a1ec0>] worker_thread+0x120/0x480
>   [<ffffffff810a709e>] kthread+0xce/0xf0
>   [<ffffffff816e9962>] ret_from_fork+0x42/0x70
>
>  RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
> "
> From vmcore, we can find the following:
> "
> crash7lates> struct ib_mad_list_head ffff881fb3713400
> struct ib_mad_list_head {
>   list = {
>     next = 0xffff881fb3713800,
>     prev = 0xffff881fe01395c0
>   },
>   mad_queue = 0x0
> }
> "
>
> Before the call trace, a large number of ib_cancel_mad requests are sent
> to the sender. It is therefore necessary to check mad_queue in struct
> ib_mad_list_head to avoid a "kernel NULL pointer" error.
>
> According to a new customer report, the above call trace also appears
> when something is wrong with the IB HW/FW. It seems that faulty IB HW/FW
> can cause this problem.
>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
> ---
> V1->V2: Add new bug symptoms.
> ---
>  drivers/infiniband/core/mad.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 9947d16..43f596c 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -2279,6 +2279,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>                 return;
>         }
>
> +       if (unlikely(!mad_list->mad_queue)) {
> +               /*
> +                * When the interface related with IB device is set to down/up,
> +                * a lot of ib_cancel_mad packets are sent to the sender. In
> +                * sender, the mad packets are cancelled.  The receiver will
> +                * find mad_queue NULL. If the receiver does not test mad_queue,
> +                * the receiver will crash with "kernel NULL pointer" error.
> +                */
> +               return;
> +       }
> +
>         qp_info = mad_list->mad_queue->qp_info;
>         dequeue_mad(mad_list);
>
> --
> 2.7.4
>
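
For readers who want to see the guard in isolation, the following is a minimal
user-space sketch of the check added to ib_mad_recv_done(). It is not the
kernel code: the struct names with the _model suffix, their fields other than
mad_queue and qp_info, and the function recv_done_model() are all hypothetical
stand-ins, and decrementing a counter only approximates what dequeue_mad()
does in the real driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct qp_info_model {
	int port;
};

struct mad_queue_model {
	struct qp_info_model *qp_info;
	int count;			/* number of entries on the queue */
};

struct mad_list_head_model {
	struct mad_queue_model *mad_queue;	/* may be NULL, as in the vmcore */
};

/*
 * Models the patched receive path.
 * Returns 1 if the completion was processed, 0 if it was dropped because
 * mad_queue was NULL.
 */
static int recv_done_model(struct mad_list_head_model *mad_list)
{
	struct qp_info_model *qp_info;

	assert(mad_list != NULL);	/* the list entry itself must exist */

	/* The guard from the patch: bail out before any dereference. */
	if (!mad_list->mad_queue)
		return 0;

	/* Only reached with a valid queue, so this dereference is safe. */
	qp_info = mad_list->mad_queue->qp_info;
	(void)qp_info;

	/* Rough stand-in for dequeue_mad(): one entry leaves the queue. */
	mad_list->mad_queue->count--;
	return 1;
}
```

Calling recv_done_model() on an entry whose mad_queue is NULL returns 0
without touching the queue, while an entry with a valid queue is processed
and the queue's count drops by one. Without the check, the first dereference
of mad_list->mad_queue is where the sketch would fault.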


