Re: [PATCH 1/1] RDMA/core: avoid kernel NULL pointer error

On Wed, Jan 23, 2019 at 11:34:00AM +0800, Yanjun Zhu wrote:
> 
> On 2019/1/23 0:15, Jason Gunthorpe wrote:
> > On Tue, Jan 22, 2019 at 02:18:21AM -0500, Zhu Yanjun wrote:
> > > When the interface related with IB device is set to down/up over and
> > > over again, the following call trace will pop out.
> > > "
> > >   Call Trace:
> > >    [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
> > >    [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
> > >    [<ffffffff810a1ec0>] worker_thread+0x120/0x480
> > >    [<ffffffff810a709e>] kthread+0xce/0xf0
> > >    [<ffffffff816e9962>] ret_from_fork+0x42/0x70
> > > 
> > >   RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
> > > "
> > >  From vmcore, we can find the following:
> > > "
> > > crash7lates> struct ib_mad_list_head ffff881fb3713400
> > > struct ib_mad_list_head {
> > >    list = {
> > >      next = 0xffff881fb3713800,
> > >      prev = 0xffff881fe01395c0
> > >    },
> > >    mad_queue = 0x0
> > > }
> > > "
> > > 
> > > Before the call trace, a lot of ib_cancel_mad is sent to the sender.
> > > So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
> > > "kernel NULL pointer" error.
> > > 
> > > Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
> > >   drivers/infiniband/core/mad.c | 11 +++++++++++
> > >   1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> > > index 7870823bac47..ab5a7d1152ca 100644
> > > +++ b/drivers/infiniband/core/mad.c
> > > @@ -2250,6 +2250,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
> > >   		return;
> > >   	}
> > > +	if (unlikely(!mad_list->mad_queue)) {
> > > +		/*
> > > +		 * When the interface related with IB device is set to down/up,
> > > +		 * a lot of ib_cancel_mad packets are sent to the sender. In
> > > +		 * sender, the mad packets are cancelled.  The receiver will
> > > +		 * find mad_queue NULL. If the receiver does not test mad_queue,
> > > +		 * the receiver will crash with "kernel NULL pointer" error.
> > > +		 */
> > How does it become null here?
> When many ib_cancel_mad requests are issued, the call path in the source
> is ib_cancel_mad -> ib_modify_mad, and ib_modify_mad sets:
> 
> "
> mad_send_wr->status = IB_WC_WR_FLUSH_ERR
> "
> Then these ib_cancel_mad packets are sent.
> 
> When the receiver gets IB_WC_WR_FLUSH_ERR, it passes it to the IB device
> to handle.
> 
> 
> So your question, "how does mad_queue become NULL", should be answered in
> the IB device.
> 
> IB firmware or HW makes mad_queue become NULL.

It certainly doesn't.

Please find out why it is NULL and report back.

Jason
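
For readers following the thread, here is a minimal user-space sketch of the
pointer chain the proposed check guards. It is not the kernel code: the
structs are cut-down stand-ins for the ones in drivers/infiniband/core, and
recv_done_port() is a hypothetical helper invented for illustration. It only
shows why a NULL mad_queue faults once container_of() is applied to it. As
the review above points out, such a guard hides the symptom and does not
explain how the field became NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins (assumption): the real kernel structs differ. */
struct mad_queue {
	int count;
};

struct mad_list_head {
	struct mad_list_head *next, *prev;
	struct mad_queue *mad_queue;	/* owning queue, NULL in the vmcore */
};

struct qp_info {
	int port_num;
	struct mad_queue recv_queue;	/* embedded, so container_of applies */
};

/* User-space equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Hypothetical helper: returns the port number of the QP that owns the
 * list head, or -1 when the list head has no owning queue.
 */
static int recv_done_port(const struct mad_list_head *mad_list)
{
	struct qp_info *qp;

	assert(mad_list != NULL);

	/*
	 * Without this test, container_of() on a NULL mad_queue yields a
	 * bogus pointer just below address zero, and reading through it
	 * is the "kernel NULL pointer" style fault from the call trace.
	 */
	if (!mad_list->mad_queue)
		return -1;

	qp = container_of(mad_list->mad_queue, struct qp_info, recv_queue);
	return qp->port_num;
}
```

A list head whose mad_queue points at qp.recv_queue resolves back to qp,
while one with mad_queue == NULL is rejected before any dereference.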


