Re: [PATCH 1/1] RDMA/core: avoid kernel NULL pointer error

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2019/1/23 12:58, Jason Gunthorpe wrote:
On Wed, Jan 23, 2019 at 11:34:00AM +0800, Yanjun Zhu wrote:
On 2019/1/23 0:15, Jason Gunthorpe wrote:
On Tue, Jan 22, 2019 at 02:18:21AM -0500, Zhu Yanjun wrote:
When the interface related with IB device is set to down/up over and
over again, the following call trace will pop out.
"
   Call Trace:
    [<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
    [<ffffffff810a1a41>] process_one_work+0x151/0x4b0
    [<ffffffff810a1ec0>] worker_thread+0x120/0x480
    [<ffffffff810a709e>] kthread+0xce/0xf0
    [<ffffffff816e9962>] ret_from_fork+0x42/0x70

   RIP  [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
"
  From vmcore, we can find the following:
"
crash7lates> struct ib_mad_list_head ffff881fb3713400
struct ib_mad_list_head {
    list = {
      next = 0xffff881fb3713800,
      prev = 0xffff881fe01395c0
    },
    mad_queue = 0x0
}
"

Before the call trace, a lot of ib_cancel_mad is sent to the sender.
So it is necessary to check mad_queue in struct ib_mad_list_head to avoid
"kernel NULL pointer" error.

Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
   drivers/infiniband/core/mad.c | 11 +++++++++++
   1 file changed, 11 insertions(+)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 7870823bac47..ab5a7d1152ca 100644
+++ b/drivers/infiniband/core/mad.c
@@ -2250,6 +2250,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
   		return;
   	}
+	if (unlikely(!mad_list->mad_queue)) {
+		/*
+		 * When the interface related with IB device is set to down/up,
+		 * a lot of ib_cancel_mad packets are sent to the sender. In
+		 * sender, the mad packets are cancelled.  The receiver will
+		 * find mad_queue NULL. If the receiver does not test mad_queue,
+		 * the receiver will crash with "kernel NULL pointer" error.
+		 */
How does it become null here?
When a lot of ib_cancel_mad packets are sent, from the source code,
ib_cancel_mad->ib_modify_mad, in ib_modify_mad,

"
mad_send_wr->status = IB_WC_WR_FLUSH_ERR
"
Then these ib_cancel_mad packets are sent.

The receiver receives IB_WC_WR_FLUSH_ERR, it will send it to IB device to
handle it.


So your problem "how mad_queue becomes NULL" should occur in IB device.

IB firmware or HW makes mad_queue become NULL.
It certainly doesn't
From the source code,

int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
{
        struct ib_mad_qp_info *qp_info;
        struct list_head *list;
        struct ib_send_wr *bad_send_wr;
        struct ib_mad_agent *mad_agent;
        struct ib_sge *sge;
        unsigned long flags;
        int ret;

        /* Set WR ID to find mad_send_wr upon completion */
        qp_info = mad_send_wr->mad_agent_priv->qp_info;
        mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list;
        mad_send_wr->mad_list.mad_queue = &qp_info->send_queue;                    <----mad_queue should be the send_queue in sender.

        mad_agent = mad_send_wr->send_buf.mad_agent;
        sge = mad_send_wr->sg_list;
        sge[0].addr = ib_dma_map_single(mad_agent->device,
mad_send_wr->send_buf.mad,
                                        sge[0].length,
                                        DMA_TO_DEVICE);

struct ib_mad_qp_info {
        struct ib_mad_port_private *port_priv;
        struct ib_qp *qp;
        struct ib_mad_queue send_queue;    <---send_queue is not a pointer.
        struct ib_mad_queue recv_queue;
        struct list_head overflow_list;
        spinlock_t snoop_lock;
        struct ib_mad_snoop_private **snoop_table;
        int snoop_table_size;
        atomic_t snoop_count;
};

only in static int ib_mad_port_close(struct ib_device *device, int port_num)

This qp_info will be destroyed.

ib_mad_port_close is only called in mad_client remove.

From the source code,

static void __exit ib_mad_cleanup_module(void)
{
        ib_unregister_client(&mad_client);  <---This will remove mad_client.
        kmem_cache_destroy(ib_mad_cache);
}

module_init(ib_mad_init_module);
module_exit(ib_mad_cleanup_module);
This only occurs when mad modules are removed from the kernel.

But from vmcore, the mad functions can be found in it. So mad module is still in it.

ib_mad_cleanup_module is not executed. So qp_info->send_queue should not be NULL.

From the driver, mad_queue should not be set to NULL.

This mad_queue is sent to IB device. And IB_WC_WR_FLUSH_ERR is also sent to IB device.

So it is very possible that mad_queue is set to NULL in IB device.

If I am missing something, please feel free to let me know.

Zhu Yanjun

Please find out why it is NULL and report back.

Jason




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux