On 2019/1/23 0:15, Jason Gunthorpe wrote:
On Tue, Jan 22, 2019 at 02:18:21AM -0500, Zhu Yanjun wrote:
When the interface associated with an IB device is repeatedly set down and
up, the following call trace appears.
"
Call Trace:
[<ffffffffa039ff8d>] ib_mad_completion_handler+0x7d/0xa0 [ib_mad]
[<ffffffff810a1a41>] process_one_work+0x151/0x4b0
[<ffffffff810a1ec0>] worker_thread+0x120/0x480
[<ffffffff810a709e>] kthread+0xce/0xf0
[<ffffffff816e9962>] ret_from_fork+0x42/0x70
RIP [<ffffffffa039f926>] ib_mad_recv_done_handler+0x26/0x610 [ib_mad]
"
From the vmcore, we can see the following:
"
crash7lates> struct ib_mad_list_head ffff881fb3713400
struct ib_mad_list_head {
list = {
next = 0xffff881fb3713800,
prev = 0xffff881fe01395c0
},
mad_queue = 0x0
}
"
Before the call trace appears, many ib_cancel_mad requests are sent to the
sender. It is therefore necessary to check mad_queue in struct
ib_mad_list_head to avoid a "kernel NULL pointer" error.
Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
drivers/infiniband/core/mad.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 7870823bac47..ab5a7d1152ca 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2250,6 +2250,17 @@ static void ib_mad_recv_done(struct ib_cq *cq, struct ib_wc *wc)
return;
}
+ if (unlikely(!mad_list->mad_queue)) {
+ /*
+ * When the interface associated with the IB device is set
+ * down/up, many ib_cancel_mad requests are sent to the sender,
+ * which cancels the MAD packets. The receiver then finds
+ * mad_queue NULL. If the receiver does not test mad_queue,
+ * it crashes with a "kernel NULL pointer" error.
+ */
How does it become null here?
When many ib_cancel_mad packets are sent, the source code follows the path
ib_cancel_mad->ib_modify_mad, and ib_modify_mad sets:
"
mad_send_wr->status = IB_WC_WR_FLUSH_ERR
"
Then these ib_cancel_mad packets are sent.
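To make the ib_cancel_mad->ib_modify_mad path above concrete, here is a small userspace model of it. This is only a sketch: every model_* name is hypothetical, the structs are stand-ins rather than the kernel definitions, and the "cancel is a modify with a zero timeout" shape is my reading of mad.c, not something verified against this exact tree.

```c
#include <stddef.h>

/* Stand-in for the work completion status; not the kernel enum. */
enum model_wc_status {
	MODEL_WC_SUCCESS,
	MODEL_WC_WR_FLUSH_ERR
};

/* Stand-in for the send work request tracked by the MAD layer. */
struct model_send_wr {
	enum model_wc_status status;
	int timeout_ms;
};

/*
 * Models the assignment quoted above: when the request is modified
 * with a zero timeout (assumed to mean "cancel"), its status is set
 * to the flush error so it completes as flushed.
 */
static int model_modify_mad(struct model_send_wr *wr, int timeout_ms)
{
	if (!wr)
		return -1;

	wr->timeout_ms = timeout_ms;
	if (timeout_ms == 0)
		wr->status = MODEL_WC_WR_FLUSH_ERR;
	return 0;
}

/* Cancel is modelled as "modify with timeout 0". */
static int model_cancel_mad(struct model_send_wr *wr)
{
	return model_modify_mad(wr, 0);
}
```

Calling model_cancel_mad() on a request whose status is MODEL_WC_SUCCESS leaves it with status MODEL_WC_WR_FLUSH_ERR, which is the state the receive side later has to cope with.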
When the receiver gets IB_WC_WR_FLUSH_ERR, it passes it to the IB device
to handle.
So the answer to your question, "how does mad_queue become NULL", should lie
in the IB device: the IB firmware or HW makes mad_queue become NULL.
The driver's error handlers should handle this case to avoid a kernel crash.
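As an illustration of the kind of defensive handling meant here, the following userspace sketch models a receive completion handler that drops a completion whose mad_queue is NULL instead of dereferencing it. The model_* names and the return codes are hypothetical; only the mad_queue field name comes from struct ib_mad_list_head, and the real handler does far more than this.

```c
#include <stddef.h>

/* Stand-in for the MAD queue; only a counter is modelled. */
struct model_mad_queue {
	int count;
};

/* Stand-in for struct ib_mad_list_head, reduced to the one field at issue. */
struct model_mad_list_head {
	struct model_mad_queue *mad_queue;
};

enum model_result {
	MODEL_HANDLED,
	MODEL_DROPPED
};

/*
 * Models the guard in the patch: test mad_queue before using it.
 * Without the check, the count-- below would dereference NULL.
 */
static enum model_result model_recv_done(struct model_mad_list_head *mad_list)
{
	if (!mad_list || !mad_list->mad_queue)
		return MODEL_DROPPED;

	mad_list->mad_queue->count--;
	return MODEL_HANDLED;
}
```

With a list head like the one in the vmcore (mad_queue = 0x0), model_recv_done() returns MODEL_DROPPED rather than crashing. This only hides the symptom, so it does not by itself explain where mad_queue was cleared.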
If you need ibstat or lspci information, please let me know.
Zhu Yanjun
Jason