On Tue, Sep 22, 2020 at 06:13:48PM +0300, Dan Aloni wrote:
> The Oops below [1] is quite rare, and occurs after a while when kernel
> code repeatedly tries to resolve addresses. According to my analysis,
> the work item is executed twice, and the second time a NULL value of
> `req->callback` triggers this Oops.

Hum, I think the race is rdma_addr_cancel(), process_one_req(), and
netevent_callback() running concurrently.

It is very narrow, but it looks like netevent_callback() could requeue
the work after rdma_addr_cancel() has already done the list_del_init(),
which causes the cancel_delayed_work() to be skipped and the work to
re-run before rdma_addr_cancel() hits its cancel_work_sync().

Please try this:

From fac94acc7a6fb4d78ddd06c51674110937442d15 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@xxxxxxxxxx>
Date: Tue, 22 Sep 2020 13:54:17 -0300
Subject: [PATCH] RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel()

This three thread race can result in the work being run once the
callback becomes NULL:

       CPU1                 CPU2                   CPU3
 netevent_callback()
                     process_one_req()       rdma_addr_cancel()
                      [..]
  spin_lock_bh()
   set_timeout()
  spin_unlock_bh()

                                              spin_lock_bh()
                                              list_del_init(&req->list);
                                              spin_unlock_bh()

                     req->callback = NULL
                     spin_lock_bh()
                       if (!list_empty(&req->list))
                            // Skipped!
                            // cancel_delayed_work(&req->work);
                     spin_unlock_bh()

                     process_one_req() // again
                      req->callback() // BOOM
                                              cancel_delayed_work_sync()

The solution is to always cancel the work once it is completed so any
in-between set_timeout() does not result in it running again.

Fixes: 44e75052bc2a ("RDMA/rdma_cm: Make rdma_addr_cancel into a fence")
Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
---
 drivers/infiniband/core/addr.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 3a98439bba832b..0abce004a9591e 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -647,13 +647,12 @@ static void process_one_req(struct work_struct *_work)
 	req->callback = NULL;
 
 	spin_lock_bh(&lock);
+	/*
+	 * Although the work will normally have been canceled by the workqueue,
+	 * it can still be requeued as long as it is on the req_list.
+	 */
+	cancel_delayed_work(&req->work);
 	if (!list_empty(&req->list)) {
-		/*
-		 * Although the work will normally have been canceled by the
-		 * workqueue, it can still be requeued as long as it is on the
-		 * req_list.
-		 */
-		cancel_delayed_work(&req->work);
 		list_del_init(&req->list);
 		kfree(req);
 	}
-- 
2.28.0
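
If it helps to see why hoisting the cancel out of the list check is
enough, here is a small userspace sketch (names and bools are
illustrative only, not the kernel code) that deterministically replays
the end state of the interleaving above:

/*
 * Hypothetical userspace analogue of the race -- not the kernel code.
 * The two bools stand in for list membership and the queued delayed
 * work; the tails mirror the end of process_one_req() before and
 * after this patch.
 */
#include <stdbool.h>
#include <stdio.h>

struct req {
	bool on_list;     /* models !list_empty(&req->list) */
	bool work_queued; /* models the delayed work being queued */
};

/* Tail of process_one_req() before the fix: the cancel is guarded by
 * the list check, so a req already unlinked by rdma_addr_cancel()
 * keeps its requeued work pending. */
static void tail_buggy(struct req *r)
{
	if (r->on_list) {
		r->work_queued = false; /* cancel_delayed_work() */
		r->on_list = false;     /* list_del_init() + kfree() */
	}
}

/* Tail with this patch applied: cancel unconditionally, then unlink. */
static void tail_fixed(struct req *r)
{
	r->work_queued = false;         /* cancel_delayed_work() */
	if (r->on_list)
		r->on_list = false;     /* list_del_init() + kfree() */
}

int main(void)
{
	/*
	 * State when CPU2 reaches the tail in the diagram above: CPU1's
	 * netevent_callback() has requeued the work via set_timeout(),
	 * CPU3's rdma_addr_cancel() has already done list_del_init(),
	 * and req->callback has been set to NULL.
	 */
	struct req buggy = { .on_list = false, .work_queued = true };
	struct req fixed = buggy;

	tail_buggy(&buggy);
	tail_fixed(&fixed);

	printf("buggy tail: work_queued=%d -> re-runs with NULL callback\n",
	       buggy.work_queued);
	printf("fixed tail: work_queued=%d -> nothing left to run\n",
	       fixed.work_queued);
	return 0;
}

The buggy tail leaves the requeued work pending, which is exactly the
second process_one_req() run that dereferences the NULL callback; the
fixed tail cancels it regardless of who emptied the list.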