On Tue, Sep 22, 2020 at 02:09:51PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 06:13:48PM +0300, Dan Aloni wrote:
> > The Oops below [1], is quite rare, and occurs after awhile when kernel
> > code repeatedly tries to resolve addresses. According to my analysis the
> > work item is executed twice, and in the second time a NULL value of
> > `req->callback` triggers this Oops.
>
> Hum I think the race is rdma_addr_cancel(), process_one_req() and
> netevent_callback() running concurrently
>
> It is very narrow but it looks like netevent_callback() could cause
> the work to become running such that rdma_addr_cancel() has already
> done the list_del_init() which causes the cancel_delayed_work() to be
> skipped, and the work re-run before rdma_addr_cancel() hits its
> cancel_work_sync()

Thanks for the quick response! This 3-CPU race has really been a head
scratcher.

> Please try this:
>
> From fac94acc7a6fb4d78ddd06c51674110937442d15 Mon Sep 17 00:00:00 2001
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Date: Tue, 22 Sep 2020 13:54:17 -0300
> Subject: [PATCH] RDMA/addr: Fix race with
>  netevent_callback()/rdma_addr_cancel()

Looks good - I've run this for 11 hours now and it's stable. I think it
solved the problem.

-- 
Dan Aloni