Re: RDMA/addr: NULL dereference in process_one_req

Dan Aloni <dan@xxxxxxxxxxxx> · Wed, 23 Sep 2020 07:45:53 +0300

On Tue, Sep 22, 2020 at 02:09:51PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 06:13:48PM +0300, Dan Aloni wrote:
> > The Oops below [1] is quite rare, and occurs after a while when kernel
> > code repeatedly tries to resolve addresses. According to my analysis the
> > work item is executed twice, and the second time a NULL value of
> > `req->callback` triggers this Oops.
> 
> Hum, I think the race is rdma_addr_cancel(), process_one_req() and
> netevent_callback() running concurrently.
> 
> It is very narrow but it looks like netevent_callback() could cause
> the work to become running such that rdma_addr_cancel() has already
> done the list_del_init() which causes the cancel_delayed_work() to be
> skipped, and the work re-run before rdma_addr_cancel() hits its
> cancel_work_sync()

Thanks for the quick response! This 3-CPU race has really been a
head-scratcher.

> Please try this:
> 
> From fac94acc7a6fb4d78ddd06c51674110937442d15 Mon Sep 17 00:00:00 2001
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Date: Tue, 22 Sep 2020 13:54:17 -0300
> Subject: [PATCH] RDMA/addr: Fix race with
>  netevent_callback()/rdma_addr_cancel()

Looks good - I've run this for 11 hours now and it's stable. I think it
solved the problem.

-- 
Dan Aloni