Re: [PATCH for-rc] RDMA/cma: fix race between addr_handler and resolve_route

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Apr 06, 2020 at 07:00:46PM +0200, Håkon Bugge wrote:
> 
> 
> > On 3 Apr 2020, at 21:36, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote:
> > 
> > On Fri, Apr 03, 2020 at 09:07:10PM +0200, Håkon Bugge wrote:
> >> 
> >> 
> >>> On 3 Apr 2020, at 20:57, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote:
> >>> 
> >>> On Fri, Apr 03, 2020 at 08:43:28PM +0200, Håkon Bugge wrote:
> >>>> A syzkaller test hits a NULL pointer dereference in
> >>>> rdma_resolve_route():
> >>> 
> >>> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-next
> >>> 
> >>> This commit in 5.7 probably fixes this:
> >> 
> >> I think it will not. The mutex in 7c11910783a1 ("RDMA/ucma: Put a
> >> lock around every call to the rdma_cm layer") will not prevent
> >> addr_handler() to run concurrently with rdma_resolve_route(), right?
> > 
> > Hmm. Perhaps so. But your patch isn't nearly enough if that is the
> > case, you've only considered resolve_route, but it could run
> > concurrently with *anything*, with the usual problems.
> 
> I was about to argue my case, but looking more into the code I tend
> to agree with you more and more. I thought, at least, that the
> rdma_foo_bar() functions had atomic checks on the state and bailed
> out if the state was inappropriate, but that is not the case all the
> time.
> 
> And the code that transitions the state from FOO to BAR, letting
> others observe BAR, and then conditionally setting it back to FOO
> isn't exactly great...

Ah, I did remember it right then :)

> > Probably the simplest answer is to have ucma fail operations that are
> > not permitted while an async_handler is pending. I'm guessing the only
> > operation that would be valid is rdma_destroy_id?
>
> Yes, that would take care of this class of errors, I agree. On a
> real systems (vs. the syzkaller setup), you may receive (almost any)
> CM packet ay any time, and that concurrency we must handle.

As I understand it, a correct application has to manage the RDMA CM
state machine, so if it has launched a background resolution then it
must complete/stop that before going on to do something else?

ie this is why the kernel ULPs theoretically don't need the extra
locking in ucm?

However, I'm not sure what the state machine is supposed to look
like..

> Shall I make a v2 base on next based on this idea, or do you have
> something coming?

Sure, I have nothing :)

Also that rdma_destroy_id in addr_handler looks wrong.. ie we still
retain pointers to the rdma_cm_id it destroys inside the struct
ucma_context, don't we?

> > Until that time we are taking a 'Big Lock' approach to all concurrancy
> > problems with rdma_cm as this code is *completely* broken for
> > concurrency.
> 
> Have you considered putting the hammer inside the cm_id instead of
> the context? Would that allow more concurrency and still avoid the
> 10ish syzkaller bugs?

It is, isn't it? Here the 'struct ucma_context *' is actually
per-cm_id, confusing right?

Jason



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux