> On 6 Apr 2020, at 19:31, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote: > > On Mon, Apr 06, 2020 at 07:00:46PM +0200, Håkon Bugge wrote: >> >> >>> On 3 Apr 2020, at 21:36, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote: >>> >>> On Fri, Apr 03, 2020 at 09:07:10PM +0200, Håkon Bugge wrote: >>>> >>>> >>>>> On 3 Apr 2020, at 20:57, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote: >>>>> >>>>> On Fri, Apr 03, 2020 at 08:43:28PM +0200, Håkon Bugge wrote: >>>>>> A syzkaller test hits a NULL pointer dereference in >>>>>> rdma_resolve_route(): >>>>> >>>>> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-next >>>>> >>>>> This commit in 5.7 probably fixes this: >>>> >>>> I think it will not. The mutex in 7c11910783a1 ("RDMA/ucma: Put a >>>> lock around every call to the rdma_cm layer") will not prevent >>>> addr_handler() to run concurrently with rdma_resolve_route(), right? >>> >>> Hmm. Perhaps so. But your patch isn't nearly enough if that is the >>> case, you've only considered resolve_route, but it could run >>> concurrently with *anything*, with the usual problems. >> >> I was about to argue my case, but looking more into the code I tend >> to agree with you more and more. I thought, at least, that the >> rdma_foo_bar() functions had atomic checks on the state and bailed >> out if the state was inappropriate, but that is not the case all the >> time. >> >> And the code that transitions the state from FOO to BAR, letting >> others observe BAR, and then conditionally setting it back to FOO >> isn't exactly great... > > Ah, I did remember it right then :) > >>> Probably the simplest answer is to have ucma fail operations that are >>> not permitted while an async_handler is pending. I'm guessing the only >>> operation that would be valid is rdma_destroy_id? >> >> Yes, that would take care of this class of errors, I agree. On a >> real systems (vs. the syzkaller setup), you may receive (almost any) >> CM packet ay any time, and that concurrency we must handle. > > As I understand it, a correct application has to manage the RDMA CM > state machine, so if it has launched a background resolution then it > must complete/stop that before going on to do something else? > > ie this is why the kernel ULPs theoretically don't need the extra > locking in ucm? We expect kernel ULPs to behave as good citizens. For example, we expect them to get the ADDR_RESOLVED event _before_ calling resolve_route() (as opposed to this syzkaller repro). > However, I'm not sure what the state machine is supposed to look > like.. The neat state machines in Figure 13{5,6} in IBTA version 1.3. If you do not have it handy, I can send you the PDF off list. >> Shall I make a v2 base on next based on this idea, or do you have >> something coming? > > Sure, I have nothing :) > > Also that rdma_destroy_id in addr_handler looks wrong.. ie we still > retain pointers to the rdma_cm_id it destroys inside the struct > ucma_context, don't we? On entry from user-space, we use the u32 id and looks it up in the XArray. But if rdma_destoy_id() is called asynchronously called between ucma_get_ctx_dev() and the de-reference of ctx->cm_id, we are toast. >>> Until that time we are taking a 'Big Lock' approach to all concurrancy >>> problems with rdma_cm as this code is *completely* broken for >>> concurrency. >> >> Have you considered putting the hammer inside the cm_id instead of >> the context? Would that allow more concurrency and still avoid the >> 10ish syzkaller bugs? > > It is, isn't it? Here the 'struct ucma_context *' is actually > per-cm_id, confusing right? Must be, since it has a single "u32 id" ;-) Håkon