On Thu, 16 Sept 2021 at 18:28, Jason Gunthorpe <jgg@xxxxxxxx> wrote: > > On Thu, Sep 16, 2021 at 04:45:27PM +0200, Dmitry Vyukov wrote: > > > Answering your question re what was running concurrently with what. > > Each of the syscalls in these programs can run up to 2 times and > > ultimately any of these calls can race with any. Potentially syzkaller > > can predict values kernel will return (e.g. id's) before kernel > > actually returned them. I guess this does not restrict search area for > > the bug a lot... > > I have a reasonable theory now.. > > Based on the ops you provided this FSM sequence is possible > > RDMA_USER_CM_CMD_RESOLVE_IP > RDMA_CM_IDLE -> RDMA_CM_ADDR_QUERY > does rdma_resolve_ip(addr_handler) > > addr_handler > RDMA_CM_ADDR_QUERY -> RDMA_CM_ADDR_BOUND > [.. handler still running ..] > > RDMA_USER_CM_CMD_RESOLVE_IP > RDMA_CM_ADDR_BOUND -> RDMA_CM_ADDR_QUERY > does rdma_resolve_ip(addr_handler) > > RDMA_DESTROY_ID > rdma_addr_cancel() > > Which, if it happens fast enough, could trigger a situation where the > '&id_priv->id.route.addr.dev_addr' "handle" is in the req_list twice > beacause the addr_handler work queue hasn't yet got to the point of > deleting it from the req_list before the the 2nd one is added. > > The issue is rdma_addr_cancel() has to be called rdma_resolve_ip() can > be called again. > > Skipping it will cause 'req_list' to have two items in the internal > linked list with the same key and it will not cancel the newest one > with the active timer. This would cause the use after free syndrome > like this trace is showing. > > I can make a patch, but have no way to know if it is any good :\ Good detective work! But if you have a theory of what happens, it's usually easy to write a reproducer that aims at triggering this exact scenario. Isn't it the case here? I would assume you need it to add as a test anyway.