On Thu, Jun 14, 2018 at 04:14:13PM -0700, Cong Wang wrote:
> On Thu, Jun 14, 2018 at 10:24 AM, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote:
> > On Thu, Jun 14, 2018 at 10:03:09AM -0700, Cong Wang wrote:
> >> On Thu, Jun 14, 2018 at 7:24 AM, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote:
> >> >
> >> > This was my brief reaction too, this code path almost certainly has a
> >> > use-after-free, and we should fix the concurrency between the two
> >> > places in some correct way..
> >>
> >> First of all, why use-after-free could trigger an imbalance unlock?
> >> IOW, why do we have to solve use-after-free to fix this imbalance
> >> unlock?
> >
> > The issue syzkaller hit is that accessing ctx->file does not seem
> > locked in any way and can race with other manipulations of ctx->file.
> >
> > So.. for this patch to be correct we need to understand how this
> > statement:
> >
> >    f = ctx->file
> >
> > Avoids f becoming a dangling pointer - and without locking, my
>
> It doesn't, because this is not the point, this is not the cause
> of the unlock imbalance either. syzbot didn't report use-after-free
> or a kernel segfault here.

No, it *is* the point - you've proposed a solution, one of many, and we
need to see an actual sensible design for how the locking around
ctx->file should work correctly. We need solutions that solve the
underlying problem, not just paper over the symptoms.

Stated another way, for a syzkaller report like this there are a few
really obvious fixes:

1) Capture the lock pointer on the stack:

     f = ctx->file;
     mutex_lock(&f->mut);
     mutex_unlock(&f->mut);

2) Prevent ctx->file from changing, eg add more locking:

     mutex_lock(&mut);
     mutex_lock(&ctx->file->mut);
     mutex_unlock(&ctx->file->mut);
     mutex_unlock(&mut);

3) Prevent ctx->file from being changed/freed by flushing the WQ at the
   right times:

     rdma_addr_cancel(...);
     ctx->file = XYZ;

This patch proposed #1. An explanation is required why that is a correct
locking design for this code. It sure looks like it isn't.
Looking at this *just a bit*, I wonder why not do something like this:

	mutex_lock(&mut);
	f = ctx->file;
	mutex_lock(&f->mutex);
	mutex_unlock(&mut);

? At least that *might* make sense. Though probably it deadlocks, as it
looks like we call rdma_addr_cancel() while holding mut. Yuk.

But maybe that sequence could be done before launching the work..

> > I'm not sure that race exists, there should be something that flushes
> > the WQ on the path to close... (though I have another email that
> > perhaps that is broken, sigh)
>
> This is not related to my patch, but to convince you, let me explain:
>
> struct ucma_file is not refcnt'ed, I know you cancel the work in
> rdma_destroy_id(), but after ucma_migrate_id() the ctx has already
> been moved to the new file, for the old file, it won't cancel the
> ctx flying with workqueue. So, I think the following use-after-free
> could happen:
>
> ucma_event_handler():
>         cur_file = ctx->file; // old file
>
>                 ucma_migrate_id():
>                         lock();
>                         list_move_tail(&ctx->list, &new_file->ctx_list);
>                         ctx->file = new_file;
>                         unlock();
>
>                 ucma_close():
>                         // retrieve old file via filp->private_data
>                         // the loop won't cover the ctx moved to the new_file
>                         kfree(file);

Yep. That sure seems like the right analysis!

> This is _not_ the cause of the unlock imbalance, and is _not_ expected
> to solve by patch either.

What do you mean? Not calling rdma_addr_cancel() prior to freeing the
file is *exactly* the cause of the lock imbalance.

The design of this code *assumes* that rdma_addr_cancel() will be called
before altering/freeing/etc any of the state it is working on;
migration makes a state change that violates that invariant.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html