Re: [PATCH rdma-next] RDMA/odp: Fix use-after-free bug in releasing ucontext

Jason Gunthorpe <jgg@xxxxxxxxxxxx> · Wed, 5 Sep 2018 17:21:37 -0600

On Tue, Sep 04, 2018 at 03:12:28PM +0300, Leon Romanovsky wrote:
> After releasing ucontext the __mmu_notifier_release will be called
> again in exit_mmap path. However at that time the driver ucontext
> (mlx5_ib_ucontext) already will be freed and it will cause
> to use-after-free error, due to improper use of mmu_notifier API.

I can't see how this fix is right, it needs more explanation if it is.

The way it is supposed to work is that the umem *CANNOT* outlive the
ucontext.

When the umem is destroyed it deletes the mmu_notifier, and
mmu_notifier_unregister is a hard fence (it calls hlist_del_init_rcu
then synchronize_srcu).  

So once mmu_notifier_unregister() completes the notifier will not be
seen by __mmu_notifier_release..

If all umems are destroyed then the hlist inside context-mn will be
empty and __mmu_notifier_release will not touch ucontext anymore..

So, to me, this bug looks like either a umem struct survived the
ucontext, which is blatantly wrong and needs to be fixed directly..

Or, perhaps, the umem did this flow:

		owning_mm = get_task_mm(owning_process);
		if (owning_mm == NULL)
			/*
			 * The process' mm is already dead, notifier were
			 * removed already.
			 */
			goto out_put_task;
		mmu_notifier_unregister(&context->mn, owning_mm);

ie owning_mm == NULL, which skipped the unregister.. Leaving a
dangling hlist reference in the mm if it hasn't already been cleaned
up by something else. 

Since owning_mm == NULL very much does not imply that something else
cleaned up the notifier, this is clearly buggy code, and the symptom
would exactly match your trace.

Having looked at it somewhat I think the fix here is to get rid of the
completely wrong and stupid 'owning_process' thing *completely*.

Instead hold on to the actual mm directly inside the umme via mmgrab()
and mmdrop(), just like mmu_notifiers already does internally.

Same thing for the other wrong code using context->tgid.

Jason