Re: [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails

Michal Hocko <mhocko@xxxxxxxxxx> · Wed, 27 Mar 2019 09:49:12 +0100

On Tue 26-03-19 20:16:16, Andrea Arcangeli wrote:
> On Tue, Mar 26, 2019 at 09:56:43AM +0100, Michal Hocko wrote:
> > On Mon 25-03-19 18:56:35, Andrea Arcangeli wrote:
> > > MEMCG depends on the task structure not to be freed under
> > > rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
> > > mm->owner.
> > 
> > Please state the actual problem. Your cover letter mentions a race
> > condition. Please make it explicit in the changelog.
> 
> The actual problem is the task structure is freed while
> get_mem_cgroup_from_mm() holds rcu_read_lock() and dereferences
> mm->owner.
> 
> I thought the breakage of RCU is pretty clear, but we could add a
> description of the race like I did in the original thread:
> 
> https://lkml.kernel.org/r/000000000000601367057a095de4@xxxxxxxxxx
> https://lkml.kernel.org/r/20190316194222.GA29767@xxxxxxxxxx

Yes please. That really belongs in the changelog. You do not expect
people to chase long email threads or code to figure that out, right?

> > > An alternate possible fix would be to defer the delivery of the
> > > userfaultfd contexts to the monitor until after fork() is guaranteed
> > > to succeed. Such a change would require more changes because it would
> > > create a strict ordering dependency where the uffd methods would need
> > > to be called beyond the last potentially failing branch in order to be
> > > safe.
> > 
> > How much more changes are we talking about? Because ...
> 
> I haven't implemented it, but I can theorize. It should require a new
> hooking point and information being accumulated in RAM and passed from
> the current hooking point to the new hooking point and to hold off the
> delivery of such information to the uffd monitor (the fd reader),
> until the new hooking point is invoked. The new hooking point would
> need to be invoked after fork cannot fail anymore.
> 
> We already accumulate some information in RAM there, but the first
> delivery happens at a point where fork can still fail.

I am sorry but this is not really clear to me. What is the problem with
postponing the hooking point until later, and how much more data are we
talking about here?

> > > This solution, by contrast, only adds the dependency to common code
> > > to set mm->owner to NULL and to free the task struct that was pointed
> > > by mm->owner with RCU, if fork ends up failing. The userfaultfd
> > > methods can still be called anywhere during the fork runtime and the
> > > monitor will keep discarding orphaned "mm" coming from failed forks in
> > > userland.
> > 
> > ... this is adding a subtle hack that might break in the future because
> > copy_process error paths are far from trivial and quite error prone
> > IMHO. I am not opposed to the patch in principle but I would really like
> > to see what kind of solutions we are comparing here.
> 
> The rule of clearing mm->owner and then freeing the mm->owner memory
> with call_rcu is already followed everywhere else. See for example
> mm_update_next_owner() that sets mm->owner to NULL and only then
> invokes put_task_struct which frees the memory pointed by the old
> value of mm->owner using RCU.
>
> The "subtle hack" already happens at every exit when MEMCG=y. All the
> patch does is to extend the "subtle hack" to the fork failure path too,
> which didn't follow the rule: it didn't clear mm->owner and it just
> freed the task struct without waiting for an RCU grace period. In
> fact, as pointed out by Kirill Tkhai, we could reuse
> delayed_put_task_struct method that is already used by exit, except it
> does more than freeing the task structure and it relies on refcounters
> to be initialized so I thought the free_task -> call_rcu( free_task)
> conversion was simpler and more obviously safe. Sharing the other
> method only looked like a complication that requires syncing up the
> refcounts.
> 
> I think the only conceptual simplification possible would be again to
> add a new hooking point and more buildup of information until fork
> cannot fail, but in implementation terms I doubt the fix will become
> smaller or simpler that way.

Well, in general I prefer the code to be memcg neutral as much as
possible. We might have this subtle dependency with memcg now but this
is not specific to memcg in general. Therefore, if there is a way to
make a userfault specific fix then I would prefer it. If that is not
feasible then fair enough.

JFYI, getting rid of mm->owner is a long term plan. This is just too
ugly to live. Easier said than done, unfortunately.
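
For reference, the rule under discussion (clear mm->owner first, then free
the task struct only after a grace period) can be modelled in plain
userspace C. This is only a single-threaded toy sketch, not kernel code:
struct task, struct mm, defer_free(), grace_period_elapsed(), fail_fork()
and read_owner_pid() are all hypothetical stand-ins, and the deferred list
merely imitates what call_rcu() guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel types; NOT the real definitions. */
struct task {
	int pid;
	struct task *next_deferred;	/* link in the deferred-free list */
};

struct mm {
	struct task *owner;		/* models mm->owner */
};

/* Objects that may only be released once the grace period is over. */
static struct task *deferred_head;
static int freed_count;

/* Models call_rcu(): queue the object instead of freeing it right away. */
static void defer_free(struct task *t)
{
	assert(t != NULL);
	t->next_deferred = deferred_head;
	deferred_head = t;
}

/* Models the end of a grace period: no reader can still hold a pointer. */
static void grace_period_elapsed(void)
{
	while (deferred_head) {
		struct task *t = deferred_head;

		deferred_head = t->next_deferred;
		free(t);
		freed_count++;
	}
}

/* Failed fork: unpublish the pointer first, then defer the free. */
static void fail_fork(struct mm *mm, struct task *p)
{
	if (mm->owner == p)
		mm->owner = NULL;
	defer_free(p);
}

/* Models a reader dereferencing mm->owner in a read-side section. */
static int read_owner_pid(const struct mm *mm)
{
	const struct task *t = mm->owner;

	return t ? t->pid : -1;
}
```

In this model a reader that sampled the owner pointer before fail_fork()
can keep using it until grace_period_elapsed() runs, while a reader that
samples afterwards sees NULL and never touches the dying task.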

> > > This race condition couldn't trigger if CONFIG_MEMCG was set =n at
> > > build time.
> > 
> > All the CONFIG_MEMCG is just ugly as hell. Can we reduce that please?
> > E.g. use if (IS_ENABLED(CONFIG_MEMCG)) where appropriate?
> 
> There's just one place where I could use that instead of #ifdef.

OK, I can see it now. Is there any strong reason to make the delayed
freeing conditional? Making it unconditional would spare at least part
of the ugliness.
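
To illustrate the two styles being compared, here is a rough userspace
sketch. DEMO_IS_ENABLED() is a simplified, hypothetical stand-in for the
kernel's IS_ENABLED(), and CONFIG_DEMO_MEMCG, CONFIG_DEMO_OTHER, struct
task and struct mm are made-up names for the example, not the real
definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for IS_ENABLED(): 1 if opt is defined to 1, else 0. */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_SECOND(ignored, val, ...) val
#define DEMO_IS_ENABLED(opt) DEMO_IS_ENABLED_1(opt)
#define DEMO_IS_ENABLED_1(val) DEMO_IS_ENABLED_2(DEMO_PLACEHOLDER_##val)
#define DEMO_IS_ENABLED_2(arg_or_junk) DEMO_SECOND(arg_or_junk 1, 0, 0)

#define CONFIG_DEMO_MEMCG 1	/* pretend this option is set =y */
/* CONFIG_DEMO_OTHER is deliberately left undefined (=n). */

/* Toy types; the owner field exists in every configuration here. */
struct task { int pid; };
struct mm { struct task *owner; };

/* Style 1: preprocessor guard; the body vanishes when the option is off. */
static void clear_owner_ifdef(struct mm *mm, struct task *p)
{
#ifdef CONFIG_DEMO_MEMCG
	if (mm->owner == p)
		mm->owner = NULL;
#else
	(void)mm;
	(void)p;
#endif
}

/*
 * Style 2: ordinary C condition; the branch is always parsed and
 * type-checked, and the compiler can drop it when the constant is 0.
 */
static void clear_owner_if(struct mm *mm, struct task *p)
{
	assert(mm != NULL);
	if (DEMO_IS_ENABLED(CONFIG_DEMO_MEMCG) && mm->owner == p)
		mm->owner = NULL;
}
```

The second style only compiles when every identifier used in the branch
is declared in both configurations, which is presumably why not every
#ifdef site can be converted.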

> > > +static __always_inline void mm_clear_owner(struct mm_struct *mm,
> > > +					   struct task_struct *p)
> > > +{
> > > +#ifdef CONFIG_MEMCG
> > > +	if (mm->owner == p)
> > > +		WRITE_ONCE(mm->owner, NULL);
> > > +#endif
> > 
> > How can we ever hit this warning and what does that mean?
> 
> Which warning?

A brain fart, I would have sworn that I've seen WARN_ON_ONCE. Sorry
about the confusion.

-- 
Michal Hocko
SUSE Labs