Re: [PATCH v5 3/3] cgroups: make procs file writable

Ben Blum <bblum@xxxxxxxxxxxxxx> · Mon, 27 Dec 2010 05:12:28 -0500

On Mon, Dec 27, 2010 at 06:18:01PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 27 Dec 2010 03:42:57 -0500
> Ben Blum <bblum@xxxxxxxxxxxxxx> wrote:
> 
> > On Mon, Dec 27, 2010 at 04:42:07PM +0900, KAMEZAWA Hiroyuki wrote:
> 
> > > Okay, then, problem is cpuset, which allocates memory.
> > > (But I feel that limitation -never_sleep- is not very good.)
> > > 
> > > BTW, mem cgroup chesk mm_owner(mm) == p to decide to take mmap_sem().
> > > Your code does moving "thread_group_leader()". IIUC, this isn't guaranteed to
> > > be the same. I wonder we can see problem with some stupid userland.
> > 
> > I'm not sure I understand the point of looking at mm->owner? My code
> > does thread_group_leader() because if a task that's not the leader calls
> > exec(), the thread_group list will be in an indeterminate state from the
> > perspective of the task we hold on to (who was originally the leader).
> > 
> 
> Hm. By following, only 'thread-group-leader' can go ahead.
> 
> +	threadgroup_fork_write_lock(leader);
> +	read_lock(&tasklist_lock);
> +	/* sanity check - if we raced with de_thread, we must abort */
> +	if (!thread_group_leader(leader)) {
> +		retval = -EAGAIN;
> +		read_unlock(&tasklist_lock);
> +		threadgroup_fork_write_unlock(leader);
> +		goto list_teardown;
> +	}
> 
> This moves each tasks under tasklist_lock().
> 
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> +		/* leave current thread as it is if it's already there */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* attach each task to each subsystem */
> +		for_each_subsys(root, ss) {
> +			if (ss->attach_task)
> +				ss->attach_task(cgrp, tsk);
> +		}
> +		/* we don't care whether these threads are exiting */
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> 
> Hm. I'm not sure there can be a task, tsk == mm->owner here.
> If tsk == mm->owner, attach task will take mmap_sem() and do something more.
> 
> In usual case, mm->owner = thread_group_leader, but it's not in special cases.
> 
> 
> 
> > > > Also, the tasklist_lock isn't used to synchronize fork events; that's
> > > > what threadgroup_fork_lock (an rwsem) is for. It's for protecting the
> > > > thread-group list while iterating over it - so that a race with exec()
> > > > (de_thread(), specifically) doesn't change the threadgroup state. It's
> > > > basically never safe to iterate over leader->thread_group unless you're
> > > > in a no-sleep section.
> > > > 
> > > 
> > > Thank you for explanation.
> > > 
> > > At first, cpuset should modify settings around 'mm' only when the thread == mm->owner.
> > > ....is there any reason to allow all threads can affect the 'mm' ?
> > 
> > I don't follow... is this any different from the cgroup_attach_task()
> > case?
> > 
> 
> cpuset_attach() takes mmap_sem() to modify mm's settings. Because all threads
> share mm, doing page migration whenever a thread moves seems to be wrong.
> I think only a thread, thread-group-leader or mm->owner, in process should be
> allowed to migrate pages. Then, cgroup_attach_proc() can avoid taking mmap_sem
> under tasklist_lock.

Yes, but note: cpuset_migrate_mm isn't a per-thread operation already.
By keeping it in cpuset_attach() (called once, not under tasklist_lock),
instead of putting it in cpuset_attach_task() (called many times, under
tasklist_lock), there is no problem.

> > > About nodemask, I feel what's required is pre_pre_attach() to cache required memory
> > > before attaching as radix_tree_preload(). Then, subsys can prepare for creating
> > > working area.
> > 
> > To save on global memory footprint, we could pre-allocate two nodemasks,
> > but I feel like it's not worth the increase in code complexity. This
> > would need to be done in the other cases that unsafely do NODEMASK_ALLOC
> > too... too much to keep track of for little gain.
> > 
> 
> But NODEMASK_ALLOC cannot be on stack when we consider 4096node systems.

Hence global allocation, instead of on-stack. (Also, for this particular
case, the state of the nodemasks needs to persist from pre_attach to
attach_task to attach, so it can't be static inside the function
either.)

> > > Hmm...but I wonder de_thread() should take threadgroup_fork_write_unlock().
> > > 
> > > I may not understand anything important but I feel taking tasklist_lock() is overkill.
> > 
> > Would rcu_read_lock() be any better? Everywhere else in the kernel that
> > iterates over all threads in a group uses either rcu_read_lock or
> > tasklist_lock.
> > 
> 
> Because it means that can_attach()/attach() cannot sleep, it seems to make no
> difference.

can_attach_task(), pre_attach(), and attach_task() cannot sleep, but
can_attach() and attach() may. Careful not to confuse them. :P

> I wonder.... if you stops all clone()/fork() of the proc in moving, you can use
> find_ge_pid(). Please see next_tgid() or next_tid() in fs/proc/base.c which
> implements  scanning tasklist with sleep. Can't you use next_tid() ?

If I'm not mistaken, that approach is vulnerable to an exit() race -
next_tid() may return NULL if pid_alive() fails, and then we stop
iterating and miss some threads. (The relevant code is kernel/exit.c,
__unhash_process, which is protected by tasklist_lock and sighand->lock,
but nothing else.)

I wonder if a judicious use of threadgroup_fork_lock would solve this,
but I'm not sure where I could put it that would be safe. (When is
signal_struct freed?)

-- Ben

> 
> Thanks,
> -Kame
> 
> P.S. I'll be offlined until 2010/01/04.
> 
> 
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers