On Mon, 27 Dec 2010 03:42:57 -0500 Ben Blum <bblum@xxxxxxxxxxxxxx> wrote:

> On Mon, Dec 27, 2010 at 04:42:07PM +0900, KAMEZAWA Hiroyuki wrote:
> > Okay, then, the problem is cpuset, which allocates memory.
> > (But I feel that limitation -never_sleep- is not very good.)
> >
> > BTW, mem cgroup checks mm_owner(mm) == p to decide whether to take mmap_sem().
> > Your code does the moving based on thread_group_leader(). IIUC, this isn't
> > guaranteed to be the same. I wonder whether we can see a problem with some
> > stupid userland.
>
> I'm not sure I understand the point of looking at mm->owner? My code
> does thread_group_leader() because if a task that's not the leader calls
> exec(), the thread_group list will be in an indeterminate state from the
> perspective of the task we hold on to (who was originally the leader).
>

Hm. With the following, only the thread-group leader can go ahead:

+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}

And this moves each task under tasklist_lock():

+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}

Hm. I'm not sure whether a task with tsk == mm->owner can appear here.
If tsk == mm->owner, attach_task will take mmap_sem() and do more work.
In the usual case mm->owner == thread_group_leader, but that doesn't hold
in special cases.

> > > Also, the tasklist_lock isn't used to synchronize fork events; that's
> > > what threadgroup_fork_lock (an rwsem) is for. It's for protecting the
> > > thread-group list while iterating over it - so that a race with exec()
> > > (de_thread(), specifically) doesn't change the threadgroup state. It's
> > > basically never safe to iterate over leader->thread_group unless you're
> > > in a no-sleep section.
> > >
> >
> > Thank you for the explanation.
> >
> > At first, cpuset should modify settings around 'mm' only when the thread == mm->owner.
> > ....is there any reason to allow all threads to affect the 'mm'?
>
> I don't follow... is this any different from the cgroup_attach_task()
> case?
>

cpuset_attach() takes mmap_sem() to modify the mm's settings. Because all
threads share the mm, doing page migration whenever any thread moves seems
wrong. I think only one thread per process, the thread-group leader or
mm->owner, should be allowed to migrate pages. Then cgroup_attach_proc()
can avoid taking mmap_sem under tasklist_lock.

> > About nodemask, I feel what's required is a pre_pre_attach() to cache the
> > required memory before attaching, as radix_tree_preload() does. Then the
> > subsys can prepare its working area.
>
> To save on global memory footprint, we could pre-allocate two nodemasks,
> but I feel like it's not worth the increase in code complexity. This
> would need to be done in the other cases that unsafely do NODEMASK_ALLOC
> too... too much to keep track of for little gain.
>

But NODEMASK_ALLOC cannot be on the stack when we consider 4096-node systems.
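
To illustrate what I mean, here is a rough sketch of the preload idea (the
pre-attach hook, the scratch structure, and the function names are
hypothetical, not existing cgroup API):

	#include <linux/nodemask.h>
	#include <linux/sched.h>
	#include <linux/slab.h>

	/* scratch masks for one attach operation; too big for the kernel
	 * stack on MAX_NUMNODES=4096 configs, so allocate them up front */
	struct cpuset_attach_scratch {
		nodemask_t from;
		nodemask_t to;
	};

	/* hypothetical "pre_pre_attach" step: runs before tasklist_lock is
	 * taken, so GFP_KERNEL (and sleeping) is still allowed */
	static struct cpuset_attach_scratch *cpuset_attach_preload(void)
	{
		return kmalloc(sizeof(struct cpuset_attach_scratch), GFP_KERNEL);
	}

	/* hypothetical per-thread step: called under tasklist_lock, must not
	 * sleep, so it only touches the preallocated masks */
	static void cpuset_attach_task_sketch(struct task_struct *tsk,
					      struct cpuset_attach_scratch *s)
	{
		nodes_clear(s->to);
		/* ...compute the destination mems in s->to and rebind tsk,
		 * leaving mmap_sem and page migration to the mm owner... */
	}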
> > Hmm...but I wonder whether de_thread() should take threadgroup_fork_write_unlock().
> >
> > I may not understand anything important, but I feel taking tasklist_lock() is
> > overkill.
>
> Would rcu_read_lock() be any better? Everywhere else in the kernel that
> iterates over all threads in a group uses either rcu_read_lock or
> tasklist_lock.
>

Because it still means that can_attach()/attach() cannot sleep, it seems to
make no difference.

I wonder.... if you stop all clone()/fork() of the process being moved, you
can use find_ge_pid(). Please see next_tgid() or next_tid() in
fs/proc/base.c, which implement scanning the task list while being able to
sleep. Can't you use next_tid()?

Thanks,
-Kame

P.S. I'll be offline until 2011/01/04.