On Mon, 27 Dec 2010 03:42:57 -0500 Ben Blum <bblum@xxxxxxxxxxxxxx> wrote:

> On Mon, Dec 27, 2010 at 04:42:07PM +0900, KAMEZAWA Hiroyuki wrote:
> > Okay, then, the problem is cpuset, which allocates memory.
> > (But I feel that limitation -never_sleep- is not very good.)
> >
> > BTW, mem cgroup checks mm_owner(mm) == p to decide whether to take mmap_sem().
> > Your code does the moving based on thread_group_leader(). IIUC, this isn't
> > guaranteed to be the same. I wonder whether we can see a problem with some
> > stupid userland.
>
> I'm not sure I understand the point of looking at mm->owner? My code
> does thread_group_leader() because if a task that's not the leader calls
> exec(), the thread_group list will be in an indeterminate state from the
> perspective of the task we hold on to (who was originally the leader).
>

Hm. With the following, only the thread-group leader can go ahead:

+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}

And this moves each task under tasklist_lock():

+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}

Hm. I'm not sure whether a task with tsk == mm->owner can appear here.
If tsk == mm->owner, attach_task will take mmap_sem() and do more work.
In the usual case mm->owner == thread_group_leader, but that doesn't hold
in special cases.

> > > Also, the tasklist_lock isn't used to synchronize fork events; that's
> > > what threadgroup_fork_lock (an rwsem) is for. It's for protecting the
> > > thread-group list while iterating over it - so that a race with exec()
> > > (de_thread(), specifically) doesn't change the threadgroup state. It's
> > > basically never safe to iterate over leader->thread_group unless you're
> > > in a no-sleep section.
> > >
> >
> > Thank you for the explanation.
> >
> > At first, cpuset should modify settings around 'mm' only when the thread == mm->owner.
> > ....is there any reason to allow all threads to affect the 'mm'?
>
> I don't follow... is this any different from the cgroup_attach_task()
> case?
>

cpuset_attach() takes mmap_sem() to modify the mm's settings. Because all
threads share the mm, doing page migration whenever any thread moves seems
wrong. I think only one thread per process, the thread-group leader or
mm->owner, should be allowed to migrate pages. Then cgroup_attach_proc()
can avoid taking mmap_sem under tasklist_lock.

> > About nodemask, I feel what's required is a pre_pre_attach() to cache the
> > required memory before attaching, as radix_tree_preload() does. Then the
> > subsys can prepare its working area.
>
> To save on global memory footprint, we could pre-allocate two nodemasks,
> but I feel like it's not worth the increase in code complexity. This
> would need to be done in the other cases that unsafely do NODEMASK_ALLOC
> too... too much to keep track of for little gain.
>

But NODEMASK_ALLOC cannot be on the stack when we consider 4096-node systems.
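
To illustrate what I mean, here is a rough sketch of the preload idea (the
pre-attach hook, the scratch structure, and the function names are
hypothetical, not existing cgroup API):

	#include <linux/nodemask.h>
	#include <linux/sched.h>
	#include <linux/slab.h>

	/* scratch masks for one attach operation; too big for the kernel
	 * stack on MAX_NUMNODES=4096 configs, so allocate them up front */
	struct cpuset_attach_scratch {
		nodemask_t from;
		nodemask_t to;
	};

	/* hypothetical "pre_pre_attach" step: runs before tasklist_lock is
	 * taken, so GFP_KERNEL (and sleeping) is still allowed */
	static struct cpuset_attach_scratch *cpuset_attach_preload(void)
	{
		return kmalloc(sizeof(struct cpuset_attach_scratch), GFP_KERNEL);
	}

	/* hypothetical per-thread step: called under tasklist_lock, must not
	 * sleep, so it only touches the preallocated masks */
	static void cpuset_attach_task_sketch(struct task_struct *tsk,
					      struct cpuset_attach_scratch *s)
	{
		nodes_clear(s->to);
		/* ...compute the destination mems in s->to and rebind tsk,
		 * leaving mmap_sem and page migration to the mm owner... */
	}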
> > Hmm...but I wonder whether de_thread() should take threadgroup_fork_write_unlock().
> >
> > I may not understand anything important, but I feel taking tasklist_lock() is
> > overkill.
>
> Would rcu_read_lock() be any better? Everywhere else in the kernel that
> iterates over all threads in a group uses either rcu_read_lock or
> tasklist_lock.
>

Because it still means that can_attach()/attach() cannot sleep, it seems to
make no difference.

I wonder.... if you stop all clone()/fork() of the process being moved, you
can use find_ge_pid(). Please see next_tgid() or next_tid() in
fs/proc/base.c, which implement scanning the task list while being able to
sleep. Can't you use next_tid()?

Thanks,
-Kame

P.S. I'll be offline until 2011/01/04.