On Mon, Dec 27, 2010 at 06:18:01PM +0900, KAMEZAWA Hiroyuki wrote: > On Mon, 27 Dec 2010 03:42:57 -0500 > Ben Blum <bblum@xxxxxxxxxxxxxx> wrote: > > > On Mon, Dec 27, 2010 at 04:42:07PM +0900, KAMEZAWA Hiroyuki wrote: > > > > Okay, then, problem is cpuset, which allocates memory. > > > (But I feel that limitation -never_sleep- is not very good.) > > > > > > BTW, mem cgroup chesk mm_owner(mm) == p to decide to take mmap_sem(). > > > Your code does moving "thread_group_leader()". IIUC, this isn't guaranteed to > > > be the same. I wonder we can see problem with some stupid userland. > > > > I'm not sure I understand the point of looking at mm->owner? My code > > does thread_group_leader() because if a task that's not the leader calls > > exec(), the thread_group list will be in an indeterminate state from the > > perspective of the task we hold on to (who was originally the leader). > > > > Hm. By following, only 'thread-group-leader' can go ahead. > > + threadgroup_fork_write_lock(leader); > + read_lock(&tasklist_lock); > + /* sanity check - if we raced with de_thread, we must abort */ > + if (!thread_group_leader(leader)) { > + retval = -EAGAIN; > + read_unlock(&tasklist_lock); > + threadgroup_fork_write_unlock(leader); > + goto list_teardown; > + } > > This moves each tasks under tasklist_lock(). > > + list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) { > + /* leave current thread as it is if it's already there */ > + oldcgrp = task_cgroup_from_root(tsk, root); > + if (cgrp == oldcgrp) > + continue; > + /* attach each task to each subsystem */ > + for_each_subsys(root, ss) { > + if (ss->attach_task) > + ss->attach_task(cgrp, tsk); > + } > + /* we don't care whether these threads are exiting */ > + retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true); > + BUG_ON(retval != 0 && retval != -ESRCH); > + } > > Hm. I'm not sure there can be a task, tsk == mm->owner here. > If tsk == mm->owner, attach task will take mmap_sem() and do something more. > > In usual case, mm->owner = thread_group_leader, but it's not in special cases. > > > > > > > Also, the tasklist_lock isn't used to synchronize fork events; that's > > > > what threadgroup_fork_lock (an rwsem) is for. It's for protecting the > > > > thread-group list while iterating over it - so that a race with exec() > > > > (de_thread(), specifically) doesn't change the threadgroup state. It's > > > > basically never safe to iterate over leader->thread_group unless you're > > > > in a no-sleep section. > > > > > > > > > > Thank you for explanation. > > > > > > At first, cpuset should modify settings around 'mm' only when the thread == mm->owner. > > > ....is there any reason to allow all threads can affect the 'mm' ? > > > > I don't follow... is this any different from the cgroup_attach_task() > > case? > > > > cpuset_attach() takes mmap_sem() to modify mm's settings. Because all threads > share mm, doing page migration whenever a thread moves seems to be wrong. > I think only a thread, thread-group-leader or mm->owner, in process should be > allowed to migrate pages. Then, cgroup_attach_proc() can avoid taking mmap_sem > under tasklist_lock. Yes, but note: cpuset_migrate_mm isn't a per-thread operation already. By keeping it in cpuset_attach() (called once, not under tasklist_lock), instead of putting it in cpuset_attach_task() (called many times, under tasklist_lock), there is no problem. > > > About nodemask, I feel what's required is pre_pre_attach() to cache required memory > > > before attaching as radix_tree_preload(). Then, subsys can prepare for creating > > > working area. > > > > To save on global memory footprint, we could pre-allocate two nodemasks, > > but I feel like it's not worth the increase in code complexity. This > > would need to be done in the other cases that unsafely do NODEMASK_ALLOC > > too... too much to keep track of for little gain. > > > > But NODEMASK_ALLOC cannot be on stack when we consider 4096node systems. Hence global allocation, instead of on-stack. (Also, for this particular case, the state of the nodemasks needs to persist from pre_attach to attach_task to attach, so it can't be static inside the function either.) > > > Hmm...but I wonder de_thread() should take threadgroup_fork_write_unlock(). > > > > > > I may not understand anything important but I feel taking tasklist_lock() is overkill. > > > > Would rcu_read_lock() be any better? Everywhere else in the kernel that > > iterates over all threads in a group uses either rcu_read_lock or > > tasklist_lock. > > > > Because it means that can_attach()/attach() cannot sleep, it seems to make no > difference. can_attach_task(), pre_attach(), and attach_task() cannot sleep, but can_attach() and attach() may. Careful not to confuse them. :P > I wonder.... if you stops all clone()/fork() of the proc in moving, you can use > find_ge_pid(). Please see next_tgid() or next_tid() in fs/proc/base.c which > implements scanning tasklist with sleep. Can't you use next_tid() ? If I'm not mistaken, that approach is vulnerable to an exit() race - next_tid() may return NULL if pid_alive() fails, and then we stop iterating and miss some threads. (The relevant code is kernel/exit.c, __unhash_process, which is protected by tasklist_lock and sighand->lock, but nothing else.) I wonder if a judicious use of threadgroup_fork_lock would solve this, but I'm not sure where I could put it that would be safe. (When is signal_struct freed?) -- Ben > > Thanks, > -Kame > > P.S. I'll be offlined until 2010/01/04. > > _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers