On Sun, 26 Dec 2010 23:22:54 -0500 Ben Blum <bblum@xxxxxxxxxxxxxx> wrote: > On Mon, Dec 27, 2010 at 09:53:53AM +0900, Daisuke Nishimura wrote: > > On Fri, 24 Dec 2010 21:55:08 -0500 > > Ben Blum <bblum@xxxxxxxxxxxxxx> wrote: > > > > > On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote: > > > > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote: > > > > > Patches have gone a bit stale, sorry. Refactoring in > > > > > kernel/cgroup_freezer.c necessitates a refresh and retest please. > > > > > > > > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff > > > > in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which > > > > wraps kmalloc) instead of on the stack. > > > > > > > > 1. All these functions have 'void' return values, indicating that > > > > calling them them must not fail. Sure there are bailout cases, but no > > > > semblance of cross-function error propagation. Most importantly, > > > > cpuset_attach is a subsystem callback, which MUST not fail given the > > > > way it's used in cgroups, so relying on kmalloc is not safe. > > > > > > > > 2. I'm working on a patch series which needs to hold tasklist_lock > > > > across ss->attach callbacks (in cpuset_attach's "if (threadgroup)" > > > > case, this is how safe traversal of tsk->thread_group will be > > > > ensured), and kmalloc isn't allowed while holding a spin-lock. > > > > > > > > Why do we need heap-allocation here at all? In each case their scope is > > > > exactly the function's scope, and neither the commit nor the surrounding > > > > patch series give any explanation. I'd like to revert the patch if > > > > possible. > > > > > > > > cc'ing Miao Xie (author) and David Rientjes (acker). > > > > > > > > -- Ben > > > > > > Well even with the proposed solution to this there is still another > > > problem that I see - that of mmap_sem. cpuset_attach() calls into > > > mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for > > > writing and reading respectively. This is going to conflict with > > > tasklist_lock... but moreover, the memcontrol subsys also touches the > > > task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and > > > mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b. > > > > > > So we have (currently, even without my patches): > > > > > > cgroup_attach_task > > > (1) cpuset_can_attach > > > (2) mem_cgroup_can_attach > > > - down_read(&mm->mmap_sem); > > > (3) cpuset_attach > > > - mpol_rebind_mm > > > - down_write(&mm->mmap_sem); > > > - up_write(&mm->mmap_sem); > > > - cpuset_migrate_mm > > > - do_migrate_pages > > > - down_read(&mm->mmap_sem); > > > - up_read(&mm->mmap_sem); > > > (4) mem_cgroup_move_task > > > - mem_cgroup_clear_mc > > > - up_read(...); > > > > > hmm, nice catch. > > > > > Is there some interdependency I'm missing here that guarantees recursive > > > locking/deadlock will be avoided? It all looks like typical-case code. > > > > > Unfortunately, not. > > I couldn't hit this because I mount all subsystems onto different > > mount points in my environment. > > > > > I think we should move taking the mmap_sem all the way up into > > > cgroup_attach_task and cgroup_attach_proc; it will be held for writing > > > the whole time. I don't quite understand the mempolicy stuff but maybe > > > there can be ways to use mpol_rebind_mm and do_migrate_pages when the > > > lock is already held. > > > > > I agree. > > Another solution(just an idea): avoid enabling both "move_charge" feature of memcg > > and "memory_migrate" of cpuset at the same time iff they are mounted > > under the same mount point. But, hmm... it's not a good idea to make a subsystem > > take account of another subsystem, IMHO. > > > > > > Thanks, > > Daisuke Nishimura. > > It looks to me like when memcg holds the mmap_sem the whole time, it's > just to avoid the deadlock, not that there's there some need for the > stuff under mmap_sem not to change between can_attach and attach. But if > there is such a need, then the write-side in mpol_rebind_mm may conflict > even with my proposed solution. > > Regardless, the best way would be to avoid holding the mmap_sem across > the whole window, possibly by solving the move_charge deadlock some > other internal way, if at all possible? > IIUC, what you request is 'don't call any kind of function may sleep'. in can_attach() and attach() callback. That's impossible. Memory cgroup will go to sleep because of calling memory reclaim. Is tasklist_lock the only way ? Can you catch "new thread" event in cgroup_fork() ? For example, == void cgroup_fork(struct task_struct *child) { if (current->in_moving) { add_to_waitqueue somewhere. } task_lock(current); child->cgroups = current->cgroups; get_css_set(child->cgroups); task_unlock(current); INIT_LIST_HEAD(&child->cg_list); } == == read_lock(&tasklist_lock); list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) { tsk->in_moving = true; } read_unlock(&tasklist_lock); ->pre_attach() ->attach() read_unlock(&tasklist_lock); list_for_each_..... tsk->in_moving_cgroup = false; wakeup threads in waitq. == Ah yes, this will have some bug. But please avoid to take tasklist_lock() around all pre_attach() and attach(). Thanks, -Kame Thanks, -Kame _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers