On Fri, 19 Feb 2010, Nick Piggin wrote: > > guarantee_online_cpus() truly does require callback_mutex, the > > cgroup_scan_tasks() iterator locking can protect changes in the cgroup > > hierarchy but it doesn't protect a store to cs->cpus_allowed or for > > hotplug. > > Right, but the callback_mutex was being removed by this patch. > I was making the case for it to be readded :) > > top_cpuset.cpus_allowed will always need to track cpu_active_map since > > those are the schedulable cpus, it looks like that's initialized for SMP > > and the cpu hotplug notifier does that correctly. > > > > I'm not sure what the logic is doing in cpuset_attach() where cs is the > > cpuset to attach to: > > > > if (cs == &top_cpuset) { > > cpumask_copy(cpus_attach, cpu_possible_mask); > > to = node_possible_map; > > } > > > > cpus_attach is properly protected by cgroup_lock, but using > > node_possible_map here will set task->mems_allowed to node_possible_map > > when the cpuset does not have memory_migrate enabled. This is the source > > of your oops, I think. > > Could be, yes. > I'd be interested to see if you still get the same oops with the patch at the end of this email that fixes this logic. > But it doesn't matter if stores are done under lock, if the loads are > not. masks can be multiple words, so there isn't any ordering between > reading half and old mask and half a new one that results in an invalid > state. AFAIKS. > It doesn't matter for MAX_NUMNODES > BITS_PER_LONG because task->mems_alllowed only gets updated via cpuset_change_task_nodemask() where the added nodes are set and then the removed nodes are cleared. The side effect of this lockless access to task->mems_allowed means we may have a small race between nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); and tsk->mems_allowed = *newmems; but the penalty is that we get an allocation on a removed node, which isn't a big deal, especially since it was previously allowed. > Well it is exported as cpuset_lock(). And the scheduler has it covered > in all cases by the looks except for select_task_rq, which is called > by wakeup code. We should stick WARN_ONs through the cpuset code for > mutexes not held when they should be. > A lot of the reliance on callback_mutex was removed because the strict hierarchy walking and task membership is now guarded by cgroup_mutex instead. Some of the comments in kernel/cpuset.c weren't updated so they still say callback_mutex when in reality they mean cgroup_mutex. --- diff --git a/kernel/cpuset.c b/kernel/cpuset.c --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -1319,7 +1319,7 @@ static int fmeter_getrate(struct fmeter *fmp) return val; } -/* Protected by cgroup_lock */ +/* Protected by cgroup_mutex held on cpuset_attach() */ static cpumask_var_t cpus_attach; /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */ @@ -1390,8 +1390,12 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, struct cpuset *oldcs = cgroup_cs(oldcont); if (cs == &top_cpuset) { - cpumask_copy(cpus_attach, cpu_possible_mask); - to = node_possible_map; + /* + * top_cpuset.cpus_allowed and top_cpuset.mems_allowed are + * protected by cgroup_lock which is already held here. + */ + cpumask_copy(cpus_attach, top_cpuset.cpus_allowed); + to = top_cpuset.mems_allowed; } else { guarantee_online_cpus(cs, cpus_attach); guarantee_online_mems(cs, &to); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>