On Fri, 29 Jun 2012 14:06:56 -0700 (PDT) David Rientjes <rientjes@xxxxxxxxxx> wrote:

> The global oom killer is serialized by the zonelist being used in the
> page allocation.

Brain hurts.  Presumably this is referring to some lock within the
zonelist.  Clarify, please?

> Concurrent oom kills are thus a rare event and only
> occur in systems using mempolicies and with a large number of nodes.
>
> Memory controller oom kills, however, can frequently be concurrent since
> there is no serialization once the oom killer is called for oom
> conditions in several different memcgs in parallel.
>
> This creates a massive contention on tasklist_lock since the oom killer
> requires the readside for the tasklist iteration.  If several memcgs are
> calling the oom killer, this lock can be held for a substantial amount of
> time, especially if threads continue to enter it as other threads are
> exiting.
>
> Since the exit path grabs the writeside of the lock with irqs disabled in
> a few different places, this can cause a soft lockup on cpus as a result
> of tasklist_lock starvation.
>
> The kernel lacks unfair writelocks, and successful calls to the oom
> killer usually result in at least one thread entering the exit path, so
> an alternative solution is needed.
>
> This patch introduces a separate oom handler for memcgs so that they do
> not require tasklist_lock for as much time.  Instead, it iterates only
> over the threads attached to the oom memcg and grabs a reference to the
> selected thread before calling oom_kill_process() to ensure it doesn't
> prematurely exit.
>
> This still requires tasklist_lock for the tasklist dump, iterating
> children of the selected process, and killing all other threads on the
> system sharing the same memory as the selected victim.  So while this
> isn't a complete solution to tasklist_lock starvation, it significantly
> reduces the amount of time that it is held.
>
> ...
>
> @@ -1469,6 +1469,65 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> +void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +				int order)

Perhaps have a comment over this function explaining why it exists?

> +{
> +	struct mem_cgroup *iter;
> +	unsigned long chosen_points = 0;
> +	unsigned long totalpages;
> +	unsigned int points = 0;
> +	struct task_struct *chosen = NULL;
> +
> +	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		struct cgroup *cgroup = iter->css.cgroup;
> +		struct cgroup_iter it;
> +		struct task_struct *task;
> +
> +		cgroup_iter_start(cgroup, &it);
> +		while ((task = cgroup_iter_next(cgroup, &it))) {
> +			switch (oom_scan_process_thread(task, totalpages, NULL,
> +							false)) {
> +			case OOM_SCAN_SELECT:
>
> ...