On Tue, 10 Jul 2012, Andrew Morton wrote:

> > The global oom killer is serialized by the zonelist being used in the
> > page allocation.
>
> Brain hurts. Presumably this is referring to some lock within the
> zonelist. Clarify, please?
>

Yeah, it's done with try_set_zonelist_oom() before calling the oom
killer; it sets the ZONE_OOM_LOCKED bit for each zone in the zonelist to
avoid concurrent oom kills for the same zonelist, otherwise it's possible
to needlessly kill more than one task.

> > Concurrent oom kills are thus a rare event and only occur in systems
> > using mempolicies and with a large number of nodes.
> >
> > Memory controller oom kills, however, can frequently be concurrent
> > since there is no serialization once the oom killer is called for oom
> > conditions in several different memcgs in parallel.
> >
> > This creates massive contention on tasklist_lock since the oom killer
> > requires the readside for the tasklist iteration. If several memcgs
> > are calling the oom killer, this lock can be held for a substantial
> > amount of time, especially if threads continue to enter it as other
> > threads are exiting.
> >
> > Since the exit path grabs the writeside of the lock with irqs
> > disabled in a few different places, this can cause a soft lockup on
> > cpus as a result of tasklist_lock starvation.
> >
> > The kernel lacks unfair writelocks, and successful calls to the oom
> > killer usually result in at least one thread entering the exit path,
> > so an alternative solution is needed.
> >
> > This patch introduces a separate oom handler for memcgs so that they
> > do not require tasklist_lock for as much time. Instead, it iterates
> > only over the threads attached to the oom memcg and grabs a reference
> > to the selected thread before calling oom_kill_process() to ensure
> > that it doesn't prematurely exit.
> >
> > This still requires tasklist_lock for the tasklist dump, iterating
> > children of the selected process, and killing all other threads on
> > the system sharing the same memory as the selected victim. So while
> > this isn't a complete solution to tasklist_lock starvation, it
> > significantly reduces the amount of time that it is held.
> >
> >
> > ...
> >
> > @@ -1469,6 +1469,65 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> >  	return min(limit, memsw);
> >  }
> >
> > +void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > +				int order)
>
> Perhaps have a comment over this function explaining why it exists?
>

It's removed in the last patch of the series, but I can add a comment to
the new mem_cgroup_out_of_memory() explaining why we need to kill a task
when a memcg reaches its limit, if you'd like.

For the archives, I've appended below the mainline serialization code
referenced above, the tasklist_lock contention pattern the changelog
describes, and a condensed sketch of the new handler.
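
First, the global serialization. This is roughly what the code in
mm/oom_kill.c looks like today (lightly condensed here, so don't treat
it as a verbatim quote):

/*
 * Try to acquire the oom killer "lock" for all zones in the zonelist.
 * Returns 0 if a parallel oom kill is already underway for one of the
 * zones; otherwise tags every zone with ZONE_OOM_LOCKED and returns 1.
 */
int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
{
	struct zoneref *z;
	struct zone *zone;
	int ret = 1;

	spin_lock(&zone_scan_lock);
	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
		if (zone_is_oom_locked(zone)) {
			ret = 0;
			goto out;
		}

	/*
	 * Tag all zones while holding zone_scan_lock so a parallel
	 * try_set_zonelist_oom() can't succeed in the meantime.
	 */
	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
		zone_set_flag(zone, ZONE_OOM_LOCKED);
out:
	spin_unlock(&zone_scan_lock);
	return ret;
}

clear_zonelist_oom() clears the bits again when the kill is done, so two
global oom kills can only run concurrently for disjoint zonelists, which
in practice means mempolicies on machines with many nodes.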
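
Second, to make the starvation scenario concrete: victim selection holds
the read side of tasklist_lock for the entire tasklist scan, while every
exiting thread needs the write side with interrupts disabled. In
paraphrase (these are fragments, not complete functions):

	/* mm/oom_kill.c: scoring scans every thread on the system */
	read_lock(&tasklist_lock);
	do_each_thread(g, p) {
		/* oom_badness() scoring for each thread ... */
	} while_each_thread(g, p);
	read_unlock(&tasklist_lock);

	/* kernel/exit.c: release_task() and friends, meanwhile */
	write_lock_irq(&tasklist_lock);
	/* unhash the task, reparent its children, ... */
	write_unlock_irq(&tasklist_lock);

Since rwlock_t has no writer priority, a steady stream of readers from
memcgs hitting their limits can starve the writer indefinitely, and
everything queued behind that writer is waiting with irqs off: hence the
soft lockups.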
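
Third, a condensed sketch of the shape of the new handler (the real
patch also deals with the oom_scan_process_thread() cases and panic on
oom, so again don't read this as the literal diff):

void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
				int order)
{
	unsigned long totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
	unsigned long chosen_points = 0;
	struct task_struct *chosen = NULL;
	struct mem_cgroup *iter;

	/*
	 * Iterate only the tasks attached to this memcg hierarchy via the
	 * cgroup iterator, so tasklist_lock is not needed for selection.
	 */
	for_each_mem_cgroup_tree(iter, memcg) {
		struct cgroup_iter it;
		struct task_struct *task;

		cgroup_iter_start(iter->css.cgroup, &it);
		while ((task = cgroup_iter_next(iter->css.cgroup, &it))) {
			unsigned long points = oom_badness(task, memcg, NULL,
							   totalpages);
			if (points > chosen_points) {
				if (chosen)
					put_task_struct(chosen);
				chosen = task;
				chosen_points = points;
				/* pin the victim so it can't exit beneath us */
				get_task_struct(chosen);
			}
		}
		cgroup_iter_end(iter->css.cgroup, &it);
	}

	if (!chosen)
		return;
	oom_kill_process(chosen, gfp_mask, order, chosen_points, totalpages,
			 memcg, NULL, "Memory cgroup out of memory");
	/* drop the reference taken during selection above */
	put_task_struct(chosen);
}

oom_kill_process() still takes tasklist_lock for the dump and for
iterating the victim's children, which is the remaining (much shorter)
hold time mentioned in the changelog.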