On Tue, 10 Jul 2012, Andrew Morton wrote:

> > The global oom killer is serialized by the zonelist being used in the
> > page allocation.
>
> Brain hurts. Presumably this is referring to some lock within the
> zonelist. Clarify, please?
>

Yeah, it's done with try_set_zonelist_oom() before calling the oom
killer; it sets the ZONE_OOM_LOCKED bit for each zone in the zonelist to
avoid concurrent oom kills for the same zonelist, otherwise it's possible
to needlessly kill more than one task.

> > Concurrent oom kills are thus a rare event and only occur in systems
> > using mempolicies and with a large number of nodes.
> >
> > Memory controller oom kills, however, can frequently be concurrent
> > since there is no serialization once the oom killer is called for oom
> > conditions in several different memcgs in parallel.
> >
> > This creates massive contention on tasklist_lock since the oom killer
> > requires the readside for the tasklist iteration. If several memcgs
> > are calling the oom killer, this lock can be held for a substantial
> > amount of time, especially if threads continue to enter it as other
> > threads are exiting.
> >
> > Since the exit path grabs the writeside of the lock with irqs
> > disabled in a few different places, this can cause a soft lockup on
> > cpus as a result of tasklist_lock starvation.
> >
> > The kernel lacks unfair writelocks, and successful calls to the oom
> > killer usually result in at least one thread entering the exit path,
> > so an alternative solution is needed.
> >
> > This patch introduces a separate oom handler for memcgs so that they
> > do not require tasklist_lock for as much time. Instead, it iterates
> > only over the threads attached to the oom memcg and grabs a reference
> > to the selected thread before calling oom_kill_process() to ensure
> > that it doesn't prematurely exit.
> >
> > This still requires tasklist_lock for the tasklist dump, iterating
> > children of the selected process, and killing all other threads on
> > the system sharing the same memory as the selected victim. So while
> > this isn't a complete solution to tasklist_lock starvation, it
> > significantly reduces the amount of time that it is held.
> >
> >
> > ...
> >
> > @@ -1469,6 +1469,65 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> >  	return min(limit, memsw);
> >  }
> >
> > +void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > +				int order)
>
> Perhaps have a comment over this function explaining why it exists?
>

It's removed in the last patch of the series, but I can add a comment to
the new mem_cgroup_out_of_memory() explaining why we need to kill a task
when a memcg reaches its limit, if you'd like.

For the archives, I've appended below the mainline serialization code
referenced above, the tasklist_lock contention pattern the changelog
describes, and a condensed sketch of the new handler.
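
First, the global serialization. This is roughly what the code in
mm/oom_kill.c looks like today (lightly condensed here, so don't treat
it as a verbatim quote):

/*
 * Try to acquire the oom killer "lock" for all zones in the zonelist.
 * Returns 0 if a parallel oom kill is already underway for one of the
 * zones; otherwise tags every zone with ZONE_OOM_LOCKED and returns 1.
 */
int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
{
	struct zoneref *z;
	struct zone *zone;
	int ret = 1;

	spin_lock(&zone_scan_lock);
	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
		if (zone_is_oom_locked(zone)) {
			ret = 0;
			goto out;
		}

	/*
	 * Tag all zones while holding zone_scan_lock so a parallel
	 * try_set_zonelist_oom() can't succeed in the meantime.
	 */
	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
		zone_set_flag(zone, ZONE_OOM_LOCKED);
out:
	spin_unlock(&zone_scan_lock);
	return ret;
}

clear_zonelist_oom() clears the bits again when the kill is done, so two
global oom kills can only run concurrently for disjoint zonelists, which
in practice means mempolicies on machines with many nodes.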
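
Second, to make the starvation scenario concrete: victim selection holds
the read side of tasklist_lock for the entire tasklist scan, while every
exiting thread needs the write side with interrupts disabled. In
paraphrase (these are fragments, not complete functions):

	/* mm/oom_kill.c: scoring scans every thread on the system */
	read_lock(&tasklist_lock);
	do_each_thread(g, p) {
		/* oom_badness() scoring for each thread ... */
	} while_each_thread(g, p);
	read_unlock(&tasklist_lock);

	/* kernel/exit.c: release_task() and friends, meanwhile */
	write_lock_irq(&tasklist_lock);
	/* unhash the task, reparent its children, ... */
	write_unlock_irq(&tasklist_lock);

Since rwlock_t has no writer priority, a steady stream of readers from
memcgs hitting their limits can starve the writer indefinitely, and
everything queued behind that writer is waiting with irqs off: hence the
soft lockups.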
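
Third, a condensed sketch of the shape of the new handler (the real
patch also deals with the oom_scan_process_thread() cases and panic on
oom, so again don't read this as the literal diff):

void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
				int order)
{
	unsigned long totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
	unsigned long chosen_points = 0;
	struct task_struct *chosen = NULL;
	struct mem_cgroup *iter;

	/*
	 * Iterate only the tasks attached to this memcg hierarchy via the
	 * cgroup iterator, so tasklist_lock is not needed for selection.
	 */
	for_each_mem_cgroup_tree(iter, memcg) {
		struct cgroup_iter it;
		struct task_struct *task;

		cgroup_iter_start(iter->css.cgroup, &it);
		while ((task = cgroup_iter_next(iter->css.cgroup, &it))) {
			unsigned long points = oom_badness(task, memcg, NULL,
							   totalpages);
			if (points > chosen_points) {
				if (chosen)
					put_task_struct(chosen);
				chosen = task;
				chosen_points = points;
				/* pin the victim so it can't exit beneath us */
				get_task_struct(chosen);
			}
		}
		cgroup_iter_end(iter->css.cgroup, &it);
	}

	if (!chosen)
		return;
	oom_kill_process(chosen, gfp_mask, order, chosen_points, totalpages,
			 memcg, NULL, "Memory cgroup out of memory");
	/* drop the reference taken during selection above */
	put_task_struct(chosen);
}

oom_kill_process() still takes tasklist_lock for the dump and for
iterating the victim's children, which is the remaining (much shorter)
hold time mentioned in the changelog.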