Re: [patch] mm, coredump: fail allocations when coredumping instead of oom killing

David Rientjes <rientjes@xxxxxxxxxx> · Thu, 15 Mar 2012 14:47:50 -0700 (PDT)

On Thu, 15 Mar 2012, Mel Gorman wrote:

> Where is all the memory going?  A brief look at elf_core_dump() looks
> fairly innocent.
> 
> o kmalloc for a header note
> o kmalloc potentially for a short header
> o dump_write() verifies access and calls f_op->write. I guess this could
>   be doing a lot of allocations underneath, is this where all the memory
>   is going?

Yup, this is the one.  We only currently see this when a memcg is at its 
limit and there are other threads that are trying to exit that are blocked 
on a coredumper that can no longer get memory.  dump_write() calling 
->write() (ext4 in this case) causes a livelock when 
add_to_page_cache_locked() tries to charge the soon-to-be-added pagecache 
to the coredumper's memcg that is oom and calls 
mem_cgroup_charge_common().  That allows the oom, but the oom killer will 
find the other threads that are exiting and choose to be a no-op to avoid 
needlessly killing threads.  The coredumper only has PF_DUMPCORE and not 
PF_EXITING so it doesn't get immediately killed.

So we have a decision to either immediately oom kill the coredumper or 
just fail its allocations and exit since this code does seem to have good 
error handling.  If RLIMIT_CORE is relatively small, it's not a problem to 
kill the coredumper and give it access to memory reserves.  We don't want 
to take that chance, however, since memcg allows all charges to be 
bypassed for threads that have been oom killed and have access to memory 
reserves with their TIF_MEMDIE bit set.  In the worst case, when 
RLIMIT_CORE is high or even unlimited, this could quickly cause a system 
oom condition and then we'd be stuck again because the oom killer finds an 
eligible thread with TIF_MEMDIE and all memory reserves have been 
depleted.

> Does the change mean that core dumps may fail where previously they would
> have succeeded even if the system churns a bit trying to write them out?

We haven't seen this in system-wide oom conditions but it shouldn't be 
unlike the memcg case where we completely livelock because all threads are 
waiting for the coredumper to exit and no memory is being freed.  Unless 
the hard limit is increased (or memory hotplugged in the system-wide oom 
condition), this consistely results in a livelock.  With the system-wide 
oom condition it's more likely that another thread can exit that is not 
going to block on waiting for the coredumper to exit and free memory, but 
it's not guaranteed and this patch fixes the memcg case as well.

> If so, should it be a tunable in like /proc/sys/kernel/core_mayoom that
> defaults to 1? Alternatively, would it be better if there was an option
> to synchronously write the core file and discard the page cache pages as
> the dump is written? It would be slower but it might stress the system less.
> 

I didn't think of adding a sysctl because all of its allocations are 
already GFP_KERNEL and are in a killable context where the oom killer 
already could decide to kill something; the problem is that it chooses not 
to do so because it sees threads that are PF_EXITING and defers, but in 
this case that will never happen because for them to fully exit and free 
its memory would require (perhaps a ton of) memory for the coredumper.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>