On Mon, 19 Mar 2012 17:46:47 -0700 (PDT) David Rientjes <rientjes@xxxxxxxxxx> wrote:

> On Mon, 19 Mar 2012, Andrew Morton wrote:
>
> > > Yup, this is the one. We only currently see this when a memcg is at its
> > > limit and there are other threads that are trying to exit that are blocked
> > > on a coredumper that can no longer get memory. dump_write() calling
> > > ->write() (ext4 in this case) causes a livelock when
> > > add_to_page_cache_locked() tries to charge the soon-to-be-added pagecache
> > > to the coredumper's memcg that is oom and calls
> > > mem_cgroup_charge_common(). That allows the oom, but the oom killer will
> > > find the other threads that are exiting and choose to be a no-op to avoid
> > > needlessly killing threads. The coredumper only has PF_DUMPCORE and not
> > > PF_EXITING so it doesn't get immediately killed.
> >
> > I don't understand the description of the livelock. Does
> > add_to_page_cache_locked() succeed, or fail? What does "allows the
> > oom" mean?
> >
>
> Sorry if it wasn't clear. The coredumper calling into
> add_to_page_cache_locked() calls the oom killer because the memcg is oom
> (and would call the global oom killer if the entire system were oom). The
> oom killer, both memcg and global, doesn't do anything because it sees
> eligible threads with PF_EXITING set. This logic has existed for several
> years to avoid needlessly oom killing additional threads when others are
> already in the process of exiting and freeing their memory. Those
> PF_EXITING threads, however, are blocked on the coredumper to exit in
> exit_mm(), so they'll never actually exit. Thus, the coredumper must make
> forward progress for anything to actually exit and the oom killer is
> useless.
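To make the dependency cycle concrete, here is a small userspace model. It is a sketch, not kernel code: the MODEL_PF_* bits, struct model_task, model_select_victim() and model_coredump_charge() are all hypothetical names that only imitate the behaviour described above. The model OOM killer does nothing while any eligible task has PF_EXITING set, and the exiting tasks cannot release memory until the coredumper finishes, so the coredumper's charge never makes progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this model only (not the kernel's values). */
#define MODEL_PF_EXITING  0x1u
#define MODEL_PF_DUMPCORE 0x2u

struct model_task {
    unsigned int flags;
    long rss;  /* pages this task would free if killed */
};

/* Simplified victim selection: if any eligible task is already exiting,
 * assume its memory is about to be freed and kill nothing (NULL).
 * Otherwise pick the largest task, a stand-in for real badness scoring. */
static struct model_task *model_select_victim(struct model_task *tasks, size_t n)
{
    struct model_task *victim = NULL;
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].flags & MODEL_PF_EXITING)
            return NULL;
        if (!victim || tasks[i].rss > victim->rss)
            victim = &tasks[i];
    }
    return victim;
}

/* One pagecache charge by the coredumper against a memcg at its limit.
 * Each failed attempt invokes the model OOM killer; a chosen victim's
 * pages are uncharged at once (a simplification). Returns true if the
 * charge succeeds within max_tries, false if no progress was made. */
static bool model_coredump_charge(struct model_task *tasks, size_t n,
                                  long *usage, long limit, int max_tries)
{
    assert(usage != NULL);
    for (int attempt = 0; attempt < max_tries; attempt++) {
        if (*usage < limit) {
            (*usage)++;
            return true;
        }
        struct model_task *victim = model_select_victim(tasks, n);
        if (victim) {
            *usage -= victim->rss;
            victim->rss = 0;
            victim->flags |= MODEL_PF_EXITING;
        }
        /* With no victim, the PF_EXITING tasks stay blocked in
         * exit_mm() waiting for the coredumper, so *usage never drops. */
    }
    return false;
}
```

With one PF_DUMPCORE task and one PF_EXITING task the charge gives up after max_tries with usage still at the limit; clear the PF_EXITING flag and the model kills a victim and the charge goes through.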
>
> In this condition, there are a few options:
>
> - give the coredumper access to memory reserves and allow it to allocate,
> essentially oom killing it,
>
> - fail coredumper memory allocations because of the oom condition and
> allow the threads blocked on it to exit, or
>
> - implement an oom killer timeout that would kill additional threads if
> we repeatedly call into it without making forward progress over a small
> period of time.
>
> The first and last, in my opinion, are non-starters because it allows a
> complete depletion of memory reserves if the coredumper is chosen and then
> nothing is guaranteed to be able to ever exit.

Why does option 1 lead to reserve exhaustion? If we have a zillion
simultaneous core dumps?

> This patch implements the
> middle option where we do our best effort to allow the coredump to be
> successful (we even try direct reclaim before failing) but choose to fail
> before calling into the oom killer and causing a livelock.