Andrew Morton wrote:
> On Thu, 24 Aug 2017 21:18:25 +0900 Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > We are doing last second memory allocation attempt before calling
> > out_of_memory(). But since slab shrinker functions might indirectly
> > wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
> > allocations via sleeping locks, calling slab shrinker functions from
> > node_reclaim() from get_page_from_freelist() with oom_lock held has
> > possibility of deadlock. Therefore, make sure that last second memory
> > allocation attempt does not call slab shrinker functions.
>
> I wonder if there's any way we could get lockdep to detect this sort
> of thing.

That is hopeless as far as the MM subsystem is concerned. The root
problem is that the MM subsystem assumes that somebody else will make
progress on my behalf. Direct reclaim does not check whether other
threads are making progress (e.g. too_many_isolated() loops forever
waiting for kswapd) and keeps consuming CPU resources (e.g. it deprives
a thread doing schedule_timeout_killable() with oom_lock held of all
CPU time by doing pointless get_page_from_freelist() etc.).

Since the page allocator chooses to retry the attempt rather than wait
for locks, lockdep won't help. The dependency is spread across all
threads through timing and threshold checks, which keep threads from
calling the operations that lockdep would detect.

I do wish we could get rid of __GFP_DIRECT_RECLAIM and offload memory
reclaim to some kswapd-like kernel threads. Then we would be able to
check the progress of the relevant threads and invoke the OOM killer as
needed (rather than doing the __GFP_FS check in out_of_memory()), as
well as implement __GFP_KILLABLE.

>
> Has the deadlock been observed in testing?  Do we think this fix
> should be backported into -stable?

I have never observed this deadlock, but it is hard for anyone to know
whether they have hit this deadlock.
The only clue, available since 4.9 (though still unreliable), is
warn_alloc() complaining that a memory allocation is stalling for some
reason. Users of 2.6.18/2.6.32/3.10 kernels have absolutely no way to
know (other than using SysRq-t etc., which generates too many messages
and asks for too much effort).

Judging from my experience at a support center, it is too difficult for
users to report memory allocation hangs. It requires users to stand by
in front of the console twenty-four seven so that we get SysRq-t etc.
whenever a memory allocation related problem is suspected. We can't ask
users for such effort. The absence of reports does not mean that memory
allocation hangs are not occurring in real life. But nobody (other than
me) is interested in adding an asynchronous watchdog like kmallocwd.

Thus, I'm spending much effort on finding potential lockup bugs using
stress tests, Michal does not care about bugs found by stress tests,
nobody else is responding, and users do not have a reliable means
(e.g. kmallocwd) to report lockup bugs caused by memory allocation.
Sigh.....
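
To illustrate the "retry rather than wait for locks" point above, here
is a minimal userspace sketch, not kernel code. It assumes a pthread
mutex as a stand-in for oom_lock; slowpath_pass() and
spin_against_stuck_holder() are hypothetical names made up for this
illustration. To keep it deterministic, the same thread plays both the
stuck holder and the retrying allocator. Every failed trylock just
counts as a retry and nobody ever blocks on the lock, so a
lock-dependency checker has no wait edge to record.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's oom_lock (illustration only). */
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical model of one pass through the allocator slow path.
 * Returns true if this pass got to run the OOM path, false if it
 * backed off and the caller should retry.
 */
static bool slowpath_pass(unsigned long *retries)
{
	if (pthread_mutex_trylock(&oom_lock) != 0) {
		/* Busy: assume the holder makes progress, and retry. */
		(*retries)++;
		return false;
	}
	/* The last second allocation attempt / OOM kill would go here. */
	pthread_mutex_unlock(&oom_lock);
	return true;
}

/*
 * Model a holder that never releases oom_lock while an allocating
 * context keeps retrying. The retrier never sleeps on the lock, so
 * the dependency on the holder is invisible to a lock-order checker.
 */
static unsigned long spin_against_stuck_holder(int attempts)
{
	unsigned long retries = 0;
	int rc = pthread_mutex_lock(&oom_lock);

	assert(rc == 0);
	for (int i = 0; i < attempts; i++)
		slowpath_pass(&retries);
	pthread_mutex_unlock(&oom_lock);
	return retries;
}
```

Calling spin_against_stuck_holder(n) returns n: every pass backs off
while the lock is held. In the real problem the holder is a different
thread that is itself starved or waiting, which is why only a
progress-checking watchdog, rather than lockdep, could notice it.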