Re: [PATCH 1/2] mm,page_alloc: Don't call __node_reclaim() with oom_lock held.

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Sat, 26 Aug 2017 10:28:24 +0900

Andrew Morton wrote:
> On Thu, 24 Aug 2017 21:18:25 +0900 Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> 
> > We are doing last second memory allocation attempt before calling
> > out_of_memory(). But since slab shrinker functions might indirectly
> > wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
> > allocations via sleeping locks, calling slab shrinker functions from
> > node_reclaim() from get_page_from_freelist() with oom_lock held has
> > possibility of deadlock. Therefore, make sure that last second memory
> > allocation attempt does not call slab shrinker functions.
> 
> I wonder if there's any way we could get lockdep to detect this sort
> of thing.

That is hopeless as far as the MM subsystem is concerned.

The root problem is that the MM subsystem assumes that somebody else will make
progress on its behalf. Direct reclaim does not check whether other threads are
making progress (e.g. too_many_isolated() loops forever waiting for kswapd) and
keeps consuming CPU time (e.g. it deprives a thread doing
schedule_timeout_killable() with oom_lock held of all CPU time by doing
pointless get_page_from_freelist() calls etc.).

Since the page allocator chooses to retry the attempt rather than wait for
locks, lockdep won't help. The dependency is spread across all threads through
timing and threshold checks, which keep threads from ever performing the
operations that lockdep could detect.
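
To illustrate the point about retrying instead of waiting, here is a minimal
userspace sketch. It is a toy model, not kernel code: toy_lock, toy_trylock(),
retry_slowpath() and blocking_waits_seen are all hypothetical names invented
for this example. The counter stands in for what a lockdep-style checker could
observe, under the assumption that such a checker only learns about a
dependency when a thread actually blocks on a lock.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a sleeping lock such as oom_lock. */
struct toy_lock {
    bool held;              /* true while some other thread owns the lock */
};

/*
 * Number of blocking waits performed. In this model, blocking waits are the
 * only events a lockdep-style checker could turn into dependency edges.
 * Nothing below ever blocks, so this stays at zero.
 */
static unsigned int blocking_waits_seen;

/* Try to take the lock; fail immediately instead of waiting. */
static bool toy_trylock(struct toy_lock *lock)
{
    if (lock->held)
        return false;       /* owner is busy; nothing is recorded */
    lock->held = true;
    return true;
}

/*
 * Model of a retry-style slow path. Returns how many attempts failed,
 * up to max_retries. Each failed attempt burns CPU but never waits.
 */
static unsigned int retry_slowpath(struct toy_lock *lock,
                                   unsigned int max_retries)
{
    unsigned int failed = 0;

    while (failed < max_retries) {
        if (toy_trylock(lock)) {
            lock->held = false;     /* got the lock: do the work, release */
            break;
        }
        failed++;                   /* owner made no progress; try again */
    }
    return failed;
}
```

If the owner never releases the lock, retry_slowpath() exhausts every retry
while blocking_waits_seen stays at zero, so in this model the checker has
nothing to report even though the caller is effectively stuck.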

I do wish we could get rid of __GFP_DIRECT_RECLAIM and offload memory reclaim
operations to some kswapd-like kernel threads. Then we would be able to check
the progress of the relevant threads and invoke the OOM killer as needed (rather
than doing the __GFP_FS check in out_of_memory()), as well as implement
__GFP_KILLABLE.

> 
> Has the deadlock been observed in testing?  Do we think this fix
> should be backported into -stable?

I have never observed this deadlock, but it is hard for anybody to know
whether they have hit it. The only clue, available since 4.9 (though still
unreliable), is warn_alloc() complaining that a memory allocation is stalling
for some reason. Users of 2.6.18/2.6.32/3.10 kernels have absolutely no clue
(other than using SysRq-t etc., which generates too many messages and demands
too much effort).

Judging from my experience at a support center, it is too difficult for users
to report memory allocation hangs. It requires users to stand by in front of
the console twenty-four seven so that we get SysRq-t etc. whenever a memory
allocation related problem is suspected. We can't ask users for such effort.
The absence of reports does not mean that memory allocation hangs are not
occurring in real life. But nobody (other than me) is interested in adding an
asynchronous watchdog like kmallocwd. Thus, I'm spending much effort on finding
potential lockup bugs using stress tests, Michal does not care about bugs which
are found by stress tests, nobody else is responding, and users do not have a
reliable means (e.g. kmallocwd) to report lockup bugs caused by memory
allocation.
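
For readers unfamiliar with the idea, an asynchronous allocation watchdog can
be sketched roughly as follows. This is a simplified userspace illustration
under stated assumptions, not the kmallocwd patch itself: the table, the
function names (alloc_begin(), alloc_end(), watchdog_scan()) and the
millisecond timestamps are all hypothetical, and the locking that a real
implementation would need is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64          /* hypothetical table size */

struct alloc_record {
    int      in_use;            /* slot describes an in-flight allocation */
    int      task_id;           /* hypothetical task identifier */
    uint64_t start_ms;          /* when the allocation attempt began */
};

static struct alloc_record table[MAX_TRACKED];

/*
 * Called by the allocating task before it enters the slow path.
 * Returns the slot index, or -1 if the table is full.
 */
static int alloc_begin(int task_id, uint64_t now_ms)
{
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            table[i].task_id = task_id;
            table[i].start_ms = now_ms;
            return i;
        }
    }
    return -1;
}

/* Called by the allocating task once the allocation has finished. */
static void alloc_end(int slot)
{
    assert(slot >= 0 && slot < MAX_TRACKED);
    table[slot].in_use = 0;
}

/*
 * Called from a separate watchdog context. Stores the ids of tasks whose
 * allocation has been in flight for at least timeout_ms into out[] and
 * returns how many were stored (at most out_len).
 */
static size_t watchdog_scan(uint64_t now_ms, uint64_t timeout_ms,
                            int *out, size_t out_len)
{
    size_t n = 0;

    for (int i = 0; i < MAX_TRACKED && n < out_len; i++) {
        if (table[i].in_use && now_ms - table[i].start_ms >= timeout_ms)
            out[n++] = table[i].task_id;
    }
    return n;
}
```

The design point is that the scan runs independently of the stalled tasks, so
it can still report even when the allocating threads themselves make no
progress and nobody is at the console to trigger SysRq-t.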

Sigh.....
