On Thu 19-01-17 10:07:55, Mel Gorman wrote:
[...]
> mm, vmscan: Wait on a waitqueue when too many pages are isolated
>
> When too many pages are isolated, direct reclaim waits on congestion to
> clear for up to a tenth of a second. There is no reason to believe that
> too many pages are isolated due to dirty pages, reclaim efficiency or
> congestion. It may simply be because an extremely large number of
> processes have entered direct reclaim at the same time. However, it is
> possible for the situation to persist forever and never reach OOM.
>
> This patch queues processes on a waitqueue when too many pages are
> isolated. When parallel reclaimers finish shrink_page_list, they wake
> the waiters to recheck whether too many pages are isolated.
>
> The wait on the queue has a timeout as not all sites that isolate pages
> will do the wakeup. Depending on every isolation of LRU pages being
> perfect forever is potentially fragile. The specific wakeups occur for
> page reclaim and compaction. If too many pages are isolated due to
> memory failure, hotplug or directly calling migration from a syscall
> then the waiting processes may wait the full timeout.
>
> Note that the timeout allows the use of waitqueue_active() on the basis
> that a race will cause the full timeout to be reached due to a missed
> wakeup. This is relatively harmless and still a massive improvement
> over unconditionally calling congestion_wait.
>
> Direct reclaimers that cannot isolate pages within the timeout will
> consider returning to the caller. This is somewhat clunky as it won't
> return immediately and may go through the other priorities and slab
> shrinking. Eventually, it'll go through a few iterations of
> should_reclaim_retry and reach the MAX_RECLAIM_RETRIES limit and
> consider going OOM.
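IIUC, the waitqueue approach boils down to something like the following
sketch (the isolated_wq name and both helpers are made up here and only
indicative of the idea; this is not your actual patch):

	static DECLARE_WAIT_QUEUE_HEAD(isolated_wq);

	/* direct reclaimers block here instead of congestion_wait */
	static void wait_for_isolated_pages(struct pglist_data *pgdat,
					    int file, struct scan_control *sc)
	{
		/*
		 * Woken up or timed out - either way the caller rechecks
		 * too_many_isolated(). The timeout covers isolation sites
		 * (memory failure, hotplug, migration) which never issue
		 * a wakeup.
		 */
		wait_event_interruptible_timeout(isolated_wq,
				!too_many_isolated(pgdat, file, sc), HZ/10);
	}

	/* called once shrink_page_list has put pages back on the LRU */
	static void wake_isolated_waiters(void)
	{
		/*
		 * waitqueue_active() without further synchronization is
		 * tolerable only because a missed wakeup is bounded by
		 * the timeout above.
		 */
		if (waitqueue_active(&isolated_wq))
			wake_up(&isolated_wq);
	}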
I cannot really say I would like this. It's just much more complex than
necessary. I definitely agree that congestion_wait while waiting for
too_many_isolated is a crude hack. This patch doesn't really resolve my
biggest worry, though, that we go OOM with too many pages isolated, as
your patch doesn't alter zone_reclaimable_pages to reflect those
numbers.

Anyway, I think both of us are probably overcomplicating things a bit.
Your waitqueue approach is definitely better semantically than
congestion_wait because we are waiting for a different event than the
API is intended for. On the other hand a mere
schedule_timeout_interruptible might work equally well in real life.
Then again, I might be over-emphasising the role of the NR_ISOLATED*
counts. It might turn out that we can safely ignore them and it won't
be the end of the world.

So what do you think about the following as a starting point? If we
ever see OOM reports with a high number of NR_ISOLATED* pages (they are
part of the OOM report) then we will know we have to do something about
it, and those changes will at least be driven by a real usecase rather
than theoretical scenarios.

Tetsuo, would you be willing to run this patch through your torture
testing, please?
---
From 47cba23b5b50260b533d7ad57a4c9e6a800d9b20 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxxx>
Date: Thu, 19 Jan 2017 12:11:56 +0100
Subject: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Tetsuo Handa has reported [1] that direct reclaimers might get stuck in
the too_many_isolated loop basically forever because the last few pages
on the LRU lists are isolated by kswapd, which is stuck on fs locks when
doing the pageout. This in turn means that there is nobody to actually
trigger the oom killer and the system is basically unusable.

too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature oom killer invocations because back then no reclaim progress
could indeed trigger the OOM killer too early. But since the oom
detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the
allocation/reclaim retry loop considers all the reclaimable pages and
throttles the allocation at that layer, so we can loosen the direct
reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and
return immediately when the situation hasn't resolved itself after the
first sleep. Replace congestion_wait with a simple
schedule_timeout_interruptible because we are not really waiting on IO
congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for reclaim that is
making progress only very slowly. This would be obvious from the oom
report, as the number of isolated pages is printed there. If we ever hit
this, should_reclaim_retry should consider those numbers in the
evaluation in one way or another.

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@xxxxxxxxxxxxxxxxxxx

Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a60066d4521b..d07380ba1f9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1718,9 +1718,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs
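A minimal sketch of the "should_reclaim_retry should consider those
numbers" remark in the changelog above (illustrative only, assuming the
4.9-era per-node counters; the helper name is made up and this is not a
tested change): count reclaim that is still in flight as reclaimable, so
that the retry loop keeps waiting for the isolated pages instead of
declaring OOM while they are in transit.

	static unsigned long node_reclaimable_with_isolated(struct pglist_data *pgdat)
	{
		unsigned long nr;

		/* pages still sitting on the LRU lists */
		nr = node_page_state(pgdat, NR_ACTIVE_FILE) +
		     node_page_state(pgdat, NR_INACTIVE_FILE);
		if (get_nr_swap_pages() > 0)
			nr += node_page_state(pgdat, NR_ACTIVE_ANON) +
			      node_page_state(pgdat, NR_INACTIVE_ANON);

		/* reclaim in flight: isolated but not yet freed or put back */
		nr += node_page_state(pgdat, NR_ISOLATED_FILE);
		if (get_nr_swap_pages() > 0)
			nr += node_page_state(pgdat, NR_ISOLATED_ANON);

		return nr;
	}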