Re: [PATCH] mm: fix negative nr_isolated counts

Vlastimil Babka <vbabka@xxxxxxx> · Thu, 12 Feb 2015 09:18:14 +0100

On 02/11/2015 10:09 PM, Andrew Morton wrote:
> On Tue, 10 Feb 2015 23:06:09 -0800 (PST) Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>
>> The vmstat interfaces are good at hiding negative counts (at least
>> when CONFIG_SMP); but if you peer behind the curtain, you find that
>> nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
>> more negative: so they can absorb larger and larger numbers of isolated
>> pages, yet still appear to be zero.
>>
>> I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
>> but I guess it's there for a good reason, in which case we ought to get
>> too_many_isolated() working again.
>>
>> The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
>> putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
>> to call acct_isolated() to increment them.
>
> So if I'm understanding this correctly, shrink_inactive_list()'s call
> to congestion_wait() basically never happens?

I think so: the more negative the counters go, the less likely
congestion_wait() is to be called from there.
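
To illustrate why a negative counter defeats the check, here is a
user-space toy model in C. It is only a sketch, not the kernel code:
toy_zone, toy_read_counter() and toy_too_many_isolated() are made-up
names, and it assumes just two things, that negative counts read back
as zero and that throttling happens only when isolated pages outnumber
inactive ones.

```c
#include <stdbool.h>

/* Toy user-space model; these names are illustrative, not kernel API. */
struct toy_zone {
	long nr_inactive;	/* pages on the inactive LRU */
	long nr_isolated;	/* raw isolated counter; may drift negative */
};

/* Assumed read behaviour: a negative raw counter is reported as zero. */
static unsigned long toy_read_counter(long raw)
{
	return raw < 0 ? 0UL : (unsigned long)raw;
}

/* Assumed check: throttle only when isolated pages outnumber inactive. */
static bool toy_too_many_isolated(const struct toy_zone *z)
{
	unsigned long inactive = toy_read_counter(z->nr_inactive);
	unsigned long isolated = toy_read_counter(z->nr_isolated);

	return isolated > inactive;
}
```

With nr_inactive = 100 and 150 pages really isolated, the check fires.
Once the raw counter has drifted by -500, the same 150 pages read back
as zero and the check stays quiet.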

> If so I'm pretty reluctant to merge this up until it has had plenty of
> careful testing - there's a decent chance that it will make the kernel
> behave worse.

You mean "worse" in the sense of letting shrink_inactive_list() call
congestion_wait() again, as it did before 3.18 and, it seems, ever
since 2009? Maybe it's not needed anymore, but IMHO it shouldn't be
disabled by accident; it should be properly evaluated and then removed.
Hugh's patch just fixes the accidental disabling.
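
The imbalance Hugh describes can be sketched as a small user-space
model in C. This is a hypothetical illustration, not the actual patch:
toy_acct_isolated() and toy_putback() merely stand in for
acct_isolated() and putback_movable_pages(), and the account_on_abort
flag is an illustrative device for comparing the two behaviours.

```c
/* Toy model of the accounting only; not kernel code. */
struct toy_counts {
	long nr_isolated;	/* stands in for the NR_ISOLATED_* counters */
};

/* Stand-in for acct_isolated(): add the isolated pages to the counter. */
static void toy_acct_isolated(struct toy_counts *c, long pages)
{
	c->nr_isolated += pages;
}

/* Stand-in for putback_movable_pages(): subtract the pages put back. */
static void toy_putback(struct toy_counts *c, long pages)
{
	c->nr_isolated -= pages;
}

/*
 * One aborted isolation attempt. With account_on_abort == 0 the
 * increment is skipped, as in the bug described in this thread.
 */
static void toy_isolate_abort(struct toy_counts *c, long pages,
			      int account_on_abort)
{
	if (account_on_abort)
		toy_acct_isolated(c, pages);
	toy_putback(c, pages);
}
```

In this model every abort without the accounting step pushes the
counter down by the number of pages put back, while with the step the
counter returns to where it started.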

>> Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
>> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
>> Cc: stable@xxxxxxxxxxxxxxx # v3.18+
>
> And why -stable?  What user-visible problem is the bug causing?

Commit 35cd78156c "vmscan: throttle direct reclaim when too many pages 
are isolated already" by Rik seems to have introduced this 
congestion_wait() based on too_many_isolated(). The bug it was fixing:

 "When way too many processes go into direct reclaim, it is possible 
for all of the pages to be taken off the LRU. One result of this is that 
the next process in the page reclaim code thinks there are no 
reclaimable pages left and triggers an out of memory kill."

So either this is now prevented by something else and 
too_many_isolated() could go away, or we should restore its 
functionality. Any idea, Rik?
