On Tue, Feb 25, 2020 at 06:51:30PM -0800, Andrew Morton wrote:
> On Tue, 25 Feb 2020 14:15:31 +0000 Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > Ivan Babrou reported the following
>
> http://lkml.kernel.org/r/CABWYdi1eOUD1DHORJxTsWPMT3BcZhz++xP1pXhT=x4SgxtgQZA@xxxxxxxxxxxxxx
> is helpful.

Noted for future reference.

> > Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
> > an external fragmentation event occurs") introduced undesired
> > effects in our environment.
> >
> > * NUMA with 2 x CPU
> > * 128GB of RAM
> > * THP disabled
> > * Upgraded from 4.19 to 5.4
> >
> > Before we saw free memory hover at around 1.4GB with no
> > spikes. After the upgrade we saw some machines decide that they
> > need a lot more than that, with frequent spikes above 10GB,
> > often only on a single numa node.
> >
> > There have been a few reports recently that might be watermark boost
> > related. Unfortunately, finding someone that can reproduce the problem
> > and test a patch has been problematic. This series intends to limit
> > potential damage only.
>
> It's problematic that we don't understand what's happening. And these
> palliatives can only reduce our ability to do that.

Not for certain, no, but we do know that there are conditions whereby
node 0 can end up reclaiming excessively for extended periods of time.
The available evidence does match a pattern whereby a lower zone on
node 0 is getting stuck in a boosted state.

> Rik seems to have the means to reproduce this (or something similar)
> and it seems Ivan can test patches three weeks hence.

If Rik can reproduce it, great, but I have a strong feeling that Ivan
may never be able to test this if it requires a production machine,
which is why I did not wait the three weeks.

> So how about a
> debug patch which will help figure out what's going on in there?

A debug patch would not help much in this case given that we have
tracepoints.
An ftrace capture containing mm_page_alloc_extfrag, mm_vmscan_kswapd_wake,
mm_vmscan_wakeup_kswapd and mm_vmscan_node_reclaim_begin, taken for 30
seconds while the problem is occurring, would be a big help. Ideally
mm_vmscan_lru_shrink_inactive would also be included to capture the
reclaim priority, but the size of the trace is what's going to be
problematic. mm_page_alloc_extfrag would be correlated with the conditions
that boost the watermarks, and the others would track what kswapd is doing
to see if it's persistently reclaiming. If it is,
mm_vmscan_lru_shrink_inactive would tell whether it's persistently
reclaiming at priority DEF_PRIORITY - 2, which would prove the patch would
at least mitigate the problem.

It would be more preferable still to have a description of a testcase that
reproduces the problem, and I'll capture/analyse the trace myself. It
would also be something I could slot into a test grid to catch the problem
happening again in the future.

-- 
Mel Gorman
SUSE Labs
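For anyone wanting to gather the trace described above, a minimal sketch of
the capture command can be built with trace-cmd. This is an illustration,
not a script from the thread: it assumes trace-cmd is installed, that the
capture is run as root, and that the event subsystem prefixes (kmem: for
mm_page_alloc_extfrag, vmscan: for the reclaim events) match the running
kernel's tracepoint layout. The script below only assembles and prints the
command so it can be reviewed before running.

```shell
#!/bin/sh
# Tracepoints Mel asks for; mm_vmscan_lru_shrink_inactive is the optional
# extra that captures the reclaim priority (at the cost of trace size).
EVENTS="kmem:mm_page_alloc_extfrag \
vmscan:mm_vmscan_kswapd_wake \
vmscan:mm_vmscan_wakeup_kswapd \
vmscan:mm_vmscan_node_reclaim_begin \
vmscan:mm_vmscan_lru_shrink_inactive"

# Build the trace-cmd invocation: record each event while sleeping 30s,
# i.e. a 30-second capture window while the problem is occurring.
CMD="trace-cmd record"
for e in $EVENTS; do
    CMD="$CMD -e $e"
done
CMD="$CMD sleep 30"

# Print the command for review; run it (as root) to produce trace.dat,
# then inspect with "trace-cmd report".
echo "$CMD"
```

The resulting trace.dat would then show whether kswapd wakeups persist and
at what priority lru_shrink_inactive is operating.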