Re: zone state overhead

Mel Gorman <mel@xxxxxxxxx> · Wed, 29 Sep 2010 11:03:07 +0100

On Tue, Sep 28, 2010 at 09:02:59PM -0700, David Rientjes wrote:
> On Tue, 28 Sep 2010, Mel Gorman wrote:
> 
> > This is true. It's helpful to remember why this patch exists. Under heavy
> > memory pressure, large machines run the risk of live-locking because the
> > NR_FREE_PAGES gets out of sync. The test case mentioned above is under
> > memory pressure so it is potentially at risk. Ordinarily, we would be less
> > concerned with performance under heavy memory pressure and more concerned with
> > correctness of behaviour. The percpu_drift_mark is set at a point where the
> > risk is "real".  Lowering it will help performance but increase risk. Reducing
> > stat_threshold shifts the cost elsewhere by increasing the frequency the
> > vmstat counters are updated which I considered to be worse overall.
> > 
> > Which of these is better or is there an alternative suggestion on how
> > this livelock can be avoided?
> > 
> 
> I don't think the risk is quite real based on the calculation of 
> percpu_drift_mark using the high watermark instead of the min watermark.  
> For Shaohua's 64 cpu system:
> 
> Node 3, zone   Normal
> pages free     2055926
>         min      1441
>         low      1801
>         high     2161
>         scanned  0
>         spanned  2097152
>         present  2068480
>   vm stats threshold: 98
> 
> It's possible that we'll be 98 pages/cpu * 64 cpus = 6272 pages off in the 
> NR_FREE_PAGES accounting at any given time. 

Right.

> So to avoid depleting memory 
> reserves at the min watermark, which is livelock, and unnecessarily 
> spending time doing reclaim, percpu_drift_mark should be
> 1801 + 6272 = 8073 pages.  Instead, we're currently using the high 
> watermark, so percpu_drift_mark is 8433 pages.
> 

The point of calculating from the high watermark was to prevent kswapd
going to sleep prematurely but if it can be shown the problem goes away
using just the low watermark, I'd go with it. I'm skeptical though for
reasons I outline below.

> It's plausible that we never reclaim sufficient memory that we ever get 
> above the high watermark since we only trigger reclaim when we can't 
> allocate above low, so we may be stuck calling zone_page_state_snapshot() 
> constantly.
> 

Except that zone_page_state_snapshot() is only called while kswapd is
awake which is the proxy indicator of pressure. Just being below
percpu_drift_mark is not enough to call zone_page_state_snapshot.

> I'd be interested to see if this patch helps.
> ---
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -154,7 +154,7 @@ static void refresh_zone_stat_thresholds(void)
>  		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
>  		max_drift = num_online_cpus() * threshold;
>  		if (max_drift > tolerate_drift)
> -			zone->percpu_drift_mark = high_wmark_pages(zone) +
> +			zone->percpu_drift_mark = low_wmark_pages(zone) +
>  					max_drift;
>  	}

Well, this in itself would not fix one problem you highlight: kswapd does
not reclaim enough to keep a zone above percpu_drift_mark, meaning that
the instant it wakes, zone_page_state_snapshot() is in use and stays in
use for as long as kswapd is awake. These are the marks of interest at the
moment:

min		1441
low		1801
high		2161
driftdanger	8433

kswapd can be mostly awake, keeping ahead of the allocators by
maintaining a free level somewhere between low and high while
zone_page_state_snapshot() is continually in use.

Maybe when percpu_drift_mark is set due to large machines, the
watermarks need to change so that high = percpu_drift_mark + low? That
would make the marks

min		1441
low		1801
driftdanger	8073
high		9874

That would improve the situation slightly by widening the window between
kswapd going to sleep and waking up due to memory pressure while also having
a window where kswapd is awake but zone_page_state_snapshot() is not in
use. It doesn't help if the pressure is enough to keep kswapd awake with
the free level stuck between low and driftdanger.

Alternatively we could revisit Christoph's suggestion of modifying
stat_threshold when under pressure instead of using
zone_page_state_snapshot(). Maybe by temporarily reducing stat_threshold
while kswapd is awake to a per-zone value such that

zone->low + threshold*nr_online_cpus < high

?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
