Re: zone state overhead

Mel Gorman <mel@xxxxxxxxx> · Wed, 27 Oct 2010 09:19:11 +0100

On Mon, Oct 25, 2010 at 01:46:19PM +0900, KOSAKI Motohiro wrote:
> > - * Return 1 if free pages are above 'mark'. This takes into account the order
> > + * Return true if free pages are above 'mark'. This takes into account the order
> >   * of the allocation.
> >   */
> > -int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> > -		      int classzone_idx, int alloc_flags)
> > +bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> > +		      int classzone_idx, int alloc_flags, long free_pages)
> 
> static?
> 

Yes, it should be.

> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c5dfabf..ba0c70a 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >  		if (zone->all_unreclaimable)
> >  			continue;
> >  
> > -		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> > +		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
> >  								0, 0))
> >  			return 1;
> >  	}
> 
> Do we need to change balance_pgdat() too?
> Otherwise, balance_pgdat() return immediately and can make semi-infinite busy loop.
> 

While balance_pgdat is calling zone_watermark_ok() the thresholds are
very low and the expected level of drift is minimal.  I considered the
semi-infinite busy loop to have a worst-case situation of 2 seconds until
the vmstat counters were synced and zone_watermark_ok* values matched.
There is an reasonable expectation that normal allocate/free activity would
sync the values for zone_watermark_ok* before that timeout.

To my surprise though, using zone_watermark_ok_safe() in balance_pgdat()
does not significantly increase the amount of time spent in the _safe()
function so it'll be called in the next version.

> 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 355a9e6..ddee139 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +static int calculate_pressure_threshold(struct zone *zone)
> > +{
> > +	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
> > +				num_online_cpus())));
> > +}
> 
> On Shaohua's machine,
> 
> 	CPU: 64
> 	MEM: 8GBx4 (=32GB)
> 	per-cpu vm-stat threashold: 98
> 
> 	zone->min = sqrt(32x1024x1024x16)/4 = 5792 KB = 1448 pages
> 	zone->high - zone->low = zone->min/4 = 362pages
> 	pressure-vm-threshold = 362/64 ~= 5
> 
> Hrm, this reduction seems slightly dramatically (98->5). 

Yes, but consider the maximum possible drift;

	percpu-maximum-drift = 5*64 = 320

The value is massively reduced and the cost goes up but this is the value
necessary to avoid a situation where the high watermark is "ok" when in fact
the min watermark can be breached.

> Shaohua, can you please rerun your problem workload on your 64cpus machine with
> applying this patch?
> Of cource, If there is no performance degression, I'm not against this one.
> 

Your patches that adjusted min and high may allow this threshold to grow again.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>