> > @@ -2378,7 +2378,9 @@ static int kswapd(void *p)
> >  		 */
> >  		if (!sleeping_prematurely(pgdat, order, remaining)) {
> >  			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +			enable_pgdat_percpu_threshold(pgdat);
> >  			schedule();
> > +			disable_pgdat_percpu_threshold(pgdat);
> 
> If we have 4096 CPUs, max drift = 125 x 4096 x 4096 ~= 2GB. That is higher than the
> zone watermark, so such a system can exhaust memory before kswapd calls
> disable_pgdat_percpu_threshold().
> 
> Hmmm....
> This seems to be a fundamental problem. Our current zone watermark and per-cpu stat
> threshold have completely unbalanced definitions:
> 
>   zone watermark:          very small (a few megabytes)
>                            proportional to sqrt(mem)
>                            not proportional to nr-cpus
> 
>   per-cpu stat threshold:  relatively large (desktop: a few megabytes, server: ~50MB, SGI: 2GB ;-)
>                            proportional to log(mem)
>                            proportional to log(nr-cpus)
> 
> It means that many CPUs break the watermark assumption.

I've tried to implement a different patch. The key idea is that the max drift is a very
small value as long as the system doesn't have more than 1024 CPUs.

Three model cases:

Case 1: typical desktop
	CPU: 2
	MEM: 2GB
	max-drift = 2 x log2(2) x log2(2x1024/128) x 2
		  = 40 pages

Case 2: relatively large server
	CPU: 64
	MEM: 8GB x 4 (= 32GB)
	max-drift = 2 x log2(64) x log2(8x1024/128) x 64
		  = 6272 pages
		  = 24.5MB

Case 3: extremely big server
	CPU: 2048
	MEM: 64GB x 256 (= 16TB)
	max-drift = 125 x 2048
		  = 256000 pages
		  = 1000MB

So, I think we can accept a 20MB memory waste for good performance, but we can't accept
a 1000MB waste ;-)

Fortunately, >1000 CPU machines are nowadays almost always used for HPC, where the
zone_page_state_snapshot() overhead isn't a big concern.

So, my idea is:

	Cases 1 and 2: reserve max-drift pages up front.
	Case 3: use zone_page_state_snapshot()