> > @@ -2378,7 +2378,9 @@ static int kswapd(void *p)
> >  		 */
> >  		if (!sleeping_prematurely(pgdat, order, remaining)) {
> >  			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +			enable_pgdat_percpu_threshold(pgdat);
> >  			schedule();
> > +			disable_pgdat_percpu_threshold(pgdat);
> 
> If we have 4096 CPUs, max drift = 125 x 4096 x 4096 ~= 2GB. That is higher than the
> zone watermark, so such a system can exhaust memory before kswapd calls
> disable_pgdat_percpu_threshold().
> 
> Hmmm....
> This seems to be a fundamental problem. Our current zone watermark and per-cpu stat
> threshold have completely unbalanced definitions:
> 
>   zone watermark:          very small (a few megabytes)
>                            proportional to sqrt(mem)
>                            not proportional to nr-cpus
> 
>   per-cpu stat threshold:  relatively large (desktop: a few megabytes, server: ~50MB, SGI: 2GB ;-)
>                            proportional to log(mem)
>                            proportional to log(nr-cpus)
> 
> It means that many CPUs break the watermark assumption.

I've tried to implement a different patch. The key idea is that the max drift is a very
small value as long as the system doesn't have more than 1024 CPUs.

Three model cases:

Case 1: typical desktop
	CPU: 2
	MEM: 2GB
	max-drift = 2 x log2(2) x log2(2x1024/128) x 2
		  = 40 pages

Case 2: relatively large server
	CPU: 64
	MEM: 8GB x 4 (= 32GB)
	max-drift = 2 x log2(64) x log2(8x1024/128) x 64
		  = 6272 pages
		  = 24.5MB

Case 3: extremely big server
	CPU: 2048
	MEM: 64GB x 256 (= 16TB)
	max-drift = 125 x 2048
		  = 256000 pages
		  = 1000MB

So, I think we can accept a 20MB memory waste for good performance, but we can't accept
a 1000MB waste ;-)

Fortunately, >1000 CPU machines are nowadays almost always used for HPC, where the
zone_page_state_snapshot() overhead isn't a big concern.

So, my idea is:

	Cases 1 and 2: reserve max-drift pages up front.
	Case 3: use zone_page_state_snapshot()