Hello, > ==== CUT HERE ==== > mm: page allocator: Adjust the per-cpu counter threshold when memory is low > > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > memory is low] noted that watermarks were based on the vmstat > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > maintained on a per-cpu basis and drained both periodically and when a > threshold is above a threshold. On large CPU systems, the difference > between the estimate and real value of NR_FREE_PAGES can be very high. > The system can get into a case where pages are allocated far below the > min watermark potentially causing livelock issues. The commit solved the > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > Unfortately, as reported by Shaohua Li this accurate reading can consume > a large amount of CPU time on systems with many sockets due to cache > line bouncing. This patch takes a different approach. For large machines > where counter drift might be unsafe and while kswapd is awake, the per-cpu > thresholds for the target pgdat are reduced to limit the level of drift > to what should be a safe level. This incurs a performance penalty in heavy > memory pressure by a factor that depends on the workload and the machine but > the machine should function correctly without accidentally exhausting all > memory on a node. There is an additional cost when kswapd wakes and sleeps > but the event is not expected to be frequent - in Shaohua's test case, > there was one recorded sleep and wake event at least. Interesting. I've reveiwed this one. > > Reported-by: Shaohua Li <shaohua.li@xxxxxxxxx> > Signed-off-by: Mel Gorman <mel@xxxxxxxxx> > --- > include/linux/mmzone.h | 6 ------ > include/linux/vmstat.h | 2 ++ > mm/mmzone.c | 21 --------------------- > mm/page_alloc.c | 4 ++-- > mm/vmscan.c | 2 ++ > mm/vmstat.c | 42 +++++++++++++++++++++++++++++++++++++++++- > 6 files changed, 47 insertions(+), 30 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 3984c4e..343fd5c 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone) > return test_bit(ZONE_OOM_LOCKED, &zone->flags); > } > > -#ifdef CONFIG_SMP > -unsigned long zone_nr_free_pages(struct zone *zone); > -#else > -#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES) > -#endif /* CONFIG_SMP */ > - > /* > * The "priority" of VM scanning is how much of the queues we will scan in one > * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h > index eaaea37..c67d333 100644 > --- a/include/linux/vmstat.h > +++ b/include/linux/vmstat.h > @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); > extern void __dec_zone_state(struct zone *, enum zone_stat_item); > > void refresh_cpu_vm_stats(int); > +void disable_pgdat_percpu_threshold(pg_data_t *pgdat); > +void enable_pgdat_percpu_threshold(pg_data_t *pgdat); > #else /* CONFIG_SMP */ > > /* > diff --git a/mm/mmzone.c b/mm/mmzone.c > index e35bfb8..f5b7d17 100644 > --- a/mm/mmzone.c > +++ b/mm/mmzone.c > @@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn, > return 1; > } > #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ > - > -#ifdef CONFIG_SMP > -/* Called when a more accurate view of NR_FREE_PAGES is needed */ > -unsigned long zone_nr_free_pages(struct zone *zone) > -{ > - unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES); > - > - /* > - * While kswapd is awake, it is considered the zone is under some > - * memory pressure. Under pressure, there is a risk that > - * per-cpu-counter-drift will allow the min watermark to be breached > - * potentially causing a live-lock. While kswapd is awake and > - * free pages are low, get a better estimate for free pages > - */ > - if (nr_free_pages < zone->percpu_drift_mark && > - !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) > - return zone_page_state_snapshot(zone, NR_FREE_PAGES); Now, we can remove zone_page_state_snapshot() too. > - > - return nr_free_pages; > -} > -#endif /* CONFIG_SMP */ > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index a8cfa9c..a9b4542 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > { > /* free_pages my go negative - that's OK */ > long min = mark; > - long free_pages = zone_nr_free_pages(z) - (1 << order) + 1; > + long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1; > int o; > > if (alloc_flags & ALLOC_HIGH) > @@ -2436,7 +2436,7 @@ void show_free_areas(void) > " all_unreclaimable? %s" > "\n", > zone->name, > - K(zone_nr_free_pages(zone)), > + K(zone_page_state(zone, NR_FREE_PAGES)), > K(min_wmark_pages(zone)), > K(low_wmark_pages(zone)), > K(high_wmark_pages(zone)), > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c5dfabf..47ba29e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2378,7 +2378,9 @@ static int kswapd(void *p) > */ > if (!sleeping_prematurely(pgdat, order, remaining)) { > trace_mm_vmscan_kswapd_sleep(pgdat->node_id); > + enable_pgdat_percpu_threshold(pgdat); > schedule(); > + disable_pgdat_percpu_threshold(pgdat); If we have 4096 cpus, max drift = 125x4096x4096 ~= 2GB. It is higher than zone watermark. Then, such sysmtem can makes memory exshost before kswap call disable_pgdat_percpu_threshold(). Hmmm.... This seems fundamental problem. current our zone watermark and per-cpu stat threshold have completely unbalanced definition. zone watermak: very few (few mega bytes) propotional sqrt(mem) no propotional nr-cpus per-cpu stat threshold: relatively large (desktop: few mega bytes, server ~50MB, SGI 2GB ;-) propotional log(mem) propotional log(nr-cpus) It mean, much cpus break watermark assumption..... > } else { > if (remaining) > count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 355a9e6..19bd4a1 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat); > > #ifdef CONFIG_SMP > > +static int calculate_pressure_threshold(struct zone *zone) > +{ > + return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) / > + num_online_cpus()))); > +} > + > static int calculate_threshold(struct zone *zone) > { > int threshold; > @@ -159,6 +165,40 @@ static void refresh_zone_stat_thresholds(void) > } > } > > +void disable_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + > + for_each_populated_zone(zone) { > + if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat) > + continue; We don't use for_each_populated_zone() and "zone->zone_pgdat != pgdat" combination in almost place. for (i = 0; i < pgdat->nr_zones; i++) { is enough? > + > + threshold = calculate_pressure_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > > + } > +} > + > +void enable_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + > + for_each_populated_zone(zone) { > + if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat) > + continue; > + > + threshold = calculate_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } > +} disable_pgdat_percpu_threshold() and enable_pgdat_percpu_threshold() are almostly same. can you merge them? > + > /* > * For use when we know that interrupts are disabled. > */ > @@ -826,7 +866,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > "\n scanned %lu" > "\n spanned %lu" > "\n present %lu", > - zone_nr_free_pages(zone), > + zone_page_state(zone, NR_FREE_PAGES), > min_wmark_pages(zone), > low_wmark_pages(zone), > high_wmark_pages(zone), > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@xxxxxxxxxx For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>