Hi > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index c5dfabf..47ba29e 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -2378,7 +2378,9 @@ static int kswapd(void *p) > > > */ > > > if (!sleeping_prematurely(pgdat, order, remaining)) { > > > trace_mm_vmscan_kswapd_sleep(pgdat->node_id); > > > + enable_pgdat_percpu_threshold(pgdat); > > > schedule(); > > > + disable_pgdat_percpu_threshold(pgdat); > > > > If we have 4096 cpus, max drift = 125x4096x4096 ~= 2GB. It is higher than zone watermark. > > Then, such sysmtem can makes memory exshost before kswap call disable_pgdat_percpu_threshold(). > > > > I don't *think* so but lets explore that possibility. For this to occur, all > CPUs would have to be allocating all of their memory from the one node (4096 > CPUs is not going to be UMA) which is not going to happen. But allocations > from one node could be falling over to others of course. > > Lets take an early condition that has to occur for a 4096 CPU machine to > get into trouble - node 0 exhausted and moving to node 1 and counter drift > makes us think everything is fine. > > __alloc_pages_nodemask > -> get_page_from_freelist > -> zone_watermark_ok == true (because we are drifting) > -> buffered_rmqueue > -> __rmqueue (fails eventually, no pages despite watermark_ok) > -> __alloc_pages_slowpath > -> wake_all_kswapd() > ... > > kswapd wakes > -> disable_pgdat_percpu_threshold() > > i.e. as each node becomes exhausted in reality, kswapd will wake up, disable > the thresholds until the high watermark is back and go back to sleep. I'm > not seeing how we'd get into a situation where all kswapds are asleep at the > same time while each allocator allocates all of memory without managing to > wake kswapd. Even GFP_ATOMIC allocations will wakeup kswapd. > > Hence, I think the current patch of disabling thresholds while kswapd is > awake to be sufficient to avoid livelock due to memory exhaustion and > counter drift. > In this case, wakeup_kswapd() don't wake kswapd because --------------------------------------------------------------------------------- void wakeup_kswapd(struct zone *zone, int order) { pg_data_t *pgdat; if (!populated_zone(zone)) return; pgdat = zone->zone_pgdat; if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) return; // HERE --------------------------------------------------------------------------------- So, if we take your approach, we need to know exact free pages in this. But, zone_page_state_snapshot() is slow. that's dilemma. > > Hmmm.... > > This seems fundamental problem. current our zone watermark and per-cpu stat threshold have completely > > unbalanced definition. > > > > zone watermak: very few (few mega bytes) > > propotional sqrt(mem) > > no propotional nr-cpus > > > > per-cpu stat threshold: relatively large (desktop: few mega bytes, server ~50MB, SGI 2GB ;-) > > propotional log(mem) > > propotional log(nr-cpus) > > > > It mean, much cpus break watermark assumption..... > > > > They are for different things. watermarks are meant to prevent livelock > due to memory exhaustion. per-cpu thresholds are so that counters have > acceptable performance. The assumptions of watermarks remain the same > but we have to correctly handle when counter drift can break watermarks. ok. > > > +void enable_pgdat_percpu_threshold(pg_data_t *pgdat) > > > +{ > > > + struct zone *zone; > > > + int cpu; > > > + int threshold; > > > + > > > + for_each_populated_zone(zone) { > > > + if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat) > > > + continue; > > > + > > > + threshold = calculate_threshold(zone); > > > + for_each_online_cpu(cpu) > > > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > > > + = threshold; > > > + } > > > +} > > > > disable_pgdat_percpu_threshold() and enable_pgdat_percpu_threshold() are > > almostly same. can you merge them? > > > > I wondered the same but as thresholds are calculated per-zone, I didn't see > how that could be handled in a unified function without using a callback > function pointer. If I used callback functions and an additional boolean, I > could merge refresh_zone_stat_thresholds(), disable_pgdat_percpu_threshold() > and enable_pgdat_percpu_threshold() but I worried the end-result would be > a bit unreadable and hinder review. I could roll a standalone patch that > merges the three if we end up agreeing on this patches general approach > to counter drift. ok, I think you are right. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>