On Fri, Jul 22, 2011 at 2:01 AM, Mel Gorman <mgorman@xxxxxxx> wrote: > On Fri, Jul 22, 2011 at 01:36:49AM +0900, Minchan Kim wrote: >> > > > <SNIP> >> > > > @@ -2740,17 +2742,23 @@ static int kswapd(void *p) >> > > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; >> > > > set_freezable(); >> > > > >> > > > - order = 0; >> > > > - classzone_idx = MAX_NR_ZONES - 1; >> > > > + order = new_order = 0; >> > > > + classzone_idx = new_classzone_idx = pgdat->nr_zones - 1; >> > > > for ( ; ; ) { >> > > > - unsigned long new_order; >> > > > - int new_classzone_idx; >> > > > int ret; >> > > > >> > > > - new_order = pgdat->kswapd_max_order; >> > > > - new_classzone_idx = pgdat->classzone_idx; >> > > > - pgdat->kswapd_max_order = 0; >> > > > - pgdat->classzone_idx = MAX_NR_ZONES - 1; >> > > > + /* >> > > > + * If the last balance_pgdat was unsuccessful it's unlikely a >> > > > + * new request of a similar or harder type will succeed soon >> > > > + * so consider going to sleep on the basis we reclaimed at >> > > > + */ >> > > > + if (classzone_idx >= new_classzone_idx && order == new_order) { >> > > > + new_order = pgdat->kswapd_max_order; >> > > > + new_classzone_idx = pgdat->classzone_idx; >> > > > + pgdat->kswapd_max_order = 0; >> > > > + pgdat->classzone_idx = pgdat->nr_zones - 1; >> > > > + } >> > > > + >> > > >> > > But in this part. >> > > Why do we need this? >> > >> > Lets say it's a fork-heavy workload and it is routinely being woken >> > for order-1 allocations and the highest zone is very small. For the >> > most part, it's ok because the allocations are being satisfied from >> > the lower zones which kswapd has no problem balancing. >> > >> > However, by reading the information even after failing to >> > balance, kswapd continues balancing for order-1 due to reading >> > pgdat->kswapd_max_order, each time failing for the highest zone. It >> > only takes one wakeup request per balance_pgdat() to keep kswapd >> > awake trying to balance the highest zone in a continual loop. >> >> You made balace_pgdat's classzone_idx as communicated back so classzone_idx returned >> would be not high zone and in [1/4], you changed that sleeping_prematurely consider only >> classzone_idx not nr_zones. So I think it should sleep if low zones is balanced. >> > > If a wakeup for order-1 happened during the last pgdat, the > classzone_idx as communicated back from balance_pgdat() is lost and it > will not sleep in this ordering of events > > kswapd other processes > ====== =============== > order = balance_pgdat(pgdat, order, &classzone_idx); > wakeup for order-1 > kswapd balances lower zone > allocate from lower zone > balance_pgdat fails balance for highest zone, returns > with lower classzone_idx and possibly lower order > new_order = pgdat->kswapd_max_order (order == 1) > new_classzone_idx = pgdat->classzone_idx (highest zone) > if (order < new_order || classzone_idx > new_classzone_idx) { > order = new_order; > classzone_idx = new_classzone_idx; (failure from balance_pgdat() lost) > } > order = balance_pgdat(pgdat, order, &classzone_idx); > > The wakup for order-1 at any point during balance_pgdat() is enough to > keep kswapd awake even though the process that called wakeup_kswapd > would be able to allocate from the lower zones without significant > difficulty. > > This is why if balance_pgdat() fails its request, it should go to sleep > if watermarks for the lower zones are met until woken by another > process. Hmm. The role of kswapd is to reclaim pages by background until all of zone meet HIGH_WMARK to prevent costly direct reclaim.(Of course, there is another reason like GFP_ATOMIC). So it's not wrong to consume many cpu usage by design unless other tasks are ready. It would be balanced or unreclaimable at last so it should end up. However, the problem is small part of highest zone is easily [set|reset] to be all_unreclaimabe so the situation could be forever like our example. So fundamental solution is to prevent it that all_unreclaimable is set/reset easily, I think. Unfortunately it have no idea now. In different viewpoint, the problem is that it's too excessive because kswapd is just best-effort and if it got fails, we have next wakeup and even direct reclaim as last resort. In such POV, I think this patch is right and it would be a good solution. Then, other concern is on your reply about KOSAKI's question. I think below your patch is needed. Quote from " 1. Read for balance-request-A (order, classzone) pair 2. Fail balance_pgdat 3. Sleep based on (order, classzone) pair 4. Wake for balance-request-B (order, classzone) pair where balance-request-B != balance-request-A 5. Succeed balance_pgdat 6. Compare order,classzone with balance-request-A which will treat balance_pgdat() as fail and try go to sleep This is not the same as new_classzone_idx being "garbage" but is it what you mean? If so, is this your proposed fix? diff --git a/mm/vmscan.c b/mm/vmscan.c index fe854d7..1a518e6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2770,6 +2770,8 @@ static int kswapd(void *p) kswapd_try_to_sleep(pgdat, order, classzone_idx); order = pgdat->kswapd_max_order; classzone_idx = pgdat->classzone_idx; + new_order = order; + new_classzone_idx = classzone_idx; " - Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href