On Thu, Jul 27, 2017 at 06:06:57PM +0200, Vlastimil Babka wrote:
> This patch deals with a corner case found when testing kcompactd with a very
> simple testcase that first fragments memory (by creating a large shmem file and
> then punching hole in every even page) and then uses artificial order-9
> GFP_NOWAIT allocations in a loop. This is freshly after virtme-run boot in KVM
> and no other activity.
>
> What happens is that kswapd always reclaims too little to get over
> compact_gap() in kswapd_shrink_node(), so it doesn't set sc->order to 0, thus
> "goto kswapd_try_sleep" in kswapd() doesn't happen. In the next iteration of
> kswapd() loop, alloc_order and reclaim_order is read again from
> pgdat->kswapd_order, which the previous iteration has reset to 0 and there was
> no other kswapd wakeup meanwhile (the workload inserts short sleeps between
> allocations). With the working order 0, node appears balanced and
> wakeup_kcompactd() does nothing.
>

The risk with a change like this is that it introduces
kswapd-stuck-at-100%-cpu behaviour when reclaiming for high-order pages.
Consider for example this part

> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
> +/*
> + * Return true if kswapd fully slept because pgdat was balanced and there was
> + * no premature wakeup.
> + */
> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
>  				unsigned int classzone_idx)
>  {
>  	long remaining = 0;
>  	DEFINE_WAIT(wait);
> +	bool ret = false;
>
>  	if (freezing(current) || kthread_should_stop())
> -		return;
> +		return false;
>
>  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> ...
> @@ -3493,23 +3491,32 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>
> -	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
>  	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
>  		bool ret;
>
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> +		alloc_order = reclaim_order = max(alloc_order, pgdat->kswapd_order);
>  		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
>
>  kswapd_try_sleep:
> -		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> -					classzone_idx);
> -
> -		/* Read the new order and classzone_idx */
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> -		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> +		if (kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> +					classzone_idx)) {
> +
> +			/* Read the new order and classzone_idx */
> +			alloc_order = reclaim_order = pgdat->kswapd_order;
> +			classzone_idx = kswapd_classzone_idx(pgdat, 0);
> +			pgdat->kswapd_order = 0;
> +			pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> +		} else {
> +			/*
> +			 * We failed to sleep, so continue on the current order
> +			 * and classzone_idx, unless they increased.
> +			 */
> +			alloc_order = max(alloc_order, pgdat->kswapd_order);
> +			reclaim_order = max(reclaim_order, pgdat->kswapd_order) ;
> +			classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> +		}
>
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())

kswapd_try_to_sleep() returns true only if it fully slept. Now, consider a
case where kswapd is woken for order-9, fails, and there are streaming
allocators that keep kswapd awake between the low and high watermarks. Even
though all subsequent wakeups are potentially for order-0, the false
branch above keeps kswapd at order-9. You should be very wary of keeping
kswapd awake for high-order allocations; instead, either defer to
kcompactd or push the work into direct reclaim.
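
To make the concern concrete, here is a minimal userspace sketch of just the
order bookkeeping in the patched loop. It is not kernel code: struct
kswapd_model, model_wakeup(), model_after_sleep_attempt() and
simulate_streaming_order0() are hypothetical names, pending_order stands in
for pgdat->kswapd_order, and the assumption that a wakeup records the maximum
requested order is mine. Reclaim itself, watermarks and classzone_idx are
left out.

```c
#include <stdbool.h>

/* Hypothetical model: only the order tracking, no reclaim or watermarks. */
struct kswapd_model {
	int alloc_order;
	int reclaim_order;
	int pending_order;	/* stands in for pgdat->kswapd_order */
};

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

/* Assumption: a wakeup records the highest order requested so far. */
static void model_wakeup(struct kswapd_model *m, int order)
{
	m->pending_order = max_int(m->pending_order, order);
}

/* Mirrors the if/else that follows kswapd_try_to_sleep() in the quoted hunk. */
static void model_after_sleep_attempt(struct kswapd_model *m, bool fully_slept)
{
	if (fully_slept) {
		/* Read the new order and reset the pending request. */
		m->alloc_order = m->reclaim_order = m->pending_order;
		m->pending_order = 0;
	} else {
		/* Failed to sleep: keep the current order unless it increased. */
		m->alloc_order = max_int(m->alloc_order, m->pending_order);
		m->reclaim_order = max_int(m->reclaim_order, m->pending_order);
	}
}

/*
 * One order-9 wakeup, followed by nr_wakeups order-0 wakeups that each
 * arrive before kswapd manages to sleep fully. Returns the order kswapd
 * is still working on at the end.
 */
static int simulate_streaming_order0(int nr_wakeups)
{
	struct kswapd_model m = { 0, 0, 0 };
	int i;

	model_wakeup(&m, 9);
	model_after_sleep_attempt(&m, true);	/* picks up order 9 */

	for (i = 0; i < nr_wakeups; i++) {
		model_wakeup(&m, 0);
		model_after_sleep_attempt(&m, false);
	}
	return m.alloc_order;
}
```

In this model the working order only ever drops after a full sleep, so any
number of order-0 wakeups that prevent sleeping leave it at 9.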

-- 
Mel Gorman
SUSE Labs