On Thu, Apr 22, 2021 at 2:17 PM Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>
> On 4/22/21 10:13 AM, Yu Zhao wrote:
> > @@ -3302,6 +3252,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  				gfp_t gfp_mask, nodemask_t *nodemask)
> >  {
> > +	int nr_cpus;
> >  	unsigned long nr_reclaimed;
> >  	struct scan_control sc = {
> >  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> > @@ -3334,8 +3285,17 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  	set_task_reclaim_state(current, &sc.reclaim_state);
> >  	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
> >
> > +	nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
> > +	while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
> > +		if (schedule_timeout_killable(HZ / 10))
>
> 100 msec seems like a long time to wait. The original code in shrink_inactive_list
> chose a 100 msec sleep because the sleep happens only once in the while loop and
> 100 msec was used to check for stalling. In this case the loop can go on for a while,
> and the number of reclaimers can drop below the limit sooner than 100 msec. Seems
> like it should be checked more often.

You are not looking at the original code -- the original code sleeps
indefinitely. It was changed by commit db73ee0d46 to fix a problem
that doesn't apply to the code above.

HZ/10 is purely arbitrary but that's ok because we assume normally
nobody hits it. If you do often, we need to figure out why and how
not to hit it so often.
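
To make the pattern concrete, here is a minimal userspace sketch of the same idea:
an atomic counter capped at a limit, with latecomers retrying after a short,
arbitrary sleep. It is an analogue, not the kernel code; MAX_WORKERS, NTHREADS,
add_unless() and the 100 ms backoff are illustrative stand-ins for nr_cpus,
pgdat->nr_reclaimers, atomic_add_unless() and HZ / 10.

	/*
	 * Userspace analogue of the admission throttle discussed above:
	 * at most MAX_WORKERS threads enter the throttled section at once;
	 * the rest retry after a short sleep. Build with: cc -pthread
	 */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <time.h>

	#define MAX_WORKERS	4
	#define NTHREADS	16

	static atomic_int nr_reclaimers;

	/* Add a to *v unless *v == u; mimics the kernel's atomic_add_unless(). */
	static bool add_unless(atomic_int *v, int a, int u)
	{
		int old = atomic_load(v);

		while (old != u) {
			if (atomic_compare_exchange_weak(v, &old, old + a))
				return true;
		}
		return false;
	}

	static void *worker(void *arg)
	{
		long id = (long)arg;
		/* ~100 ms backoff, analogous to the arbitrary HZ / 10 above. */
		const struct timespec backoff = { .tv_nsec = 100 * 1000 * 1000 };

		/* Sleep-and-retry until a slot frees up, like the while loop above. */
		while (!add_unless(&nr_reclaimers, 1, MAX_WORKERS))
			nanosleep(&backoff, NULL);

		printf("thread %ld working (%d in flight)\n",
		       id, atomic_load(&nr_reclaimers));

		/* ... the work whose concurrency is being limited ... */

		atomic_fetch_sub(&nr_reclaimers, 1);	/* release the slot */
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[NTHREADS];

		for (long i = 0; i < NTHREADS; i++)
			pthread_create(&tid[i], NULL, worker, (void *)i);
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(tid[i], NULL);
		return 0;
	}

The point of the sketch is only that the backoff period is a tuning knob, not a
correctness issue: if the cap is rarely hit, the sleep length barely matters,
which is the argument being made above for HZ / 10.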