On Thu, Apr 22, 2021 at 12:52 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Thu, Apr 22, 2021 at 10:13 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> [...]
> >  	spin_lock_irq(&lruvec->lru_lock);
> > @@ -3302,6 +3252,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  				gfp_t gfp_mask, nodemask_t *nodemask)
> >  {
> > +	int nr_cpus;
> >  	unsigned long nr_reclaimed;
> >  	struct scan_control sc = {
> >  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> > @@ -3334,8 +3285,17 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  	set_task_reclaim_state(current, &sc.reclaim_state);
> >  	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
> >
> > +	nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
>
> kswapd does not call this function (directly or indirectly).
>
> > +	while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
>
> At most nr_nodes * nr_cpus direct reclaimers are allowed?
>
> > +		if (schedule_timeout_killable(HZ / 10))
>
> trace_mm_vmscan_direct_reclaim_end() and set_task_reclaim_state(NULL)?
>
> > +			return SWAP_CLUSTER_MAX;
> > +	}
> > +
> >  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> >
> > +	if (nr_cpus)
> > +		atomic_dec(&pgdat->nr_reclaimers);
> > +
> >  	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> >  	set_task_reclaim_state(current, NULL);
>
> BTW I think this approach needs to be more sophisticated. What if a
> direct reclaimer within the reclaim is scheduled away and is out of
> CPU quota?

More sophisticated to what end? We wouldn't worry about a similar
scenario where we run out of CPU quota while holding a resource like a
mutex, so why is this one different, especially given that we already
allow many reclaimers to run concurrently?
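
On the trace_mm_vmscan_direct_reclaim_end() / set_task_reclaim_state(NULL)
point above: the killable wait would indeed need to unwind the state set up
just before it. A rough, untested sketch of that early-return path against
the quoted hunk (reusing the names from the patch; what value the trace
event should report for an aborted reclaim is still an open question):

	nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
	while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
		/* a nonzero return means a fatal signal cut the wait short */
		if (schedule_timeout_killable(HZ / 10)) {
			/* undo the setup done before the wait loop */
			trace_mm_vmscan_direct_reclaim_end(SWAP_CLUSTER_MAX);
			set_task_reclaim_state(current, NULL);
			return SWAP_CLUSTER_MAX;
		}
	}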