On Wed 28-04-21 09:05:06, Yu Zhao wrote:
> On Wed, Apr 28, 2021 at 5:55 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
[...]
> > > @@ -3334,8 +3285,17 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > >  	set_task_reclaim_state(current, &sc.reclaim_state);
> > >  	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
> > >
> > > +	nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
> > > +	while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
> > > +		if (schedule_timeout_killable(HZ / 10))
> > > +			return SWAP_CLUSTER_MAX;
> > > +	}
> > > +
> > >  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> > >
> > > +	if (nr_cpus)
> > > +		atomic_dec(&pgdat->nr_reclaimers);
> > > +
> > >  	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> > >  	set_task_reclaim_state(current, NULL);
> >
> > This will surely break any memcg direct reclaim.
>
> Mind elaborating how it will "surely" break any memcg direct reclaim?

I was wrong here. I thought this was done in a common path for all direct
reclaimers (I likely mixed up try_to_free_pages with do_try_to_free_pages).
Sorry about the confusion.

Still, I do not think that the above heuristic will work properly.
Different reclaimers have different reclaim targets (e.g. lower zones
and/or a numa node mask) and different strength (e.g. GFP_NOFS vs.
GFP_KERNEL). A simple count based throttling would be prone to different
sorts of priority inversions.
--
Michal Hocko
SUSE Labs
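
[Editor's illustrative sketch, not part of the original thread: a minimal
userspace simulation of the count-based admission the patch proposes. The
struct reclaim_context, NR_CPUS constant and function names below are made
up for illustration; the point is only that admission looks at the count
alone, regardless of the reclaimer's gfp mask, zone index or nodemask,
which is where the priority inversion concern comes from.]

/* Build with: cc -std=c11 -o throttle_sketch throttle_sketch.c */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4			/* stand-in for num_online_cpus() */

static atomic_int nr_reclaimers;	/* stand-in for pgdat->nr_reclaimers */

struct reclaim_context {		/* hypothetical, for illustration */
	const char *name;
	bool can_enter_fs;		/* GFP_KERNEL vs. GFP_NOFS */
	int highest_zoneidx;		/* reclaim target */
};

/* Mirrors atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus). */
static bool try_enter_reclaim(const struct reclaim_context *rc)
{
	int cur = atomic_load(&nr_reclaimers);

	/*
	 * Admission only checks the counter.  A GFP_NOFS reclaimer
	 * limited to a low zone can hold a slot while a GFP_KERNEL
	 * reclaimer that could actually make progress has to back off.
	 */
	while (cur < NR_CPUS) {
		if (atomic_compare_exchange_weak(&nr_reclaimers, &cur, cur + 1)) {
			printf("%s admitted (slots used: %d/%d)\n",
			       rc->name, cur + 1, NR_CPUS);
			return true;
		}
	}
	printf("%s throttled despite its stronger context\n", rc->name);
	return false;
}

int main(void)
{
	struct reclaim_context weak = { "GFP_NOFS/low-zone reclaimer", false, 1 };
	struct reclaim_context strong = { "GFP_KERNEL reclaimer", true, 2 };

	/* Fill every slot with weak reclaimers ... */
	for (int i = 0; i < NR_CPUS; i++)
		try_enter_reclaim(&weak);

	/* ... and the stronger one is rejected purely by the count. */
	try_enter_reclaim(&strong);
	return 0;
}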