On Thu, Apr 29, 2021 at 4:00 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Wed 28-04-21 09:05:06, Yu Zhao wrote:
> > On Wed, Apr 28, 2021 at 5:55 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> [...]
> > > > @@ -3334,8 +3285,17 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > >  	set_task_reclaim_state(current, &sc.reclaim_state);
> > > >  	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
> > > >
> > > > +	nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
> > > > +	while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
> > > > +		if (schedule_timeout_killable(HZ / 10))
> > > > +			return SWAP_CLUSTER_MAX;
> > > > +	}
> > > > +
> > > >  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> > > >
> > > > +	if (nr_cpus)
> > > > +		atomic_dec(&pgdat->nr_reclaimers);
> > > > +
> > > >  	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> > > >  	set_task_reclaim_state(current, NULL);
> > >
> > > This will surely break any memcg direct reclaim.
> >
> > Mind elaborating how it will "surely" break any memcg direct reclaim?
>
> I was wrong here. I thought this is done in a common path for all direct
> reclaimers (likely mixed up try_to_free_pages with do_try_to_free_pages).
> Sorry about the confusion.
>
> Still, I do not think that the above heuristic will work properly.
> Different reclaimers have a different reclaim target (e.g. lower zones
> and/or numa node mask) and strength (e.g. GFP_NOFS vs. GFP_KERNEL). A
> simple count based throttling would be prone to different sorts of
> priority inversions.

I see where your concern is coming from. Let's look at it from multiple
angles, and hopefully this will clear things up.

1, looking into this approach:

This approach limits the number of direct reclaimers without any bias.
It doesn't favor or disfavor anybody. IOW, everyone has an equal chance
to run, regardless of the reclaim parameters. So where does the
inversion come from?
2, comparing it with the existing code:

Both try to limit direct reclaims: one by the number of isolated pages
and the other by the number of concurrent direct reclaimers. Neither
number is correlated with any of the parameters you mentioned above,
except the following:

too_many_isolated()
{
	...
	/*
	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
	 * won't get blocked by normal direct-reclaimers, forming a circular
	 * deadlock.
	 */
	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
		inactive >>= 3;
	...
}

Let's look at the commit that added the above:

commit 3cf23841b4b7 ("mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()")
Date:   Tue Dec 18 14:23:31 2012 -0800

    Neil found that if too_many_isolated() returns true while performing
    direct reclaim we can end up waiting for other threads to complete their
    direct reclaim.  If those threads are allowed to enter the FS or IO to
    free memory, but this thread is not, then it is possible that those
    threads will be waiting on this thread and so we get a circular deadlock.

    some task enters direct reclaim with GFP_KERNEL
      => too_many_isolated() false
        => vmscan and run into dirty pages
          => pageout()
            => take some FS lock
              => fs/block code does GFP_NOIO allocation
                => enter direct reclaim again
                  => too_many_isolated() true
                    => waiting for others to progress, however the other
                       tasks may be circular waiting for the FS lock..

Hmm, how could reclaim be recursive nowadays?

__alloc_pages_slowpath()
{
	...
	/* Avoid recursion of direct reclaim */
	if (current->flags & PF_MEMALLOC)
		goto nopage;

	/* Try direct reclaim and then allocating */
	page = __alloc_pages_direct_reclaim()
	...
}

Let's assume it still could; do you remember the following commit?

commit db73ee0d4637 ("mm, vmscan: do not loop on too_many_isolated for ever")
Date:   Wed Sep 6 16:21:11 2017 -0700

If too_many_isolated() doesn't loop forever anymore, how could the above
deadlock happen? IOW, why would we need the first commit nowadays?
If you don't remember the second commit, let me jog your memory:

Author: Michal Hocko <mhocko@xxxxxxxx>

3, thinking abstractly:

A problem that is hard to solve in one domain can become a walk in the
park in another domain. This problem is a perfect example: it's
difficult to solve based on the number of isolated pages, but it
becomes a lot easier based on the number of direct reclaimers.

But there is a caveat: when we transform to a new domain, we need to
preserve the "reclaim target and strength" you mentioned. Fortunately,
there is nothing to preserve, because the existing code has none, given
that the "__GFP_IO | __GFP_FS" check in too_many_isolated() is obsolete.

Does it make sense?