On Thu, Apr 29, 2021 at 4:00 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Wed 28-04-21 09:05:06, Yu Zhao wrote:
> > On Wed, Apr 28, 2021 at 5:55 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> [...]
> > > > @@ -3334,8 +3285,17 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > >  	set_task_reclaim_state(current, &sc.reclaim_state);
> > > >  	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
> > > >
> > > > +	nr_cpus = current_is_kswapd() ? 0 : num_online_cpus();
> > > > +	while (nr_cpus && !atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
> > > > +		if (schedule_timeout_killable(HZ / 10))
> > > > +			return SWAP_CLUSTER_MAX;
> > > > +	}
> > > > +
> > > >  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> > > >
> > > > +	if (nr_cpus)
> > > > +		atomic_dec(&pgdat->nr_reclaimers);
> > > > +
> > > >  	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> > > >  	set_task_reclaim_state(current, NULL);
> > >
> > > This will surely break any memcg direct reclaim.
> >
> > Mind elaborating how it will "surely" break any memcg direct reclaim?
>
> I was wrong here. I thought this is done in a common path for all direct
> reclaimers (likely mixed up try_to_free_pages with do_try_to_free_pages).
> Sorry about the confusion.
>
> Still, I do not think that the above heuristic will work properly.
> Different reclaimers have a different reclaim target (e.g. lower zones
> and/or numa node mask) and strength (e.g. GFP_NOFS vs. GFP_KERNEL). A
> simple count based throttling would be prone to different sorts of
> priority inversions.

I see where your concern is coming from. Let's look at it from multiple
angles, and hopefully this will clear things up.

1, looking into this approach:

This approach limits the number of direct reclaimers without any bias.
It doesn't favor or disfavor anybody. IOW, everyone has an equal chance
to run, regardless of the reclaim parameters. So where does the
inversion come from?
2, comparing it with the existing code:

Both try to limit direct reclaims: one by the number of isolated pages
and the other by the number of concurrent direct reclaimers. Neither
number is correlated with any of the parameters you mentioned above,
except the following:

too_many_isolated()
{
	...
	/*
	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
	 * won't get blocked by normal direct-reclaimers, forming a circular
	 * deadlock.
	 */
	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
		inactive >>= 3;
	...
}

Let's look at the commit that added the above:

commit 3cf23841b4b7 ("mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()")
Date:   Tue Dec 18 14:23:31 2012 -0800

    Neil found that if too_many_isolated() returns true while performing
    direct reclaim we can end up waiting for other threads to complete their
    direct reclaim.  If those threads are allowed to enter the FS or IO to
    free memory, but this thread is not, then it is possible that those
    threads will be waiting on this thread and so we get a circular deadlock.

    some task enters direct reclaim with GFP_KERNEL
      => too_many_isolated() false
        => vmscan and run into dirty pages
          => pageout()
            => take some FS lock
              => fs/block code does GFP_NOIO allocation
                => enter direct reclaim again
                  => too_many_isolated() true
                    => waiting for others to progress, however the other
                       tasks may be circular waiting for the FS lock..

Hmm, how could reclaim be recursive nowadays?

__alloc_pages_slowpath()
{
	...
	/* Avoid recursion of direct reclaim */
	if (current->flags & PF_MEMALLOC)
		goto nopage;

	/* Try direct reclaim and then allocating */
	page = __alloc_pages_direct_reclaim()
	...
}

Let's assume it still could; do you remember the following commit?

commit db73ee0d4637 ("mm, vmscan: do not loop on too_many_isolated for ever")
Date:   Wed Sep 6 16:21:11 2017 -0700

If too_many_isolated() doesn't loop forever anymore, how could the above
deadlock happen? IOW, why would we need the first commit nowadays?
If you don't remember the second commit, let me jog your memory:

Author: Michal Hocko <mhocko@xxxxxxxx>

3, thinking abstractly:

A problem that is hard to solve in one domain can become a walk in the
park in another domain. This problem is a perfect example: it's
difficult to solve based on the number of isolated pages, but it
becomes a lot easier based on the number of direct reclaimers.

But there is a caveat: when we transform to a new domain, we need to
preserve the "reclaim target and strength" you mentioned. Fortunately,
there is nothing to preserve, because the existing code has none, given
that the "__GFP_IO | __GFP_FS" check in too_many_isolated() is obsolete.

Does it make sense?