On Thu, Oct 31, 2024 at 2:00 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> > > > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> > > > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > > > >>>> then the mitigation being tried here won't apply.
> > > > > >>>
> > > > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > > > >>> entire system is under memory pressure. I guess we would want the same
> > > > > >>> "mitigation" either way.
> > > > > >>>
> > > > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > > > >> in memcg?
> > > > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > > > >> in cgroup and limit memory.max.
> > > > > >
> > > > > > You mean a benchmark that puts the entire system under memory
> > > > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > > > you have, among other factors.
> > > > > >
> > > > > > What if you run the kernel build test in a VM? Then you can limit its
> > > > > > size like a memcg, although you'd probably need to leave more room
> > > > > > because the entire guest OS will also be subject to the same limit.
> > > > > >
> > > > > I had tried this, but the variance in time/zswap numbers was very high.
> > > > > Much higher than the AMD numbers I posted in reply to Barry. So I found
> > > > > it very difficult to make comparisons.
> > > > Hmm yeah, maybe more factors come into play with global memory
> > > > pressure. I am honestly not sure how to test this scenario, and I
> > > > suspect variance will be high anyway.
> > > >
> > > > We can just try to use whatever technique we use for the memcg limit
> > > > though, if possible, right?
> > > You can boot a physical machine with mem=1G on the commandline, which
> > > restricts the physical range of memory that will be initialized.
> > > Double check /proc/meminfo after boot, because part of that physical
> > > range might not be usable RAM.
> > >
> > > I do this quite often to test physical memory pressure with workloads
> > > that don't scale up easily, like kernel builds.
> > > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > > > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> > > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> > > > > >>>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> > > > > >>>> well.
> > > +1
> > > I also think there is too much focus on cgroup alone. The bigger issue
> > > seems to be how much optimistic volume we swap in when we're under
> > > pressure already. This applies to large folios and readahead; global
> > > memory availability and cgroup limits.
> >
> > Agreed, although the characteristics of large folios and readahead are
> > different. But yeah, different flavors of the same problem.
> > >
> > > It happens to manifest with THP in cgroups because that's what you
> > > guys are testing. But IMO, any solution to this problem should
> > > consider the wider scope.
> >
> > +1, and I really think this should be addressed separately, not just
> > rely on large block compression/decompression to offset the cost. It's
> > probably not just a zswap/zram problem anyway, it just happens to be
> > what we support large folio swapin for.
>
> Agreed, these are two separate issues and should both be investigated,
> though 2 can offset the cost of 1.
> 1. swap thrashing
> 2. large block compression/decompression
>
> For point 1, we likely want to investigate the following:
>
> 1. Whether we can see the same thrashing if we always perform readahead
> (rapidly filling the memcg to full again after reclamation).
>
> 2. Whether there are any issues with balancing file and anon memory
> reclamation.
>
> The 'refault feedback loop' in mglru compares refault rates between anon and
> file pages to decide which type should be prioritized for reclamation.
>
> type = get_type_to_scan(lruvec, swappiness, &tier);
>
> static int get_type_to_scan(struct lruvec *lruvec, int swappiness,
>                             int *tier_idx)
> {
>         ...
>         read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
>         read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
>         type = positive_ctrl_err(&sp, &pv);
>
>         read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
>         for (tier = 1; tier < MAX_NR_TIERS; tier++) {
>                 read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
>                 if (!positive_ctrl_err(&sp, &pv))
>                         break;
>         }
>
>         *tier_idx = tier - 1;
>         return type;
> }
>
> In this case, we may want to investigate whether reclamation is primarily
> targeting anonymous memory due to potential errors in the statistics path
> after mTHP is involved.
>
> 3. Determine if this is a memcg-specific issue by setting mem=1GB and
> running the same test on the global system.
>
> Yosry, Johannes, Usama,
> Is there anything else that might interest us?
>
> I'll get back to you after completing the investigation mentioned above.

Thanks for looking into this. Perhaps a naive question, but is this only
related to swap faults? Can the same scenario happen with other types of
faults allocating large folios (e.g. faulting in a file page, or a new
anon allocation)? Do swap faults use a different policy for determining
the folio order, or is it just that swap faults are naturally more
correlated to memory pressure, so that's how the issue was surfaced?
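
To make that last question concrete, here is a tiny userspace toy of how I
*think* the two order-selection policies differ. Every name below is made up
for illustration (none of this is kernel code), and the idea that swapin is
additionally capped by how many contiguous swap entries back the fault is my
assumption, so please correct me if that part is wrong:

/*
 * Toy model, not kernel code: contrast a hypothetical anon-fault order
 * policy with a hypothetical swapin order policy. Both honor the same
 * "enabled orders" mask (think per-size mTHP controls); the swapin side
 * is additionally capped by how many contiguous swap entries back the
 * faulting address (my assumption).
 */
#include <stdio.h>

#define PMD_ORDER 9

/* Pick the highest enabled order that does not exceed max_order. */
static int pick_order(unsigned int enabled_mask, int max_order)
{
        int order;

        for (order = max_order; order > 0; order--)
                if (enabled_mask & (1u << order))
                        return order;
        return 0; /* order-0 is always allowed in this toy */
}

/* Hypothetical anon fault: only how much of the VMA fits limits the order. */
static int anon_fault_order(unsigned int enabled_mask, int vma_fit_order)
{
        return pick_order(enabled_mask, vma_fit_order);
}

/* Hypothetical swapin fault: additionally capped by contiguous swap entries. */
static int swapin_fault_order(unsigned int enabled_mask, int vma_fit_order,
                              int contig_swap_order)
{
        int max = vma_fit_order < contig_swap_order ?
                  vma_fit_order : contig_swap_order;

        return pick_order(enabled_mask, max);
}

int main(void)
{
        /* orders 0, 4 (64K) and 9 (PMD) enabled in this toy */
        unsigned int enabled = (1u << 0) | (1u << 4) | (1u << PMD_ORDER);

        printf("anon fault order:   %d\n",
               anon_fault_order(enabled, PMD_ORDER));
        printf("swapin fault order: %d\n",
               swapin_fault_order(enabled, PMD_ORDER, 4));
        return 0;
}

(The numbers are arbitrary; the toy is only meant to spell out which caps I
have in mind, not how the kernel actually decides.)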