On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Usama Arif <usamaarif642@xxxxxxxxx> writes:
>
> > On 04/11/2024 06:42, Huang, Ying wrote:
> >> Johannes Weiner <hannes@xxxxxxxxxxx> writes:
> >>
> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> >>>>>>>>> then the mitigation being tried here won't apply.
> >>>>>>>>
> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
> >>>>>>>> entire system is under memory pressure. I guess we would want the same
> >>>>>>>> "mitigation" either way.
> >>>>>>>>
> >>>>>>> What would be a good open source benchmark/workload to test without limiting
> >>>>>>> memory in memcg?
> >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
> >>>>>>> in a cgroup and limit memory.max.
> >>>>>>
> >>>>>> You mean a benchmark that puts the entire system under memory
> >>>>>> pressure? I am not sure, it ultimately depends on the size of memory
> >>>>>> you have, among other factors.
> >>>>>>
> >>>>>> What if you run the kernel build test in a VM? Then you can limit its
> >>>>>> size like a memcg, although you'd probably need to leave more room
> >>>>>> because the entire guest OS will also be subject to the same limit.
> >>>>>>
> >>>>>
> >>>>> I had tried this, but the variance in time/zswap numbers was very high,
> >>>>> much higher than the AMD numbers I posted in reply to Barry, so I found
> >>>>> it very difficult to make comparisons.
> >>>>
> >>>> Hmm, yeah, maybe more factors come into play with global memory
> >>>> pressure. I am honestly not sure how to test this scenario, and I
> >>>> suspect variance will be high anyway.
> >>>>
> >>>> We can just try to use whatever technique we use for the memcg limit
> >>>> though, if possible, right?
> >>>
> >>> You can boot a physical machine with mem=1G on the command line, which
> >>> restricts the physical range of memory that will be initialized.
> >>> Double-check /proc/meminfo after boot, because part of that physical
> >>> range might not be usable RAM.
> >>>
> >>> I do this quite often to test physical memory pressure with workloads
> >>> that don't scale up easily, like kernel builds.
> >>>
> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
> >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
> >>>>>>>>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> >>>>>>>>> well.
> >>>
> >>> +1
> >>>
> >>> I also think there is too much focus on cgroup alone. The bigger issue
> >>> seems to be how much optimistic volume we swap in when we're under
> >>> pressure already. This applies to large folios and readahead, and to
> >>> global memory availability and cgroup limits alike.
> >>
> >> The current swap readahead logic is something like:
> >>
> >> 1. try to readahead some pages for a sequential access pattern, and mark
> >> them as readahead
> >>
> >> 2. if these readahead pages get accessed before being swapped out again,
> >> increase the 'hits' counter
> >>
> >> 3. for the next swap-in, try to readahead 'hits' pages and clear 'hits'.
> >>
> >> So, if there's heavy memory pressure, the readahead pages will not be
> >> accessed before being swapped out again (step 2 above), and the readahead
> >> window will stay minimal.
> >>
> >> IMHO, mTHP swap-in is a kind of swap readahead in effect. That is, in
> >> addition to the accessed page being swapped in, the adjacent pages are
> >> swapped in (read ahead) too. If these readahead pages are not accessed
> >> before being swapped out again, the system runs into more severe
> >> thrashing. This is because we lack the swap readahead window scaling
> >> mechanism described above. And this is why I suggested before to combine
> >> the swap readahead mechanism and mTHP swap-in by default. That is, when
> >> the kernel swaps in a page, it checks the current swap readahead window
> >> and decides the mTHP order according to the window size. So, if memory
> >> pressure is heavy enough that the nearby pages will not be accessed
> >> before being swapped out again, the mTHP swap-in order can be adjusted
> >> automatically.
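If I understand the suggestion correctly, the combination could look roughly
like the sketch below. This is purely illustrative C pseudocode: the helpers
swapin_ra_window() and swapin_order_from_window() are made-up names, not
existing kernel functions.

/*
 * Illustrative sketch only: scale the readahead window from 'hits'
 * (steps 1-3 above), then derive an mTHP swap-in order from the window.
 */
static unsigned int swapin_ra_window(unsigned int hits, unsigned int max_win)
{
	/* No recent readahead hits -> shrink back to a single page. */
	unsigned int win = hits ? 2 * hits : 1;

	return win < max_win ? win : max_win;
}

static unsigned int swapin_order_from_window(unsigned int win)
{
	/* A window of N pages maps to an order-log2(N) folio, 0 for N == 1. */
	unsigned int order = 0;

	while ((2U << order) <= win)
		order++;
	return order;
}

The point being that under thrashing 'hits' collapses, so the swap-in order
falls back to 0, while a well-behaved access pattern can still grow the
window (and therefore the folio order) up to the ceiling.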
> >
> > This is a good idea to do, but I think the issue is that readahead
> > is a folio flag and not a page flag, so it only works when the folio size is 1.
> >
> > In the swapin_readahead swapcache path, the current implementation decides
> > the ra_window based on hits, which is incremented in swap_cache_get_folio
> > if the folio has not been gotten from the swapcache before.
> > The problem would be that we need information on how many distinct pages in
> > a large folio that has been swapped in have been accessed to decide the
> > hits/window size, which I don't think is possible, as once the entire large
> > folio has been swapped in, we won't get a fault.
> >
>
> To do that, we need to move the readahead flag from per-folio to per-page.
> And we need to map only the accessed page of the folio in the page fault
> handler. This may impact performance. So, we may do that only for sampled
> folios, for example, every 100 folios.

I'm not entirely sure there's a chance to gain traction on this, as the
current trend clearly leans toward moving flags from page to folio, not
from folio to page :-)

>
> >>
> >>> It happens to manifest with THP in cgroups because that's what you
> >>> guys are testing. But IMO, any solution to this problem should
> >>> consider the wider scope.
> >>>
> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am
> >>>>>>>> also not sure if the readahead window adjusts on memory pressure or
> >>>>>>>> not.
> >>>>>>>>
> >>>>>>> The readahead window doesn't look at memory pressure. So maybe the same thing is
> >>>>>>> being seen here as there would be in swapin_readahead?
> >>>>>>
> >>>>>> Maybe readahead is not as aggressive in general as large folio
> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> >>>>>> of the window is the smaller of page_cluster (2 or 3) and
> >>>>>> SWAP_RA_ORDER_CEILING (5).
> >>>>> Yes, I was seeing 8-page swap-ins (order 3) when testing. So it might
> >>>>> be similar to enabling 32K mTHP?
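For reference, the arithmetic behind that 8-page / order-3 figure, assuming
4K base pages and page_cluster = 3 (an illustrative sketch, not the actual
swap_vma_ra_win() code):

/* Illustrative only: the effective readahead ceiling discussed above. */
#define SWAP_RA_ORDER_CEILING	5

static unsigned int page_cluster = 3;	/* /proc/sys/vm/page-cluster */

static unsigned int max_ra_pages(void)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
			     page_cluster : SWAP_RA_ORDER_CEILING;

	return 1U << order;	/* 1 << 3 = 8 pages = 32K with 4K pages */
}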
> >>>>
> >>>> Not quite.
> >>>
> >>> Actually, I would expect it to be...
> >>
> >> Me too.
> >>
> >>>>>> Also, readahead will swap in 4k folios AFAICT, so we don't need a
> >>>>>> contiguous allocation like large folio swapin. So that could be
> >>>>>> another factor why readahead may not reproduce the problem.
> >>>>
> >>>> Because of this ^.
> >>>
> >>> ...this matters for the physical allocation, which might require more
> >>> reclaim and compaction to produce the 32k. But an earlier version of
> >>> Barry's patch did the cgroup margin fallback after the THP was already
> >>> physically allocated, and it still helped.
> >>>
> >>> So the issue in this test scenario seems to be mostly about cgroup
> >>> volume. And then eight 4k charges should be equivalent to a single 32k
> >>> charge when it comes to cgroup pressure.
>
> --
> Best Regards,
> Huang, Ying
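To make that charge-volume point concrete (illustrative arithmetic only,
assuming 4K base pages):

/* Illustrative only: the memcg sees the same charge volume either way. */
#define ASSUMED_PAGE_SIZE	4096UL

unsigned long charge_eight_4k_faults = 8 * ASSUMED_PAGE_SIZE;	/* 32768 bytes */
unsigned long charge_one_32k_folio   = ASSUMED_PAGE_SIZE << 3;	/* 32768 bytes */

Either way the memcg is charged 32K; the large folio just takes the whole
charge in one step instead of eight.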