On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
> On 30/10/2024 21:01, Yosry Ahmed wrote:
> > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >> On 30/10/2024 19:51, Yosry Ahmed wrote:
> >>> [..]
> >>>>> My second point about the mitigation is as follows: For a system (or
> >>>>> memcg) under severe memory pressure, especially one without hardware
> >>>>> TLB optimization, is enabling mTHP always the right choice? Since mTHP
> >>>>> operates at a larger granularity, some internal fragmentation is
> >>>>> unavoidable, regardless of optimization. Could the mitigation code help
> >>>>> in automatically tuning this fragmentation?
> >>>>>
> >>>> I agree with the point that always enabling mTHP is not the right thing
> >>>> to do on all platforms. I also think it might be the case that enabling
> >>>> mTHP might be a good thing for some workloads, but enabling mTHP swapin
> >>>> along with it might not.
> >>>>
> >>>> As you said, when you have apps switching between foreground and
> >>>> background in Android, it probably makes sense to have large folio
> >>>> swapping, as you want to bring in all the pages from the background app
> >>>> as quickly as possible, and you also get all the TLB optimizations and
> >>>> smaller LRU overhead after you have brought in all the pages.
> >>>> The Linux kernel build test doesn't really get to benefit from the TLB
> >>>> optimization and smaller LRU overhead, as the pages are probably very
> >>>> short-lived. So I think it doesn't show the benefit of large folio
> >>>> swapin properly, and large folio swapin should probably be disabled for
> >>>> this kind of workload, even though mTHP should be enabled.
> >>>>
> >>>> I am not sure that the approach we are trying in this patch is the
> >>>> right way:
> >>>> - This patch makes it a memcg issue, but you could have memcg disabled,
> >>>> and then the mitigation being tried here won't apply.
> >>>
> >>> Is the problem reproducible without memcg? I imagine only if the
> >>> entire system is under memory pressure. I guess we would want the same
> >>> "mitigation" either way.
> >>>
> >> What would be a good open source benchmark/workload to test without
> >> limiting memory in memcg?
> >> For the kernel build test, I can only get zswap activity to happen if I
> >> build in a cgroup and limit memory.max.
> >
> > You mean a benchmark that puts the entire system under memory
> > pressure? I am not sure; it ultimately depends on the size of memory
> > you have, among other factors.
> >
> > What if you run the kernel build test in a VM? Then you can limit its
> > size like a memcg, although you'd probably need to leave more room
> > because the entire guest OS will also be subject to the same limit.
> >
> I had tried this, but the variance in time/zswap numbers was very high,
> much higher than the AMD numbers I posted in reply to Barry, so I found
> it very difficult to make a comparison.

Hmm, yeah, maybe more factors come into play with global memory
pressure. I am honestly not sure how to test this scenario, and I
suspect variance will be high anyway. We can just try to use whatever
technique we use for the memcg limit though, if possible, right?

>
> >> I can just run zswap large folio zswapin in production and see, but
> >> that will take me a few days. tbh, running in prod is a much better
> >> test, and if there isn't any sort of thrashing, then maybe it's not
> >> really an issue? I believe Barry doesn't see an issue on Android phones
> >> (but please correct me if I am wrong), and if there isn't an issue in
> >> Meta production as well, it's a good data point for servers as well.
> >> And maybe a kernel build in a 4G memcg is not a good test.
> >
> > If there is a regression in the kernel build, this means some
> > workloads may be affected, even if Meta's prod isn't. I understand
> > that the benchmark is not very representative of real world workloads,
> > but in this instance I think the thrashing problem surfaced by the
> > benchmark is real.
> >
> >>
> >>>> - Instead of this being a large folio swapin issue, is it more of a
> >>>> readahead issue? If we zswap (without the large folio swapin series)
> >>>> and change the window to 1 in swap_vma_readahead, we might see an
> >>>> improvement in Linux kernel build time when cgroup memory is limited,
> >>>> as readahead would probably cause swap thrashing as well.
> >>>
> >>> I think large folio swapin would make the problem worse anyway. I am
> >>> also not sure if the readahead window adjusts on memory pressure or
> >>> not.
> >>>
> >> The readahead window doesn't look at memory pressure. So maybe the same
> >> thing is being seen here as there would be in swapin_readahead?
> >
> > Maybe readahead is not as aggressive in general as large folio
> > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > of the window is the smaller of page_cluster (2 or 3) and
> > SWAP_RA_ORDER_CEILING (5).
> Yes, I was seeing 8 pages swapped in (order 3) when testing. So might
> be similar to enabling 32K mTHP?

Not quite.

> > Also readahead will swap in 4k folios AFAICT, so we don't need a
> > contiguous allocation like large folio swapin. So that could be
> > another factor why readahead may not reproduce the problem.

Because of this ^.
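To make the comparison concrete, here is a rough userspace sketch of the
cap I am describing (this is not the actual mm/swap_state.c code; IIRC
the real swap_vma_ra_win() also scales the window up and down based on
recent readahead hits, which this ignores, and I am assuming the
page_cluster default of 2 or 3 mentioned above):

/*
 * Rough sketch of the readahead window cap vs. an mTHP swapin, assuming
 * SWAP_RA_ORDER_CEILING == 5 and page_cluster defaulting to 2 or 3.
 */
#include <stdio.h>

#define SWAP_RA_ORDER_CEILING	5

/* Max number of order-0 pages VMA readahead would pull in. */
static unsigned int ra_max_pages(unsigned int page_cluster)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
				page_cluster : SWAP_RA_ORDER_CEILING;

	return 1U << order;
}

/*
 * Pages brought in by a single mTHP swapin of the given order, which
 * must come from one contiguous (order-N) folio allocation.
 */
static unsigned int mthp_pages(unsigned int order)
{
	return 1U << order;
}

int main(void)
{
	printf("readahead cap (page_cluster=3): %u x 4K pages, each its own folio\n",
	       ra_max_pages(3));
	printf("64K mTHP swapin (order 4):      %u x 4K pages, one contiguous folio\n",
	       mthp_pages(4));
	return 0;
}

So even when the page counts match (e.g. 8 order-0 pages vs. one 32K
folio), the readahead path never needs a single contiguous allocation,
while the mTHP swapin path does.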