On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 30/10/2024 19:51, Yosry Ahmed wrote:
> > [..]
> >>> My second point about the mitigation is as follows: For a system (or
> >>> memcg) under severe memory pressure, especially one without hardware TLB
> >>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> >>> a larger granularity, some internal fragmentation is unavoidable, regardless
> >>> of optimization. Could the mitigation code help in automatically tuning
> >>> this fragmentation?
> >>
> >> I agree with the point that enabling mTHP always is not the right thing to do
> >> on all platforms. I also think it might be the case that enabling mTHP
> >> might be a good thing for some workloads, but enabling mTHP swapin along with
> >> it might not.
> >>
> >> As you said when you have apps switching between foreground and background
> >> in android, it probably makes sense to have large folio swapping, as you
> >> want to bringin all the pages from background app as quickly as possible.
> >> And also all the TLB optimizations and smaller lru overhead you get after
> >> you have brought in all the pages.
> >> Linux kernel build test doesnt really get to benefit from the TLB optimization
> >> and smaller lru overhead, as probably the pages are very short lived. So I
> >> think it doesnt show the benefit of large folio swapin properly and
> >> large folio swapin should probably be disabled for this kind of workload,
> >> eventhough mTHP should be enabled.
> >>
> >> I am not sure that the approach we are trying in this patch is the right way:
> >> - This patch makes it a memcg issue, but you could have memcg disabled and
> >> then the mitigation being tried here wont apply.
> >
> > Is the problem reproducible without memcg? I imagine only if the
> > entire system is under memory pressure. I guess we would want the same
> > "mitigation" either way.
> >
> What would be a good open source benchmark/workload to test without limiting memory
> in memcg?
> For the kernel build test, I can only get zswap activity to happen if I build
> in cgroup and limit memory.max.

You mean a benchmark that puts the entire system under memory pressure? I am
not sure; it ultimately depends on the size of memory you have, among other
factors.

What if you run the kernel build test in a VM? Then you can limit its size
like a memcg, although you'd probably need to leave more room because the
entire guest OS will also be subject to the same limit.

>
> I can just run zswap large folio zswapin in production and see, but that will take me a few
> days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
> then maybe its not really an issue? I believe Barry doesnt see an issue in android
> phones (but please correct me if I am wrong), and if there isnt an issue in Meta
> production as well, its a good data point for servers as well. And maybe
> kernel build in 4G memcg is not a good test.

If there is a regression in the kernel build, this means some workloads may
be affected, even if Meta's prod isn't. I understand that the benchmark is
not very representative of real world workloads, but in this instance I
think the thrashing problem surfaced by the benchmark is real.

>
> >> - Instead of this being a large folio swapin issue, is it more of a readahead
> >> issue? If we zswap (without the large folio swapin series) and change the window
> >> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> >> when cgroup memory is limited as readahead would probably cause swap thrashing as
> >> well.
> >
> > I think large folio swapin would make the problem worse anyway. I am
> > also not sure if the readahead window adjusts on memory pressure or
> > not.
> >
> readahead window doesnt look at memory pressure. So maybe the same thing is being
> seen here as there would be in swapin_readahead?

Maybe readahead is not as aggressive in general as large folio swapins?
Looking at swap_vma_ra_win(), it seems like the maximum order of the window
is the smaller of page_cluster (2 or 3) and SWAP_RA_ORDER_CEILING (5). Also,
readahead will swap in 4k folios AFAICT, so we don't need a contiguous
allocation like large folio swapin. So that could be another factor why
readahead may not reproduce the problem.

> Maybe if we check kernel build test
> performance in 4G memcg with below diff, it might get better?

I think you can use the page_cluster tunable to do this at runtime (see the
sketch after the diff below).

>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4669f29cf555..9e196e1e6885 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>         pgoff_t ilx;
>         bool page_allocated;
>
> -       win = swap_vma_ra_win(vmf, &start, &end);
> +       win = 1;
>         if (win == 1)
>                 goto skip;
>
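
For reference, this is roughly the clamping I am looking at near the top of
swap_vma_ra_win() (paraphrased from my reading of mm/swap_state.c, not a
verbatim copy, so please double-check against your tree):

        /*
         * The maximum readahead window is
         * 2^min(page_cluster, SWAP_RA_ORDER_CEILING) pages, so with
         * vm.page-cluster == 0 it is a single page and readahead is
         * effectively skipped.
         */
        max_win = 1 << min(READ_ONCE(page_cluster), SWAP_RA_ORDER_CEILING);
        if (max_win == 1)
                return 1;       /* no readahead, swap in one 4k page */

IIUC, setting vm.page-cluster to 0 (e.g. sysctl vm.page-cluster=0) should
give the same single-page swapin behavior as forcing win = 1 in the diff
above, without rebuilding the kernel.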