On 30/10/2024 21:01, Yosry Ahmed wrote:
> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>>
>>
>> On 30/10/2024 19:51, Yosry Ahmed wrote:
>>> [..]
>>>>> My second point about the mitigation is as follows: for a system (or
>>>>> memcg) under severe memory pressure, especially one without hardware
>>>>> TLB optimization, is enabling mTHP always the right choice? Since mTHP
>>>>> operates at a larger granularity, some internal fragmentation is
>>>>> unavoidable, regardless of optimization. Could the mitigation code
>>>>> help in automatically tuning this fragmentation?
>>>>>
>>>>
>>>> I agree with the point that always enabling mTHP is not the right
>>>> thing to do on all platforms. It might also be the case that enabling
>>>> mTHP is a good thing for some workloads, but enabling mTHP swapin
>>>> along with it is not.
>>>>
>>>> As you said, when you have apps switching between foreground and
>>>> background in Android, it probably makes sense to have large folio
>>>> swapping, as you want to bring in all the pages from the background
>>>> app as quickly as possible, and you also get all the TLB optimizations
>>>> and smaller LRU overhead after you have brought in all the pages.
>>>> The Linux kernel build test doesn't really benefit from the TLB
>>>> optimization and smaller LRU overhead, as the pages are probably very
>>>> short-lived. So I think it doesn't show the benefit of large folio
>>>> swapin properly, and large folio swapin should probably be disabled
>>>> for this kind of workload, even though mTHP should be enabled.
>>>>
>>>> I am not sure that the approach we are trying in this patch is the
>>>> right way:
>>>> - This patch makes it a memcg issue, but you could have memcg
>>>>   disabled, and then the mitigation being tried here won't apply.
>>>
>>> Is the problem reproducible without memcg? I imagine only if the
>>> entire system is under memory pressure. I guess we would want the same
>>> "mitigation" either way.
>>>
>> What would be a good open source benchmark/workload to test without
>> limiting memory in memcg?
>> For the kernel build test, I can only get zswap activity to happen if
>> I build in a cgroup and limit memory.max.
>
> You mean a benchmark that puts the entire system under memory
> pressure? I am not sure, it ultimately depends on the size of memory
> you have, among other factors.
>
> What if you run the kernel build test in a VM? Then you can limit its
> size like a memcg, although you'd probably need to leave more room
> because the entire guest OS will also be subject to the same limit.
>

I had tried this, but the variance in time/zswap numbers was very high,
much higher than the AMD numbers I posted in reply to Barry, so I found
it very difficult to make a comparison.

>>
>> I can just run large folio zswapin in production and see, but that
>> will take me a few days. TBH, running in prod is a much better test,
>> and if there isn't any sort of thrashing, then maybe it's not really
>> an issue? I believe Barry doesn't see an issue on Android phones (but
>> please correct me if I am wrong), and if there isn't an issue in Meta
>> production as well, that is a good data point for servers too. And
>> maybe the kernel build in a 4G memcg is not a good test.
>
> If there is a regression in the kernel build, this means some
> workloads may be affected, even if Meta's prod isn't. I understand
> that the benchmark is not very representative of real world workloads,
> but in this instance I think the thrashing problem surfaced by the
> benchmark is real.
>
>>
>>>> - Instead of this being a large folio swapin issue, is it more of a
>>>>   readahead issue? If we zswap (without the large folio swapin
>>>>   series) and change the window to 1 in swap_vma_readahead, we might
>>>>   see an improvement in Linux kernel build time when cgroup memory is
>>>>   limited, as readahead would probably cause swap thrashing as well.
>>>
>>> I think large folio swapin would make the problem worse anyway. I am
>>> also not sure if the readahead window adjusts on memory pressure or
>>> not.
>>>
>> The readahead window doesn't look at memory pressure. So maybe the
>> same thing is being seen here as there would be in swapin_readahead?
>
> Maybe readahead is not as aggressive in general as large folio
> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> of the window is the smaller of page_cluster (2 or 3) and
> SWAP_RA_ORDER_CEILING (5).

Yes, I was seeing 8-page swapins (order 3) when testing, so it might be
similar to enabling 32K mTHP?

>
> Also readahead will swap in 4K folios AFAICT, so we don't need a
> contiguous allocation like large folio swapin. So that could be
> another factor why readahead may not reproduce the problem.
>
>>
>> Maybe if we check kernel build test performance in a 4G memcg with the
>> diff below, it might get better?
>
> I think you can use the page_cluster tunable to do this at runtime.
>
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 4669f29cf555..9e196e1e6885 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>>         pgoff_t ilx;
>>         bool page_allocated;
>>
>> -       win = swap_vma_ra_win(vmf, &start, &end);
>> +       win = 1;
>>         if (win == 1)
>>                 goto skip;
>>
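
[Editor's note: below is a minimal userspace sketch of the readahead
window cap discussed above, based only on what the thread states about
swap_vma_ra_win() (the window order is capped at the smaller of
page_cluster and SWAP_RA_ORDER_CEILING); the real kernel function also
shrinks the window based on the fault's access pattern, which is
omitted here, and the helper name max_ra_win is made up for
illustration.]

        /*
         * Userspace sketch (not kernel code) of the readahead window
         * cap described in the thread above.
         */
        #include <stdio.h>

        #define SWAP_RA_ORDER_CEILING   5       /* from mm/swap_state.c */

        /* Hypothetical helper: maximum readahead window, in pages. */
        static unsigned int max_ra_win(unsigned int page_cluster)
        {
                unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
                                     page_cluster : SWAP_RA_ORDER_CEILING;
                return 1u << order;
        }

        int main(void)
        {
                /*
                 * The default page_cluster of 3 caps the window at 8
                 * pages (order 3), i.e. 32K with 4K pages, matching
                 * the swapin sizes reported in the test above.
                 * page_cluster == 0 forces a window of 1, like the
                 * "win = 1" diff.
                 */
                for (unsigned int pc = 0; pc <= 6; pc++)
                        printf("page_cluster=%u -> max window %u pages\n",
                               pc, max_ra_win(pc));
                return 0;
        }

[This is also why the page_cluster tunable mentioned above is the
runtime equivalent of the diff: setting vm.page-cluster (the sysctl
behind /proc/sys/vm/page-cluster) to 0 makes the maximum window 1 page,
so swap_vma_readahead() takes the "win == 1" skip path without a
rebuild.]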