On 30/10/2024 19:51, Yosry Ahmed wrote:
> [..]
>>> My second point about the mitigation is as follows: For a system (or
>>> memcg) under severe memory pressure, especially one without hardware TLB
>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>> of optimization. Could the mitigation code help in automatically tuning
>>> this fragmentation?
>>>
>>
>> I agree with the point that always enabling mTHP is not the right thing to do
>> on all platforms. I also think it might be the case that enabling mTHP
>> is a good thing for some workloads, but enabling mTHP swapin along with
>> it might not be.
>>
>> As you said, when you have apps switching between foreground and background
>> in Android, it probably makes sense to have large folio swapping, as you
>> want to bring in all the pages from the background app as quickly as possible.
>> You also get all the TLB optimizations and smaller LRU overhead after
>> you have brought in all the pages.
>> The Linux kernel build test doesn't really get to benefit from the TLB
>> optimization and smaller LRU overhead, as the pages are probably very short
>> lived. So I think it doesn't show the benefit of large folio swapin properly,
>> and large folio swapin should probably be disabled for this kind of workload,
>> even though mTHP should be enabled.
>>
>> I am not sure that the approach we are trying in this patch is the right way:
>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> then the mitigation being tried here won't apply.
>
> Is the problem reproducible without memcg? I imagine only if the
> entire system is under memory pressure. I guess we would want the same
> "mitigation" either way.
>

What would be a good open source benchmark/workload to test without limiting
memory in memcg? For the kernel build test, I can only get zswap activity to
happen if I build in a cgroup and limit memory.max.

I can just run large folio zswapin in production and see, but that will take
me a few days. tbh, running in prod is a much better test, and if there isn't
any sort of thrashing, then maybe it's not really an issue? I believe Barry
doesn't see an issue on Android phones (but please correct me if I am wrong),
and if there isn't an issue in Meta production as well, that's a good data
point for servers too. And maybe a kernel build in a 4G memcg is not a good
test.

>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> issue? If we zswap (without the large folio swapin series) and change the window
>> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
>> well.
>
> I think large folio swapin would make the problem worse anyway. I am
> also not sure if the readahead window adjusts on memory pressure or
> not.
>

The readahead window doesn't look at memory pressure. So maybe the same thing
is being seen here as there would be in swapin_readahead? Maybe if we check
kernel build test performance in a 4G memcg with the below diff, it might get
better?
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4669f29cf555..9e196e1e6885 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
	pgoff_t ilx;
	bool page_allocated;

-	win = swap_vma_ra_win(vmf, &start, &end);
+	win = 1;
	if (win == 1)
		goto skip;

>> - Instead of looking at cgroup margin, maybe we should try and look at
>> the rate of change of workingset_restore_anon? This might be a lot more complicated
>> to do, but probably is the right metric to determine swap thrashing. It also means
>> that this could be used in both the synchronous swapcache skipping path and
>> swapin_readahead path.
>> (Thanks Johannes for suggesting this)
>>
>> With the large folio swapin, I do see the large improvement when considering only
>> swapin performance and latency in the same way as you saw in zram.
>> Maybe the right short term approach is to have
>> /sys/kernel/mm/transparent_hugepage/swapin
>> and have that disabled by default to avoid regression.
>> If the workload owner sees a benefit, they can enable it.
>> I can add this when sending the next version of large folio zswapin if that makes
>> sense?
>
> I would honestly prefer we avoid this if possible. It's always easy to
> just put features behind knobs, and then users have the toil of
> figuring out if/when they can use it, or just give up. We should find
> a way to avoid the thrashing due to hitting the memcg limit (or being
> under global memory pressure), it seems like something the kernel
> should be able to do on its own.
>
>> Longer term I can try and have a look at if we can do something with
>> workingset_restore_anon to improve things.
>
> I am not a big fan of this, mainly because reading a stat from the
> kernel puts us in a situation where we have to choose between:
> - Doing a memcg stats flush in the kernel, which is something we are
> trying to move away from due to various problems we have been running
> into.
> - Using potentially stale stats (up to 2s), which may be fine but is
> suboptimal at best. We may have blips of thrashing due to stale stats
> not showing the refaults.
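
Just to make the workingset_restore_anon idea above a bit more concrete, this
is roughly the kind of check I had in mind. It is a completely untested,
not-even-compile-tested sketch: the helper name, the threshold and the 2s
sampling window are all made up, a real version would need per-lruvec state
instead of globals, and it still has to answer the flush-vs-stale-stats
problem you mention:

#include <linux/jiffies.h>
#include <linux/memcontrol.h>
#include <linux/swap.h>

/* made-up threshold: anon refaults per sampling window we treat as thrashing */
#define SWAP_THRASH_THRESHOLD	(SWAP_CLUSTER_MAX * 16)

static unsigned long restore_prev;	/* last WORKINGSET_RESTORE_ANON sample */
static unsigned long restore_next_check;	/* jiffies of the next sample */

static bool swap_thrashing(struct lruvec *lruvec)
{
	unsigned long now = jiffies;
	unsigned long restores, delta;
	bool thrashing = false;

	/* only resample every couple of seconds */
	if (time_before(now, READ_ONCE(restore_next_check)))
		return false;

	/* can be up to 2s stale unless we pay for a stats flush */
	restores = lruvec_page_state(lruvec, WORKINGSET_RESTORE_ANON);

	/* rate of change of anon refaults since the last sample */
	delta = restores - restore_prev;
	if (delta > SWAP_THRASH_THRESHOLD)
		thrashing = true;

	restore_prev = restores;
	WRITE_ONCE(restore_next_check, now + 2 * HZ);

	return thrashing;
}

Both the synchronous swapcache-skipping path and swapin_readahead() could then
check something like this and fall back to order-0 swapin / win == 1 while it
returns true.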