[..]
> > My second point about the mitigation is as follows: For a system (or
> > memcg) under severe memory pressure, especially one without hardware TLB
> > optimization, is enabling mTHP always the right choice? Since mTHP operates at
> > a larger granularity, some internal fragmentation is unavoidable, regardless
> > of optimization. Could the mitigation code help in automatically tuning
> > this fragmentation?
> >
>
> I agree with the point that enabling mTHP always is not the right thing to do
> on all platforms. I also think it might be the case that enabling mTHP
> might be a good thing for some workloads, but enabling mTHP swapin along with
> it might not.
>
> As you said when you have apps switching between foreground and background
> in android, it probably makes sense to have large folio swapping, as you
> want to bring in all the pages from background app as quickly as possible.
> And also all the TLB optimizations and smaller lru overhead you get after
> you have brought in all the pages.
> Linux kernel build test doesn't really get to benefit from the TLB optimization
> and smaller lru overhead, as probably the pages are very short lived. So I
> think it doesn't show the benefit of large folio swapin properly and
> large folio swapin should probably be disabled for this kind of workload,
> even though mTHP should be enabled.
>
> I am not sure that the approach we are trying in this patch is the right way:
> - This patch makes it a memcg issue, but you could have memcg disabled and
> then the mitigation being tried here won't apply.

Is the problem reproducible without memcg? I imagine only if the entire
system is under memory pressure. I guess we would want the same
"mitigation" either way.

> - Instead of this being a large folio swapin issue, is it more of a readahead
> issue?
> If we zswap (without the large folio swapin series) and change the window
> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> when cgroup memory is limited as readahead would probably cause swap thrashing as
> well.

I think large folio swapin would make the problem worse anyway. I am
also not sure if the readahead window adjusts on memory pressure or not.

> - Instead of looking at cgroup margin, maybe we should try and look at
> the rate of change of workingset_restore_anon? This might be a lot more complicated
> to do, but probably is the right metric to determine swap thrashing. It also means
> that this could be used in both the synchronous swapcache skipping path and
> swapin_readahead path.
> (Thanks Johannes for suggesting this)
>
> With the large folio swapin, I do see the large improvement when considering only
> swapin performance and latency in the same way as you saw in zram.
> Maybe the right short term approach is to have
> /sys/kernel/mm/transparent_hugepage/swapin
> and have that disabled by default to avoid regression.
> If the workload owner sees a benefit, they can enable it.
> I can add this when sending the next version of large folio zswapin if that makes
> sense?

I would honestly prefer we avoid this if possible. It's always easy to
just put features behind knobs, and then users have the toil of figuring
out if/when they can use it, or just give up. We should find a way to
avoid the thrashing due to hitting the memcg limit (or being under
global memory pressure); it seems like something the kernel should be
able to do on its own.

> Longer term I can try and have a look at if we can do something with
> workingset_restore_anon to improve things.

I am not a big fan of this, mainly because reading a stat from the
kernel puts us in a situation where we have to choose between:
- Doing a memcg stats flush in the kernel, which is something we are
  trying to move away from due to various problems we have been running
  into.
- Using potentially stale stats (up to 2s), which may be fine but is
  suboptimal at best. We may have blips of thrashing due to stale stats
  not showing the refaults.
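
To make the stale-stats tradeoff concrete, here is a rough userspace
sketch (not kernel code, and not what the patch does) of the
rate-of-change idea: sample workingset_restore_anon from memory.stat-style
text at two points in time and turn the delta into restores per second.
The helper names, the threshold, and the policy function are all
hypothetical and only there for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "key" in memory.stat-style text ("name value" per line).
 * Returns 0 and stores the value in *out on success, -1 if the key
 * is not present.
 */
static int stat_lookup(const char *text, const char *key,
		       unsigned long long *out)
{
	size_t klen;
	const char *p = text;

	assert(text && key && out);
	klen = strlen(key);

	while (p && *p) {
		/* Require a space after the key so prefixes do not match. */
		if (strncmp(p, key, klen) == 0 && p[klen] == ' ') {
			*out = strtoull(p + klen + 1, NULL, 10);
			return 0;
		}
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/*
 * Restores per second between two samples taken interval_ms apart.
 * Returns 0.0 for a zero interval or a counter that went backwards.
 */
static double restore_rate(unsigned long long prev, unsigned long long cur,
			   unsigned int interval_ms)
{
	if (interval_ms == 0 || cur < prev)
		return 0.0;
	return (double)(cur - prev) * 1000.0 / interval_ms;
}

/*
 * Hypothetical policy: only allow large folio swapin while the restore
 * rate stays below a threshold. The threshold is an assumed tunable.
 */
static int should_use_large_swapin(double rate, double threshold)
{
	return rate < threshold;
}
```

With a 2s sampling interval, a burst of refaults that starts right after
a sample is invisible to should_use_large_swapin() until the next sample
arrives, which is the kind of thrashing blip I am worried about.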