On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
> On 30/10/2024 21:01, Yosry Ahmed wrote:
> > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >> On 30/10/2024 19:51, Yosry Ahmed wrote:
> >>> [..]
> >>>>> My second point about the mitigation is as follows: For a system (or
> >>>>> memcg) under severe memory pressure, especially one without hardware
> >>>>> TLB optimization, is enabling mTHP always the right choice? Since mTHP
> >>>>> operates at a larger granularity, some internal fragmentation is
> >>>>> unavoidable, regardless of optimization. Could the mitigation code help
> >>>>> in automatically tuning this fragmentation?
> >>>>>
> >>>> I agree with the point that always enabling mTHP is not the right thing
> >>>> to do on all platforms. I also think it might be the case that enabling
> >>>> mTHP might be a good thing for some workloads, but enabling mTHP swapin
> >>>> along with it might not.
> >>>>
> >>>> As you said, when you have apps switching between foreground and
> >>>> background in Android, it probably makes sense to have large folio
> >>>> swapping, as you want to bring in all the pages from the background app
> >>>> as quickly as possible, and you also get all the TLB optimizations and
> >>>> smaller LRU overhead after you have brought in all the pages.
> >>>> The Linux kernel build test doesn't really get to benefit from the TLB
> >>>> optimization and smaller LRU overhead, as the pages are probably very
> >>>> short-lived. So I think it doesn't show the benefit of large folio
> >>>> swapin properly, and large folio swapin should probably be disabled for
> >>>> this kind of workload, even though mTHP should be enabled.
> >>>>
> >>>> I am not sure that the approach we are trying in this patch is the
> >>>> right way:
> >>>> - This patch makes it a memcg issue, but you could have memcg disabled,
> >>>> and then the mitigation being tried here won't apply.
> >>>
> >>> Is the problem reproducible without memcg? I imagine only if the
> >>> entire system is under memory pressure. I guess we would want the same
> >>> "mitigation" either way.
> >>>
> >> What would be a good open source benchmark/workload to test without
> >> limiting memory in memcg?
> >> For the kernel build test, I can only get zswap activity to happen if I
> >> build in a cgroup and limit memory.max.
> >
> > You mean a benchmark that puts the entire system under memory
> > pressure? I am not sure; it ultimately depends on the size of memory
> > you have, among other factors.
> >
> > What if you run the kernel build test in a VM? Then you can limit its
> > size like a memcg, although you'd probably need to leave more room
> > because the entire guest OS will also be subject to the same limit.
> >
> I had tried this, but the variance in time/zswap numbers was very high,
> much higher than the AMD numbers I posted in reply to Barry, so I found
> it very difficult to make a comparison.

Hmm, yeah, maybe more factors come into play with global memory
pressure. I am honestly not sure how to test this scenario, and I
suspect variance will be high anyway. We can just try to use whatever
technique we use for the memcg limit though, if possible, right?

>
> >> I can just run zswap large folio zswapin in production and see, but
> >> that will take me a few days. tbh, running in prod is a much better
> >> test, and if there isn't any sort of thrashing, then maybe it's not
> >> really an issue? I believe Barry doesn't see an issue on Android phones
> >> (but please correct me if I am wrong), and if there isn't an issue in
> >> Meta production as well, it's a good data point for servers as well.
> >> And maybe a kernel build in a 4G memcg is not a good test.
> >
> > If there is a regression in the kernel build, this means some
> > workloads may be affected, even if Meta's prod isn't. I understand
> > that the benchmark is not very representative of real world workloads,
> > but in this instance I think the thrashing problem surfaced by the
> > benchmark is real.
> >
> >>
> >>>> - Instead of this being a large folio swapin issue, is it more of a
> >>>> readahead issue? If we zswap (without the large folio swapin series)
> >>>> and change the window to 1 in swap_vma_readahead, we might see an
> >>>> improvement in Linux kernel build time when cgroup memory is limited,
> >>>> as readahead would probably cause swap thrashing as well.
> >>>
> >>> I think large folio swapin would make the problem worse anyway. I am
> >>> also not sure if the readahead window adjusts on memory pressure or
> >>> not.
> >>>
> >> The readahead window doesn't look at memory pressure. So maybe the same
> >> thing is being seen here as there would be in swapin_readahead?
> >
> > Maybe readahead is not as aggressive in general as large folio
> > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > of the window is the smaller of page_cluster (2 or 3) and
> > SWAP_RA_ORDER_CEILING (5).
> Yes, I was seeing 8 pages swapped in (order 3) when testing. So might
> be similar to enabling 32K mTHP?

Not quite.

> > Also readahead will swap in 4k folios AFAICT, so we don't need a
> > contiguous allocation like large folio swapin. So that could be
> > another factor why readahead may not reproduce the problem.

Because of this ^.
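To make the comparison concrete, here is a rough userspace sketch of the
cap I am describing (this is not the actual mm/swap_state.c code; IIRC
the real swap_vma_ra_win() also scales the window up and down based on
recent readahead hits, which this ignores, and I am assuming the
page_cluster default of 2 or 3 mentioned above):

/*
 * Rough sketch of the readahead window cap vs. an mTHP swapin, assuming
 * SWAP_RA_ORDER_CEILING == 5 and page_cluster defaulting to 2 or 3.
 */
#include <stdio.h>

#define SWAP_RA_ORDER_CEILING	5

/* Max number of order-0 pages VMA readahead would pull in. */
static unsigned int ra_max_pages(unsigned int page_cluster)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
				page_cluster : SWAP_RA_ORDER_CEILING;

	return 1U << order;
}

/*
 * Pages brought in by a single mTHP swapin of the given order, which
 * must come from one contiguous (order-N) folio allocation.
 */
static unsigned int mthp_pages(unsigned int order)
{
	return 1U << order;
}

int main(void)
{
	printf("readahead cap (page_cluster=3): %u x 4K pages, each its own folio\n",
	       ra_max_pages(3));
	printf("64K mTHP swapin (order 4):      %u x 4K pages, one contiguous folio\n",
	       mthp_pages(4));
	return 0;
}

So even when the page counts match (e.g. 8 order-0 pages vs. one 32K
folio), the readahead path never needs a single contiguous allocation,
while the mTHP swapin path does.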