On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > >>>> then the mitigation being tried here won't apply.
> > > >>>
> > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > >>> entire system is under memory pressure. I guess we would want the same
> > > >>> "mitigation" either way.
> > > >>>
> > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > >> in memcg?
> > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > >> in a cgroup and limit memory.max.
> > > >
> > > > You mean a benchmark that puts the entire system under memory
> > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > you have, among other factors.
> > > >
> > > > What if you run the kernel build test in a VM? Then you can limit its
> > > > size like a memcg, although you'd probably need to leave more room
> > > > because the entire guest OS will also be subject to the same limit.
> > > >
> > > I had tried this, but the variance in time/zswap numbers was very high.
> > > Much higher than the AMD numbers I posted in reply to Barry. So I found
> > > it very difficult to make a comparison.
> > Hmm yeah maybe more factors come into play with global memory
> > pressure. I am honestly not sure how to test this scenario, and I
> > suspect variance will be high anyway.
> >
> > We can just try to use whatever technique we use for the memcg limit
> > though, if possible, right?
>
> You can boot a physical machine with mem=1G on the commandline, which
> restricts the physical range of memory that will be initialized.
> Double check /proc/meminfo after boot, because part of that physical
> range might not be usable RAM.
>
> I do this quite often to test physical memory pressure with workloads
> that don't scale up easily, like kernel builds.
>
> > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> > > >>>> well.
>
> +1
>
> I also think there is too much focus on cgroup alone. The bigger issue
> seems to be how much optimistic volume we swap in when we're under
> pressure already. This applies to large folios and readahead; global
> memory availability and cgroup limits.

Agreed, although the characteristics of large folios and readahead are
different. But yeah, different flavors of the same problem.

>
> It happens to manifest with THP in cgroups because that's what you
> guys are testing. But IMO, any solution to this problem should
> consider the wider scope.

+1, and I really think this should be addressed separately, not just
rely on large block compression/decompression to offset the cost.

It's probably not just a zswap/zram problem anyway, it just happens to
be what we support large folio swapin for.
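
For the record, a rough sketch of the two setups discussed above (the
cgroup name and the limit values here are just examples, not the exact
values anyone used):

  # cgroup v2 variant: run the kernel build under a memory.max limit
  # ("buildtest" and 4G are illustrative)
  mkdir /sys/fs/cgroup/buildtest
  echo 4G > /sys/fs/cgroup/buildtest/memory.max
  echo $$ > /sys/fs/cgroup/buildtest/cgroup.procs
  make -j"$(nproc)"

  # Global pressure variant: boot with mem=1G on the kernel commandline,
  # then verify how much of that range is actually usable RAM
  grep MemTotal /proc/meminfo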

>
> > > >>> I think large folio swapin would make the problem worse anyway. I am
> > > >>> also not sure if the readahead window adjusts on memory pressure or
> > > >>> not.
> > > >>>
> > > >> The readahead window doesn't look at memory pressure. So maybe the same thing is being
> > > >> seen here as there would be in swapin_readahead?
> > > >
> > > > Maybe readahead is not as aggressive in general as large folio
> > > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > > > of the window is the smaller of page_cluster (2 or 3) and
> > > > SWAP_RA_ORDER_CEILING (5).
> > > Yes, I was seeing 8-page swapins (order 3) when testing. So it might
> > > be similar to enabling 32K mTHP?
> >
> > Not quite.
>
> Actually, I would expect it to be...
>
> > > > Also, readahead will swap in 4k folios AFAICT, so we don't need a
> > > > contiguous allocation like large folio swapin. So that could be
> > > > another factor why readahead may not reproduce the problem.
> >
> > Because of this ^.
>
> ...this matters for the physical allocation, which might require more
> reclaim and compaction to produce the 32k. But an earlier version of
> Barry's patch did the cgroup margin fallback after the THP was already
> physically allocated, and it still helped.
>
> So the issue in this test scenario seems to be mostly about cgroup
> volume. And then 8 4k charges should be equivalent to a singular 32k
> charge when it comes to cgroup pressure.

In this test scenario, yes, because it's only exercising cgroup
pressure. But if we want a general solution that also addresses global
pressure, I expect large folios to be worse because of the contiguity
and the size (compared to default readahead window sizes).

So I think we shouldn't only test with readahead, as it won't cover
some of the large folio cases.
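
FWIW, the knobs being compared here, in case someone wants to poke at
them (paths are the standard procfs/sysfs ones; 32K is just the size
used as an example in this thread):

  # Readahead window cap: vm.page-cluster (default 3, i.e. at most 8 pages)
  cat /proc/sys/vm/page-cluster

  # Per-size mTHP control, e.g. 32K (order 3 with 4k base pages)
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled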