On Thu, Oct 31, 2024 at 2:00 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> > > > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> > > > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > > > >>>> then the mitigation being tried here won't apply.
> > > > > >>>
> > > > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > > > >>> entire system is under memory pressure. I guess we would want the same
> > > > > >>> "mitigation" either way.
> > > > > >>>
> > > > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > > > >> in memcg?
> > > > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > > > >> in cgroup and limit memory.max.
> > > > > >
> > > > > > You mean a benchmark that puts the entire system under memory
> > > > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > > > you have, among other factors.
> > > > > >
> > > > > > What if you run the kernel build test in a VM? Then you can limit its
> > > > > > size like a memcg, although you'd probably need to leave more room
> > > > > > because the entire guest OS will also be subject to the same limit.
> > > > > >
> > > > > I had tried this, but the variance in time/zswap numbers was very high.
> > > > > Much higher than the AMD numbers I posted in reply to Barry. So I found
> > > > > it very difficult to make comparisons.
> > > > Hmm yeah, maybe more factors come into play with global memory
> > > > pressure. I am honestly not sure how to test this scenario, and I
> > > > suspect variance will be high anyway.
> > > >
> > > > We can just try to use whatever technique we use for the memcg limit
> > > > though, if possible, right?
> > > You can boot a physical machine with mem=1G on the commandline, which
> > > restricts the physical range of memory that will be initialized.
> > > Double check /proc/meminfo after boot, because part of that physical
> > > range might not be usable RAM.
> > >
> > > I do this quite often to test physical memory pressure with workloads
> > > that don't scale up easily, like kernel builds.
> > > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > > > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> > > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> > > > > >>>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> > > > > >>>> well.
> > > +1
> > > I also think there is too much focus on cgroup alone. The bigger issue
> > > seems to be how much optimistic volume we swap in when we're under
> > > pressure already. This applies to large folios and readahead; global
> > > memory availability and cgroup limits.
> >
> > Agreed, although the characteristics of large folios and readahead are
> > different. But yeah, different flavors of the same problem.
> > >
> > > It happens to manifest with THP in cgroups because that's what you
> > > guys are testing. But IMO, any solution to this problem should
> > > consider the wider scope.
> >
> > +1, and I really think this should be addressed separately, not just
> > rely on large block compression/decompression to offset the cost. It's
> > probably not just a zswap/zram problem anyway, it just happens to be
> > what we support large folio swapin for.
>
> Agreed, these are two separate issues and should both be investigated,
> though 2 can offset the cost of 1.
> 1. swap thrashing
> 2. large block compression/decompression
>
> For point 1, we likely want to investigate the following:
>
> 1. Whether we can see the same thrashing if we always perform readahead
> (rapidly filling the memcg to full again after reclamation).
>
> 2. Whether there are any issues with balancing file and anon memory
> reclamation.
>
> The 'refault feedback loop' in mglru compares refault rates between anon and
> file pages to decide which type should be prioritized for reclamation.
>
> type = get_type_to_scan(lruvec, swappiness, &tier);
>
> static int get_type_to_scan(struct lruvec *lruvec, int swappiness,
>                             int *tier_idx)
> {
>         ...
>         read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
>         read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
>         type = positive_ctrl_err(&sp, &pv);
>
>         read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
>         for (tier = 1; tier < MAX_NR_TIERS; tier++) {
>                 read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
>                 if (!positive_ctrl_err(&sp, &pv))
>                         break;
>         }
>
>         *tier_idx = tier - 1;
>         return type;
> }
>
> In this case, we may want to investigate whether reclamation is primarily
> targeting anonymous memory due to potential errors in the statistics path
> after mTHP is involved.
>
> 3. Determine if this is a memcg-specific issue by setting mem=1GB and
> running the same test on the global system.
>
> Yosry, Johannes, Usama,
> Is there anything else that might interest us?
>
> I'll get back to you after completing the investigation mentioned above.

Thanks for looking into this. Perhaps a naive question, but is this only
related to swap faults? Can the same scenario happen with other types of
faults allocating large folios (e.g. faulting in a file page, or a new
anon allocation)? Do swap faults use a different policy for determining
the folio order, or is it just that swap faults are naturally more
correlated to memory pressure, so that's how the issue was surfaced?
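
To make that last question concrete, here is a tiny userspace toy of how I
*think* the two order-selection policies differ. Every name below is made up
for illustration (none of this is kernel code), and the idea that swapin is
additionally capped by how many contiguous swap entries back the fault is my
assumption, so please correct me if that part is wrong:

/*
 * Toy model, not kernel code: contrast a hypothetical anon-fault order
 * policy with a hypothetical swapin order policy. Both honor the same
 * "enabled orders" mask (think per-size mTHP controls); the swapin side
 * is additionally capped by how many contiguous swap entries back the
 * faulting address (my assumption).
 */
#include <stdio.h>

#define PMD_ORDER 9

/* Pick the highest enabled order that does not exceed max_order. */
static int pick_order(unsigned int enabled_mask, int max_order)
{
        int order;

        for (order = max_order; order > 0; order--)
                if (enabled_mask & (1u << order))
                        return order;
        return 0; /* order-0 is always allowed in this toy */
}

/* Hypothetical anon fault: only how much of the VMA fits limits the order. */
static int anon_fault_order(unsigned int enabled_mask, int vma_fit_order)
{
        return pick_order(enabled_mask, vma_fit_order);
}

/* Hypothetical swapin fault: additionally capped by contiguous swap entries. */
static int swapin_fault_order(unsigned int enabled_mask, int vma_fit_order,
                              int contig_swap_order)
{
        int max = vma_fit_order < contig_swap_order ?
                  vma_fit_order : contig_swap_order;

        return pick_order(enabled_mask, max);
}

int main(void)
{
        /* orders 0, 4 (64K) and 9 (PMD) enabled in this toy */
        unsigned int enabled = (1u << 0) | (1u << 4) | (1u << PMD_ORDER);

        printf("anon fault order:   %d\n",
               anon_fault_order(enabled, PMD_ORDER));
        printf("swapin fault order: %d\n",
               swapin_fault_order(enabled, PMD_ORDER, 4));
        return 0;
}

(The numbers are arbitrary; the toy is only meant to spell out which caps I
have in mind, not how the kernel actually decides.)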