On 04/11/2024 06:42, Huang, Ying wrote:
> Johannes Weiner <hannes@xxxxxxxxxxx> writes:
>
>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>>>>>>>> then the mitigation being tried here wont apply.
>>>>>>>
>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>>>>>>> entire system is under memory pressure. I guess we would want the same
>>>>>>> "mitigation" either way.
>>>>>>>
>>>>>> What would be a good open source benchmark/workload to test without limiting memory
>>>>>> in memcg?
>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
>>>>>> in cgroup and limit memory.max.
>>>>>
>>>>> You mean a benchmark that puts the entire system under memory
>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>>>>> you have, among other factors.
>>>>>
>>>>> What if you run the kernel build test in a VM? Then you can limit is
>>>>> size like a memcg, although you'd probably need to leave more room
>>>>> because the entire guest OS will also subject to the same limit.
>>>>>
>>>>
>>>> I had tried this, but the variance in time/zswap numbers was very high.
>>>> Much higher than the AMD numbers I posted in reply to Barry. So found
>>>> it very difficult to make comparison.
>>>
>>> Hmm yeah maybe more factors come into play with global memory
>>> pressure. I am honestly not sure how to test this scenario, and I
>>> suspect variance will be high anyway.
>>>
>>> We can just try to use whatever technique we use for the memcg limit
>>> though, if possible, right?
>>
>> You can boot a physical machine with mem=1G on the commandline, which
>> restricts the physical range of memory that will be initialized.
>> Double check /proc/meminfo after boot, because part of that physical
>> range might not be usable RAM.
>>
>> I do this quite often to test physical memory pressure with workloads
>> that don't scale up easily, like kernel builds.
>>
>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>>>>>>>> well.
>>
>> +1
>>
>> I also think there is too much focus on cgroup alone. The bigger issue
>> seems to be how much optimistic volume we swap in when we're under
>> pressure already. This applies to large folios and readahead; global
>> memory availability and cgroup limits.
>
> The current swap readahead logic is something like,
>
> 1. try readahead some pages for sequential access pattern, mark them as
> readahead
>
> 2. if these readahead pages get accessed before swapped out again,
> increase 'hits' counter
>
> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
>
> So, if there's heavy memory pressure, the readaheaded pages will not be
> accessed before being swapped out again (in 2 above), the readahead
> pages will be minimal.
>
> IMHO, mTHP swap-in is kind of swap readahead in effect.
> That is, in addition to the pages accessed are swapped in, the adjacent pages are
> swapped in (swap readahead) too. If these readahead pages are not
> accessed before swapped out again, system runs into more severe
> thrashing. This is because we lack the swap readahead window scaling
> mechanism as above. And, this is why I suggested to combine the swap
> readahead mechanism and mTHP swap-in by default before. That is, when
> kernel swaps in a page, it checks current swap readahead window, and
> decides mTHP order according to window size. So, if there are heavy
> memory pressure, so that the nearby pages will not be accessed before
> being swapped out again, the mTHP swap-in order can be adjusted
> automatically.

This is a good idea, but I think the issue is that readahead is a folio
flag and not a per-page flag, so it only works when the folio size is 1.
In the swapin_readahead swapcache path, the current implementation
decides the ra_window based on hits, which is incremented in
swap_cache_get_folio if the folio has not been fetched from the
swapcache before. The problem is that we would need to know how many
distinct pages of an already swapped-in large folio have actually been
accessed in order to decide the hits/window size, and I don't think
that is possible: once the entire large folio has been swapped in, we
won't get any further faults for it. (I've put rough sketches of the
hits-based window scaling and the order-from-window idea at the bottom
of this mail.)

>
>> It happens to manifest with THP in cgroups because that's what you
>> guys are testing. But IMO, any solution to this problem should
>> consider the wider scope.
>>
>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>>>>>>> not.
>>>>>>>
>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being
>>>>>> seen here as there would be in swapin_readahead?
>>>>>
>>>>> Maybe readahead is not as aggressive in general as large folio
>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>>>>> of the window is the smaller of page_cluster (2 or 3) and
>>>>> SWAP_RA_ORDER_CEILING (5).
>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>>>> be similar to enabling 32K mTHP?
>>>
>>> Not quite.
>>
>> Actually, I would expect it to be...
>
> Me too.
>
>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
>>>>> contiguous allocation like large folio swapin. So that could be
>>>>> another factor why readahead may not reproduce the problem.
>>>
>>> Because of this ^.
>>
>> ...this matters for the physical allocation, which might require more
>> reclaim and compaction to produce the 32k. But an earlier version of
>> Barry's patch did the cgroup margin fallback after the THP was already
>> physically allocated, and it still helped.
>>
>> So the issue in this test scenario seems to be mostly about cgroup
>> volume. And then 8 4k charges should be equivalent to a singular 32k
>> charge when it comes to cgroup pressure.
>
> --
> Best Regards,
> Huang, Ying
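
For completeness, here is what I mean by the hits-based window scaling,
as a standalone userspace toy model rather than the actual
swap_vma_ra_win()/swap_cache_get_folio() code; the names and constants
below are made up for illustration only:

    /*
     * Toy model of the swap readahead window scaling described above:
     * the window only grows if previously readahead pages were actually
     * faulted back in ("hits"), and it is capped by
     * min(page_cluster, SWAP_RA_ORDER_CEILING).
     */
    #include <stdio.h>

    #define PAGE_CLUSTER        3   /* assumed /proc/sys/vm/page-cluster value */
    #define RA_ORDER_CEILING    5   /* stands in for SWAP_RA_ORDER_CEILING */

    static unsigned int ra_hits;

    /* Called when a fault hits a page that was brought in by readahead. */
    static void ra_hit(void)
    {
            ra_hits++;
    }

    /* Decide how many pages to bring in for the next fault, then reset hits. */
    static unsigned int ra_window(void)
    {
            unsigned int max_order = PAGE_CLUSTER < RA_ORDER_CEILING ?
                                     PAGE_CLUSTER : RA_ORDER_CEILING;
            unsigned int max_win = 1u << max_order;
            unsigned int win = ra_hits + 1; /* at least the faulting page itself */

            if (win > max_win)
                    win = max_win;
            ra_hits = 0;
            return win;
    }

    int main(void)
    {
            printf("no hits -> window %u page(s)\n", ra_window()); /* 1 */
            ra_hit();
            ra_hit();
            ra_hit();
            printf("3 hits  -> window %u page(s)\n", ra_window()); /* 4 */
            return 0;
    }

So under sustained pressure the window decays back towards a single
page, which is the self-throttling behaviour Ying points out above.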
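
And a similarly hand-wavy sketch of the order-from-window idea, i.e.
picking the mTHP swap-in order from whatever window the readahead
heuristic would have chosen; mthp_order_from_window() is a made-up name
for illustration, not an existing kernel function:

    /*
     * Illustration only: derive an mTHP swap-in order from the current
     * readahead window. Under heavy pressure the window collapses to 1,
     * so the order collapses to 0 and we fall back to 4K swap-in.
     */
    #include <stdio.h>

    static unsigned int mthp_order_from_window(unsigned int win,
                                               unsigned int max_order)
    {
            unsigned int order = 0;

            while (order < max_order && (2u << order) <= win)
                    order++;
            return order;
    }

    int main(void)
    {
            printf("win 1 -> order %u\n", mthp_order_from_window(1, 4)); /* 0: 4K  */
            printf("win 4 -> order %u\n", mthp_order_from_window(4, 4)); /* 2: 16K */
            printf("win 8 -> order %u\n", mthp_order_from_window(8, 4)); /* 3: 32K */
            return 0;
    }

With something along these lines, the 8-page (order 3) readahead I was
seeing would behave roughly like a 32K mTHP swap-in, which matches the
comparison made earlier in the thread.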