On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Usama Arif <usamaarif642@xxxxxxxxx> writes:
>
> > On 04/11/2024 06:42, Huang, Ying wrote:
> >> Johannes Weiner <hannes@xxxxxxxxxxx> writes:
> >>
> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> >>>>>>>>> then the mitigation being tried here won't apply.
> >>>>>>>>
> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
> >>>>>>>> entire system is under memory pressure. I guess we would want the same
> >>>>>>>> "mitigation" either way.
> >>>>>>>>
> >>>>>>> What would be a good open source benchmark/workload to test without limiting
> >>>>>>> memory in memcg?
> >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
> >>>>>>> in a cgroup and limit memory.max.
> >>>>>>
> >>>>>> You mean a benchmark that puts the entire system under memory
> >>>>>> pressure? I am not sure, it ultimately depends on the size of memory
> >>>>>> you have, among other factors.
> >>>>>>
> >>>>>> What if you run the kernel build test in a VM? Then you can limit its
> >>>>>> size like a memcg, although you'd probably need to leave more room
> >>>>>> because the entire guest OS will also be subject to the same limit.
> >>>>>>
> >>>>>
> >>>>> I had tried this, but the variance in time/zswap numbers was very high,
> >>>>> much higher than the AMD numbers I posted in reply to Barry, so I found
> >>>>> it very difficult to make comparisons.
> >>>>
> >>>> Hmm, yeah, maybe more factors come into play with global memory
> >>>> pressure. I am honestly not sure how to test this scenario, and I
> >>>> suspect variance will be high anyway.
> >>>>
> >>>> We can just try to use whatever technique we use for the memcg limit
> >>>> though, if possible, right?
> >>>
> >>> You can boot a physical machine with mem=1G on the command line, which
> >>> restricts the physical range of memory that will be initialized.
> >>> Double-check /proc/meminfo after boot, because part of that physical
> >>> range might not be usable RAM.
> >>>
> >>> I do this quite often to test physical memory pressure with workloads
> >>> that don't scale up easily, like kernel builds.
> >>>
> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
> >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
> >>>>>>>>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> >>>>>>>>> well.
> >>>
> >>> +1
> >>>
> >>> I also think there is too much focus on cgroup alone. The bigger issue
> >>> seems to be how much optimistic volume we swap in when we're under
> >>> pressure already. This applies to large folios and readahead, and to
> >>> global memory availability and cgroup limits alike.
> >>
> >> The current swap readahead logic is something like:
> >>
> >> 1. try to readahead some pages for a sequential access pattern, and mark
> >> them as readahead
> >>
> >> 2. if these readahead pages get accessed before being swapped out again,
> >> increase the 'hits' counter
> >>
> >> 3. for the next swap-in, try to readahead 'hits' pages and clear 'hits'.
> >>
> >> So, if there's heavy memory pressure, the readahead pages will not be
> >> accessed before being swapped out again (step 2 above), and the readahead
> >> window will stay minimal.
> >>
> >> IMHO, mTHP swap-in is a kind of swap readahead in effect. That is, in
> >> addition to the accessed page being swapped in, the adjacent pages are
> >> swapped in (read ahead) too. If these readahead pages are not accessed
> >> before being swapped out again, the system runs into more severe
> >> thrashing. This is because we lack the swap readahead window scaling
> >> mechanism described above. And this is why I suggested before to combine
> >> the swap readahead mechanism and mTHP swap-in by default. That is, when
> >> the kernel swaps in a page, it checks the current swap readahead window
> >> and decides the mTHP order according to the window size. So, if memory
> >> pressure is heavy enough that the nearby pages will not be accessed
> >> before being swapped out again, the mTHP swap-in order can be adjusted
> >> automatically.
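If I understand the suggestion correctly, the combination could look roughly
like the sketch below. This is purely illustrative C pseudocode: the helpers
swapin_ra_window() and swapin_order_from_window() are made-up names, not
existing kernel functions.

/*
 * Illustrative sketch only: scale the readahead window from 'hits'
 * (steps 1-3 above), then derive an mTHP swap-in order from the window.
 */
static unsigned int swapin_ra_window(unsigned int hits, unsigned int max_win)
{
	/* No recent readahead hits -> shrink back to a single page. */
	unsigned int win = hits ? 2 * hits : 1;

	return win < max_win ? win : max_win;
}

static unsigned int swapin_order_from_window(unsigned int win)
{
	/* A window of N pages maps to an order-log2(N) folio, 0 for N == 1. */
	unsigned int order = 0;

	while ((2U << order) <= win)
		order++;
	return order;
}

The point being that under thrashing 'hits' collapses, so the swap-in order
falls back to 0, while a well-behaved access pattern can still grow the
window (and therefore the folio order) up to the ceiling.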
> >
> > This is a good idea to do, but I think the issue is that readahead
> > is a folio flag and not a page flag, so it only works when the folio size is 1.
> >
> > In the swapin_readahead swapcache path, the current implementation decides
> > the ra_window based on hits, which is incremented in swap_cache_get_folio
> > if the folio has not been gotten from the swapcache before.
> > The problem would be that we need information on how many distinct pages in
> > a large folio that has been swapped in have been accessed to decide the
> > hits/window size, which I don't think is possible, as once the entire large
> > folio has been swapped in, we won't get a fault.
> >
>
> To do that, we need to move the readahead flag from per-folio to per-page.
> And we need to map only the accessed page of the folio in the page fault
> handler. This may impact performance. So, we may do that only for sampled
> folios, for example, every 100 folios.

I'm not entirely sure there's a chance to gain traction on this, as the
current trend clearly leans toward moving flags from page to folio, not
from folio to page :-)

>
> >>
> >>> It happens to manifest with THP in cgroups because that's what you
> >>> guys are testing. But IMO, any solution to this problem should
> >>> consider the wider scope.
> >>>
> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am
> >>>>>>>> also not sure if the readahead window adjusts on memory pressure or
> >>>>>>>> not.
> >>>>>>>>
> >>>>>>> The readahead window doesn't look at memory pressure. So maybe the same thing is
> >>>>>>> being seen here as there would be in swapin_readahead?
> >>>>>>
> >>>>>> Maybe readahead is not as aggressive in general as large folio
> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> >>>>>> of the window is the smaller of page_cluster (2 or 3) and
> >>>>>> SWAP_RA_ORDER_CEILING (5).
> >>>>> Yes, I was seeing 8-page swap-ins (order 3) when testing. So it might
> >>>>> be similar to enabling 32K mTHP?
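For reference, the arithmetic behind that 8-page / order-3 figure, assuming
4K base pages and page_cluster = 3 (an illustrative sketch, not the actual
swap_vma_ra_win() code):

/* Illustrative only: the effective readahead ceiling discussed above. */
#define SWAP_RA_ORDER_CEILING	5

static unsigned int page_cluster = 3;	/* /proc/sys/vm/page-cluster */

static unsigned int max_ra_pages(void)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
			     page_cluster : SWAP_RA_ORDER_CEILING;

	return 1U << order;	/* 1 << 3 = 8 pages = 32K with 4K pages */
}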
> >>>>
> >>>> Not quite.
> >>>
> >>> Actually, I would expect it to be...
> >>
> >> Me too.
> >>
> >>>>>> Also, readahead will swap in 4k folios AFAICT, so we don't need a
> >>>>>> contiguous allocation like large folio swapin. So that could be
> >>>>>> another factor why readahead may not reproduce the problem.
> >>>>
> >>>> Because of this ^.
> >>>
> >>> ...this matters for the physical allocation, which might require more
> >>> reclaim and compaction to produce the 32k. But an earlier version of
> >>> Barry's patch did the cgroup margin fallback after the THP was already
> >>> physically allocated, and it still helped.
> >>>
> >>> So the issue in this test scenario seems to be mostly about cgroup
> >>> volume. And then eight 4k charges should be equivalent to a single 32k
> >>> charge when it comes to cgroup pressure.
>
> --
> Best Regards,
> Huang, Ying
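To make that charge-volume point concrete (illustrative arithmetic only,
assuming 4K base pages):

/* Illustrative only: the memcg sees the same charge volume either way. */
#define ASSUMED_PAGE_SIZE	4096UL

unsigned long charge_eight_4k_faults = 8 * ASSUMED_PAGE_SIZE;	/* 32768 bytes */
unsigned long charge_one_32k_folio   = ASSUMED_PAGE_SIZE << 3;	/* 32768 bytes */

Either way the memcg is charged 32K; the large folio just takes the whole
charge in one step instead of eight.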