Barry Song <21cnbao@xxxxxxxxx> writes: > On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: >> >> Usama Arif <usamaarif642@xxxxxxxxx> writes: >> >> > On 04/11/2024 06:42, Huang, Ying wrote: >> >> Johannes Weiner <hannes@xxxxxxxxxxx> writes: >> >> >> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: >> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote: >> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote: >> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote: >> >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way: >> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and >> >>>>>>>>> then the mitigation being tried here wont apply. >> >>>>>>>> >> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the >> >>>>>>>> entire system is under memory pressure. I guess we would want the same >> >>>>>>>> "mitigation" either way. >> >>>>>>>> >> >>>>>>> What would be a good open source benchmark/workload to test without limiting memory >> >>>>>>> in memcg? >> >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build >> >>>>>>> in cgroup and limit memory.max. >> >>>>>> >> >>>>>> You mean a benchmark that puts the entire system under memory >> >>>>>> pressure? I am not sure, it ultimately depends on the size of memory >> >>>>>> you have, among other factors. >> >>>>>> >> >>>>>> What if you run the kernel build test in a VM? Then you can limit is >> >>>>>> size like a memcg, although you'd probably need to leave more room >> >>>>>> because the entire guest OS will also subject to the same limit. >> >>>>>> >> >>>>> >> >>>>> I had tried this, but the variance in time/zswap numbers was very high. >> >>>>> Much higher than the AMD numbers I posted in reply to Barry. So found >> >>>>> it very difficult to make comparison. >> >>>> >> >>>> Hmm yeah maybe more factors come into play with global memory >> >>>> pressure. I am honestly not sure how to test this scenario, and I >> >>>> suspect variance will be high anyway. >> >>>> >> >>>> We can just try to use whatever technique we use for the memcg limit >> >>>> though, if possible, right? >> >>> >> >>> You can boot a physical machine with mem=1G on the commandline, which >> >>> restricts the physical range of memory that will be initialized. >> >>> Double check /proc/meminfo after boot, because part of that physical >> >>> range might not be usable RAM. >> >>> >> >>> I do this quite often to test physical memory pressure with workloads >> >>> that don't scale up easily, like kernel builds. >> >>> >> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead >> >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window >> >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time >> >>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as >> >>>>>>>>> well. >> >>> >> >>> +1 >> >>> >> >>> I also think there is too much focus on cgroup alone. The bigger issue >> >>> seems to be how much optimistic volume we swap in when we're under >> >>> pressure already. This applies to large folios and readahead; global >> >>> memory availability and cgroup limits. >> >> >> >> The current swap readahead logic is something like, >> >> >> >> 1. try readahead some pages for sequential access pattern, mark them as >> >> readahead >> >> >> >> 2. if these readahead pages get accessed before swapped out again, >> >> increase 'hits' counter >> >> >> >> 3. for next swap in, try readahead 'hits' pages and clear 'hits'. >> >> >> >> So, if there's heavy memory pressure, the readaheaded pages will not be >> >> accessed before being swapped out again (in 2 above), the readahead >> >> pages will be minimal. >> >> >> >> IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in >> >> addition to the pages accessed are swapped in, the adjacent pages are >> >> swapped in (swap readahead) too. If these readahead pages are not >> >> accessed before swapped out again, system runs into more severe >> >> thrashing. This is because we lack the swap readahead window scaling >> >> mechanism as above. And, this is why I suggested to combine the swap >> >> readahead mechanism and mTHP swap-in by default before. That is, when >> >> kernel swaps in a page, it checks current swap readahead window, and >> >> decides mTHP order according to window size. So, if there are heavy >> >> memory pressure, so that the nearby pages will not be accessed before >> >> being swapped out again, the mTHP swap-in order can be adjusted >> >> automatically. >> > >> > This is a good idea to do, but I think the issue is that readahead >> > is a folio flag and not a page flag, so only works when folio size is 1. >> > >> > In the swapin_readahead swapcache path, the current implementation decides >> > the ra_window based on hits, which is incremented in swap_cache_get_folio >> > if it has not been gotten from swapcache before. >> > The problem would be that we need information on how many distinct pages in >> > a large folio that has been swapped in have been accessed to decide the >> > hits/window size, which I don't think is possible. As once the entire large >> > folio has been swapped in, we won't get a fault. >> > >> >> To do that, we need to move readahead flag to per-page from per-folio. >> And we need to map only the accessed page of the folio in page fault >> handler. This may impact performance. So, we may only do that for >> sampled folios only, for example, every 100 folios. > > I'm not entirely sure there's a chance to gain traction on this, as the current > trend clearly leans toward moving flags from page to folio, not from folio to > page :-) This may be a problem. However, I think we can try to find a solution for this. Anyway, we need some way to track per-page status in a folio, regardless how to implement it. >> >> >> >> >>> It happens to manifest with THP in cgroups because that's what you >> >>> guys are testing. But IMO, any solution to this problem should >> >>> consider the wider scope. >> >>> >> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am >> >>>>>>>> also not sure if the readahead window adjusts on memory pressure or >> >>>>>>>> not. >> >>>>>>>> >> >>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being >> >>>>>>> seen here as there would be in swapin_readahead? >> >>>>>> >> >>>>>> Maybe readahead is not as aggressive in general as large folio >> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order >> >>>>>> of the window is the smaller of page_cluster (2 or 3) and >> >>>>>> SWAP_RA_ORDER_CEILING (5). >> >>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might >> >>>>> be similar to enabling 32K mTHP? >> >>>> >> >>>> Not quite. >> >>> >> >>> Actually, I would expect it to be... >> >> >> >> Me too. >> >> >> >>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a >> >>>>>> contiguous allocation like large folio swapin. So that could be >> >>>>>> another factor why readahead may not reproduce the problem. >> >>>> >> >>>> Because of this ^. >> >>> >> >>> ...this matters for the physical allocation, which might require more >> >>> reclaim and compaction to produce the 32k. But an earlier version of >> >>> Barry's patch did the cgroup margin fallback after the THP was already >> >>> physically allocated, and it still helped. >> >>> >> >>> So the issue in this test scenario seems to be mostly about cgroup >> >>> volume. And then 8 4k charges should be equivalent to a singular 32k >> >>> charge when it comes to cgroup pressure. -- Best Regards, Huang, Ying