Barry Song <21cnbao@xxxxxxxxx> writes: > On Mon, Nov 4, 2024 at 7:46 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: >> >> Johannes Weiner <hannes@xxxxxxxxxxx> writes: >> >> > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: >> >> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote: >> >> > On 30/10/2024 21:01, Yosry Ahmed wrote: >> >> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote: >> >> > >>>> I am not sure that the approach we are trying in this patch is the right way: >> >> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and >> >> > >>>> then the mitigation being tried here wont apply. >> >> > >>> >> >> > >>> Is the problem reproducible without memcg? I imagine only if the >> >> > >>> entire system is under memory pressure. I guess we would want the same >> >> > >>> "mitigation" either way. >> >> > >>> >> >> > >> What would be a good open source benchmark/workload to test without limiting memory >> >> > >> in memcg? >> >> > >> For the kernel build test, I can only get zswap activity to happen if I build >> >> > >> in cgroup and limit memory.max. >> >> > > >> >> > > You mean a benchmark that puts the entire system under memory >> >> > > pressure? I am not sure, it ultimately depends on the size of memory >> >> > > you have, among other factors. >> >> > > >> >> > > What if you run the kernel build test in a VM? Then you can limit is >> >> > > size like a memcg, although you'd probably need to leave more room >> >> > > because the entire guest OS will also subject to the same limit. >> >> > > >> >> > >> >> > I had tried this, but the variance in time/zswap numbers was very high. >> >> > Much higher than the AMD numbers I posted in reply to Barry. So found >> >> > it very difficult to make comparison. >> >> >> >> Hmm yeah maybe more factors come into play with global memory >> >> pressure. I am honestly not sure how to test this scenario, and I >> >> suspect variance will be high anyway. >> >> >> >> We can just try to use whatever technique we use for the memcg limit >> >> though, if possible, right? >> > >> > You can boot a physical machine with mem=1G on the commandline, which >> > restricts the physical range of memory that will be initialized. >> > Double check /proc/meminfo after boot, because part of that physical >> > range might not be usable RAM. >> > >> > I do this quite often to test physical memory pressure with workloads >> > that don't scale up easily, like kernel builds. >> > >> >> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead >> >> > >>>> issue? If we zswap (without the large folio swapin series) and change the window >> >> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time >> >> > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as >> >> > >>>> well. >> > >> > +1 >> > >> > I also think there is too much focus on cgroup alone. The bigger issue >> > seems to be how much optimistic volume we swap in when we're under >> > pressure already. This applies to large folios and readahead; global >> > memory availability and cgroup limits. >> >> The current swap readahead logic is something like, >> >> 1. try readahead some pages for sequential access pattern, mark them as >> readahead >> >> 2. if these readahead pages get accessed before swapped out again, >> increase 'hits' counter >> >> 3. for next swap in, try readahead 'hits' pages and clear 'hits'. >> >> So, if there's heavy memory pressure, the readaheaded pages will not be >> accessed before being swapped out again (in 2 above), the readahead >> pages will be minimal. >> >> IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in >> addition to the pages accessed are swapped in, the adjacent pages are >> swapped in (swap readahead) too. If these readahead pages are not >> accessed before swapped out again, system runs into more severe >> thrashing. This is because we lack the swap readahead window scaling >> mechanism as above. And, this is why I suggested to combine the swap >> readahead mechanism and mTHP swap-in by default before. That is, when >> kernel swaps in a page, it checks current swap readahead window, and >> decides mTHP order according to window size. So, if there are heavy >> memory pressure, so that the nearby pages will not be accessed before >> being swapped out again, the mTHP swap-in order can be adjusted >> automatically. > > This might help reduce memory reclamation thrashing for kernel build > workload running in a memory limited memcg which might not benefit > from mTHP that much. But this mechanism has clear disadvantages: > > 1. Loss of large folios: For example, if you're using app A and then switch > to app B, a significant portion (around 60%) of A's memory might be swapped > out as it moves to the background. When you switch back to app A, a large > portion of the memory originally in mTHP could be lost while swapping > in small folios. Why? Most app A pages could be swapped in as mTHP if all readahead pages are accessed before being swapped out again. > Essentially, systems with a user interface operate quite differently from kernel > build workloads running under a memory-limited memcg, as they switch > applications between the foreground and background. > > 2. Fragmentation of swap slots: This fragmentation increases the likelihood of > mTHP swapout failures, as it makes it harder to maintain contiguous memory > blocks in swap. > > 3. Prevent the implementation of large block compression and decompression > to achieve higher compression ratios and significantly lower CPU consumption, > as small folio swap-ins may still remain the predominant approach. > > Memory-limited systems often face challenges with larger page sizes. Even on > systems that support various base page sizes, such as 4KB, 16KB, and 64KB > on ARM64, using 16KB or 64KB as the base page size is not always the best > choice. With mTHP, we have already enabled per-size settings. For this kernel > build workload operating within a limited memcg, enabling only 16kB is likely > the best option for optimizing performance and minimizing thrashing. > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > > We could focus on mTHP and seek strategies to minimize thrashing when free > memory is severely limited :-) IIUC, mTHP can improve performance but may waste memory too. The best mTHP order may be different for different workloads and system states (e.g. high memory pressure). So, we need a way to identify the best mTHP order automatically. Swap readahead window size auto-scaling can be used to design a method to identify the best mTHP order for swap-in. I'm open to other possible methods too. I admit a fixed mTHP order can be best for some specific workloads or system states. However, it may be hard to find one mTHP order that works well for all workloads and system states. >> > It happens to manifest with THP in cgroups because that's what you >> > guys are testing. But IMO, any solution to this problem should >> > consider the wider scope. >> > >> >> > >>> I think large folio swapin would make the problem worse anyway. I am >> >> > >>> also not sure if the readahead window adjusts on memory pressure or >> >> > >>> not. >> >> > >>> >> >> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being >> >> > >> seen here as there would be in swapin_readahead? >> >> > > >> >> > > Maybe readahead is not as aggressive in general as large folio >> >> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order >> >> > > of the window is the smaller of page_cluster (2 or 3) and >> >> > > SWAP_RA_ORDER_CEILING (5). >> >> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might >> >> > be similar to enabling 32K mTHP? >> >> >> >> Not quite. >> > >> > Actually, I would expect it to be... >> >> Me too. >> >> >> > > Also readahead will swapin 4k folios AFAICT, so we don't need a >> >> > > contiguous allocation like large folio swapin. So that could be >> >> > > another factor why readahead may not reproduce the problem. >> >> >> >> Because of this ^. >> > >> > ...this matters for the physical allocation, which might require more >> > reclaim and compaction to produce the 32k. But an earlier version of >> > Barry's patch did the cgroup margin fallback after the THP was already >> > physically allocated, and it still helped. >> > >> > So the issue in this test scenario seems to be mostly about cgroup >> > volume. And then 8 4k charges should be equivalent to a singular 32k >> > charge when it comes to cgroup pressure. >> -- Best Regards, Huang, Ying