On 30/10/2024 20:48, Barry Song wrote:
> On Thu, Oct 31, 2024 at 9:41 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>> On 30/10/2024 20:27, Barry Song wrote:
>>> On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>
>>>> On 28/10/2024 22:03, Barry Song wrote:
>>>>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On 27/10/2024 01:14, Barry Song wrote:
>>>>>>> From: Barry Song <v-songbaohua@xxxxxxxx>
>>>>>>>
>>>>>>> Always using mTHP in a memcg, even when it is at full capacity, might
>>>>>>> not be the best option. Consider a system that uses only small folios:
>>>>>>> after each reclamation, a process has at least SWAP_CLUSTER_MAX pages
>>>>>>> of buffer space before it can initiate the next reclamation. However,
>>>>>>> large folios can quickly fill this space, rapidly bringing the memcg
>>>>>>> back to full capacity, even though some portions of the large folios
>>>>>>> may not be immediately needed or used by the process.
>>>>>>>
>>>>>>> Usama and Kanchana identified a regression when building the kernel in
>>>>>>> a memcg with memory.max set to a small value while enabling large
>>>>>>> folio swap-in support on zswap [1].
>>>>>>>
>>>>>>> The issue arises from an edge case where the memory cgroup remains
>>>>>>> nearly full most of the time. Consequently, bringing in mTHP can
>>>>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
>>>>>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
>>>>>>>
>>>>>>> We need a mechanism to stop the cup from overflowing continuously.
>>>>>>> One potential solution is to slow the filling process once we identify
>>>>>>> that the cup is nearly full.
>>>>>>>
>>>>>>> Usama reported an improvement when we mitigate mTHP swap-in as the
>>>>>>> memcg approaches full capacity [2]:
>>>>>>>
>>>>>>> int mem_cgroup_swapin_charge_folio(...)
>>>>>>> {
>>>>>>> 	...
>>>>>>> 	if (folio_test_large(folio) &&
>>>>>>> 	    mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
>>>>>>> 		ret = -ENOMEM;
>>>>>>> 	else
>>>>>>> 		ret = charge_memcg(folio, memcg, gfp);
>>>>>>> 	...
>>>>>>> }
>>>>>>>
>>>>>>> AMD 16K+32K THP=always
>>>>>>> metric        mm-unstable    mm-unstable +        mm-unstable + large folio
>>>>>>>                              large folio          zswapin + no swap
>>>>>>>                              zswapin series       thrashing fix
>>>>>>> real          1m23.038s      1m23.050s            1m22.704s
>>>>>>> user          53m57.210s     53m53.437s           53m52.577s
>>>>>>> sys           7m24.592s      7m48.843s            7m22.519s
>>>>>>> zswpin        612070         999244               815934
>>>>>>> zswpout       2226403        2347979              2054980
>>>>>>> pgfault       20667366       20481728             20478690
>>>>>>> pgmajfault    385887         269117               309702
>>>>>>>
>>>>>>> AMD 16K+32K+64K THP=always
>>>>>>> metric        mm-unstable    mm-unstable +        mm-unstable + large folio
>>>>>>>                              large folio          zswapin + no swap
>>>>>>>                              zswapin series       thrashing fix
>>>>>>> real          1m22.975s      1m23.266s            1m22.549s
>>>>>>> user          53m51.302s     53m51.069s           53m46.471s
>>>>>>> sys           7m40.168s      7m57.104s            7m25.012s
>>>>>>> zswpin        676492         1258573              1225703
>>>>>>> zswpout       2449839        2714767              2899178
>>>>>>> pgfault       17540746       17296555             17234663
>>>>>>> pgmajfault    429629         307495               287859
>>>>>>>
>>>>>>> I wonder if we can extend the mitigation to do_anonymous_page() as
>>>>>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
>>>>>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
>>>>>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM device while running
>>>>>>> kernel builds in a memcg with memory.max set to 1 GiB.
>>>>>>>
>>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>>>>>
>>>>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>>>>>   CROSS_COMPILE=aarch64-linux-gnu- Image -j10 1>/dev/null 2>/dev/null
>>>>>>>
>>>>>>>              disable-mTHP-swapin   mm-unstable   with-this-patch
>>>>>>> Real:        6m54.595s             7m4.832s      6m45.811s
>>>>>>> User:        66m42.795s            66m59.984s    67m21.150s
>>>>>>> Sys:         12m7.092s             15m18.153s    12m52.644s
>>>>>>> pswpin:      4262327               11723248      5918690
>>>>>>> pswpout:     14883774              19574347      14026942
>>>>>>> 64k-swpout:  624447                889384        480039
>>>>>>> 32k-swpout:  115473                242288        73874
>>>>>>> 16k-swpout:  158203                294672        109142
>>>>>>> 64k-swpin:   0                     495869        159061
>>>>>>> 32k-swpin:   0                     219977        56158
>>>>>>> 16k-swpin:   0                     223501        81445
>>>>>>>
>>>>>>
>>>>>
>>>>> Hi Usama,
>>>>>
>>>>>> hmm, both the user and sys time are worse with the patch compared to
>>>>>> disable-mTHP-swapin. I wonder if the real time is an anomaly, and if
>>>>>> you repeat the experiment the real time might be worse as well?
>>>>>
>>>>> Well, I've improved my script to include a loop:
>>>>>
>>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>>>
>>>>> for ((i=1; i<=100; i++))
>>>>> do
>>>>> 	echo "Executing round $i"
>>>>> 	make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>>>>> 	echo 3 > /proc/sys/vm/drop_caches
>>>>> 	time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>>> 		CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
>>>>> 	cat /proc/vmstat | grep pswp
>>>>> 	echo -n 64k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
>>>>> 	echo -n 32k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
>>>>> 	echo -n 16k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
>>>>> 	echo -n 64k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
>>>>> 	echo -n 32k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
>>>>> 	echo -n 16k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
>>>>> done
>>>>>
>>>>> I've noticed that the user/sys/real time on my i9 machine fluctuates
>>>>> constantly, with results like:
>>>>>
>>>>> real	6m52.087s
>>>>> user	67m12.463s
>>>>> sys	13m8.281s
>>>>> ...
>>>>> real	7m42.937s
>>>>> user	66m55.250s
>>>>> sys	12m56.330s
>>>>> ...
>>>>> real	6m49.374s
>>>>> user	66m37.040s
>>>>> sys	12m44.542s
>>>>> ...
>>>>> real	6m54.205s
>>>>> user	65m49.732s
>>>>> sys	11m33.078s
>>>>> ...
>>>>>
>>>>> likely due to unstable temperatures and I/O latency. As a result, my
>>>>> data doesn't seem reference-worthy.
>>>>
>>>> So I had suggested retrying the experiment to see how reproducible it
>>>> is, but had not done that myself! Thanks for sharing this. I tried
>>>> many times on the AMD server and I see varying numbers as well.
>>>>
>>>> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
>>>> real	1m28.351s
>>>> user	54m14.476s
>>>> sys	8m46.596s
>>>> zswpin	811693
>>>> zswpout	2137310
>>>> pgfault	27344671
>>>> pgmajfault	290510
>>>> ..
>>>> real	1m24.557s
>>>> user	53m56.815s
>>>> sys	8m10.200s
>>>> zswpin	571532
>>>> zswpout	1645063
>>>> pgfault	26989075
>>>> pgmajfault	205177
>>>> ..
>>>> real	1m26.083s
>>>> user	54m5.303s
>>>> sys	9m55.247s
>>>> zswpin	1176292
>>>> zswpout	2910825
>>>> pgfault	27286835
>>>> pgmajfault	419746
>>>>
>>>> The sys time especially can vary by a large amount. I think you are
>>>> seeing the same.
>>>>
>>>>> As phone engineers, we never use phones to run kernel builds. I'm
>>>>> also quite certain that phones won't provide stable and reliable data
>>>>> for this type of workload. Without access to a Linux server to
>>>>> conduct the test, I really need your help.
>>>>>
>>>>> I used to work on optimizing the ARM server scheduler and memory
>>>>> management, and I really miss that machine I had until three years ago :-)
>>>>>
>>>>>>
>>>>>>> I need Usama's assistance to identify a suitable patch, as I lack
>>>>>>> access to hardware such as AMD machines and ARM servers with TLB
>>>>>>> optimization.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@xxxxxxxxx/
>>>>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@xxxxxxxxx/
>>>>>>>
>>>>>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
>>>>>>> Cc: Usama Arif <usamaarif642@xxxxxxxxx>
>>>>>>> Cc: David Hildenbrand <david@xxxxxxxxxx>
>>>>>>> Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
>>>>>>> Cc: Chris Li <chrisl@xxxxxxxxxx>
>>>>>>> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
>>>>>>> Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
>>>>>>> Cc: Kairui Song <kasong@xxxxxxxxxxx>
>>>>>>> Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>>>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>>>>>>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>>>>>>> Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
>>>>>>> Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
>>>>>>> Cc: Muchun Song <muchun.song@xxxxxxxxx>
>>>>>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
>>>>>>> ---
>>>>>>>  include/linux/memcontrol.h |  9 ++++++++
>>>>>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
>>>>>>>  mm/memory.c                | 17 ++++++++++++++
>>>>>>>  3 files changed, 71 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>>>>>> index 524006313b0d..8bcc8f4af39f 100644
>>>>>>> --- a/include/linux/memcontrol.h
>>>>>>> +++ b/include/linux/memcontrol.h
>>>>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>>>>>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>>>>>  				  long nr_pages);
>>>>>>>  
>>>>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>>>>>> +				     swp_entry_t *entry);
>>>>>>> +
>>>>>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>>>>>>>  				   gfp_t gfp, swp_entry_t entry);
>>>>>>>  
>>>>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
>>>>>>>  	return 0;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>>>>>> +						   swp_entry_t *entry)
>>>>>>> +{
>>>>>>> +	return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>>>>>>>  		struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>>>>>>>  {
>>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>>> index 17af08367c68..f3d92b93ea6d 100644
>>>>>>> --- a/mm/memcontrol.c
>>>>>>> +++ b/mm/memcontrol.c
>>>>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>>>>>  	return 0;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
>>>>>>> +{
>>>>>>> +	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>>>>>> +		if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
>>>>>>
>>>>>> There might be three issues with this approach:
>>>>>>
>>>>>> It's a very big margin. Let's say you have ARM64_64K_PAGES and you
>>>>>> have 256K THP set to always. As HPAGE_PMD_SIZE is 512M with 64K
>>>>>> pages, you are basically saying you need 512M of free memory to swap
>>>>>> in just 256K?
>>>>>
>>>>> Right, sorry for the noisy code. I was just thinking about 4KB pages
>>>>> and wondering if we could simplify the code.
>>>>>
>>>>>> It's an uneven margin for different folio sizes. For a 16K folio
>>>>>> swap-in, you are checking if there is margin for 128 folios, but for
>>>>>> a 1M folio swap-in, you are checking if there is margin for just 2
>>>>>> folios.
>>>>>>
>>>>>> Maybe it might be better to make this dependent on some factor of
>>>>>> folio_nr_pages? (I sketch what I mean further down.)
>>>>>
>>>>> Agreed. This is similar to what we discussed regarding your zswap
>>>>> mTHP swap-in series:
>>>>>
>>>>> int mem_cgroup_swapin_charge_folio(...)
>>>>> {
>>>>> 	...
>>>>> 	if (folio_test_large(folio) &&
>>>>> 	    mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
>>>>> 					   folio_nr_pages(folio)))
>>>>> 		ret = -ENOMEM;
>>>>> 	else
>>>>> 		ret = charge_memcg(folio, memcg, gfp);
>>>>> 	...
>>>>> }
>>>>>
>>>>> As someone focused on phones, my challenge is the absence of stable
>>>>> platforms to benchmark this type of workload. If possible, Usama, I
>>>>> would greatly appreciate it if you could take the lead on the patch.
>>>>>
>>>>>> As Johannes pointed out, the charging code already does the margin
>>>>>> check. So for 4K, the check just checks if there is 4K available,
>>>>>> but for 16K it checks if a lot more than 16K is available. Maybe
>>>>>> there should be a similar policy for all? I guess this is similar to
>>>>>> my second point, but it considers 4K folios as well.
>>>>>
>>>>> I don't think the charging code performs a margin check. It simply
>>>>> tries to charge the specified nr_pages (whether 1 or more). If
>>>>> nr_pages are available, the charge proceeds; otherwise, if the GFP
>>>>> flags allow blocking, it triggers memory reclamation to reclaim
>>>>> max(SWAP_CLUSTER_MAX, nr_pages) base pages.
>>>>
>>>> So if you have defrag not set to always, it will not trigger
>>>> reclamation. I think that is the bigger use case, i.e.
>>>> defrag=madvise,defer,etc. is probably used much more than always.
>>>>
>>>> In the current code in that case, try_charge_memcg will return
>>>> -ENOMEM all the way to mem_cgroup_swapin_charge_folio, and
>>>> alloc_swap_folio will then try the next order. So even though it
>>>> might not be calling the mem_cgroup_margin function, it is kind of
>>>> doing the same thing?
>>>>
>>>>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages
>>>>> available, a large folio with nr_pages == SWAP_CLUSTER_MAX will
>>>>> successfully charge, immediately filling the memcg.
>>>>>
>>>>> Shortly after, smaller folios, typically with blockable GFP, will
>>>>> quickly trigger additional reclamation. While nr_pages - 1 subpages
>>>>> of the large folio may not be immediately needed, they still occupy
>>>>> enough space to fill the memcg to capacity.
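To make the folio_nr_pages scaling I suggested above concrete, here is
roughly what I had in mind; an untested sketch only.
mem_cgroup_has_margin_for() is a made-up name, the factor of 2 is an
arbitrary placeholder, and mem_cgroup_margin() is currently static to
mm/memcontrol.c, so something like this would have to live there:

static inline bool mem_cgroup_has_margin_for(struct mem_cgroup *memcg,
					     unsigned int nr_pages)
{
	/*
	 * Require headroom proportional to the folio being swapped in,
	 * so 16K and 1M folios are treated consistently instead of
	 * using a fixed HPAGE_PMD_NR margin for every size.
	 */
	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
		if (mem_cgroup_margin(memcg) < 2 * nr_pages)
			return false;
	}
	return true;
}

A caller like alloc_swap_folio() could pass 1 << order and fall back to
the next smaller order whenever this returns false.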
>>>>>
>>>>> My second point about the mitigation is as follows: for a system (or
>>>>> memcg) under severe memory pressure, especially one without hardware
>>>>> TLB optimization, is enabling mTHP always the right choice? Since
>>>>> mTHP operates at a larger granularity, some internal fragmentation
>>>>> is unavoidable, regardless of optimization. Could the mitigation
>>>>> code help in automatically tuning this fragmentation?
>>>>
>>>> I agree with the point that always enabling mTHP is not the right
>>>> thing to do on all platforms. I also think it might be the case that
>>>> enabling mTHP is a good thing for some workloads, but enabling mTHP
>>>> swap-in along with it might not be.
>>>>
>>>> As you said, when you have apps switching between foreground and
>>>> background in Android, it probably makes sense to have large folio
>>>> swapping, as you want to bring in all the pages from the background
>>>> app as quickly as possible, and you also get all the TLB
>>>> optimizations and the smaller LRU overhead after you have brought in
>>>> all the pages. A Linux kernel build test doesn't really get to
>>>> benefit from the TLB optimization and smaller LRU overhead, as the
>>>> pages are probably very short-lived. So I think it doesn't show the
>>>> benefit of large folio swap-in properly, and large folio swap-in
>>>> should probably be disabled for this kind of workload, even though
>>>> mTHP should be enabled.
>>>
>>> I'm not entirely sure if this applies to platforms without TLB
>>> optimization, especially in the absence of swap. In a memory-limited
>>> cgroup without swap, would mTHP still cause significant thrashing of
>>> file-backed folios? When a large swap file is present, the inability
>>> to swap in mTHP seems to act as a workaround for fragmentation,
>>> allowing fragmented pages of the original mTHP from
>>> do_anonymous_page() to remain in swap.
>>>
>>>> I am not sure that the approach we are trying in this patch is the
>>>> right way:
>>>> - This patch makes it a memcg issue, but you could have memcg
>>>>   disabled, and then the mitigation being tried here won't apply.
>>>> - Instead of this being a large folio swap-in issue, is it more of a
>>>>   readahead issue? If we zswap (without the large folio swap-in
>>>>   series) and change the window to 1 in swap_vma_readahead, we might
>>>>   see an improvement in Linux kernel build time when cgroup memory is
>>>>   limited, as readahead would probably cause swap thrashing as well.
>>>> - Instead of looking at cgroup margin, maybe we should try to look at
>>>>   the rate of change of workingset_restore_anon? (Rough sketch
>>>>   below.) This might be a lot more complicated to do, but it is
>>>>   probably the right metric to determine swap thrashing. It also
>>>>   means that this could be used in both the synchronous
>>>>   swapcache-skipping path and the swapin_readahead path.
>>>>   (Thanks Johannes for suggesting this.)
>>>>
>>>> With large folio swap-in, I do see the large improvement when
>>>> considering only swap-in performance and latency, in the same way as
>>>> you saw in zram. Maybe the right short-term approach is to have
>>>> /sys/kernel/mm/transparent_hugepage/swapin
>>>> and have that disabled by default to avoid the regression.
>>>
>>> A crucial component is still missing: managing the compression and
>>> decompression of multiple pages as a larger block. This could
>>> significantly reduce system time and potentially resolve the kernel
>>> build issue within a small memory cgroup, even with swap thrashing.
>>>
>>> I'll send an update ASAP so you can rebase for zswap.
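To make the workingset_restore_anon idea from my list above a bit more
concrete: a crude detector could sample the counter periodically and
back off from large folio swap-in while the restore rate is spiking.
A rough, untested sketch only; the function name and threshold are
invented, and the static locals keep a single global sample for brevity
(a real version would need per-memcg state and protection against
races):

static bool anon_swapin_thrashing(struct lruvec *lruvec)
{
	static unsigned long last_restore, last_sample;
	unsigned long restore = lruvec_page_state(lruvec,
						  WORKINGSET_RESTORE_ANON);
	unsigned long now = jiffies;
	bool thrashing = false;

	if (last_sample && time_after(now, last_sample)) {
		/* restored refaults per second since the last sample */
		unsigned long rate = (restore - last_restore) * HZ /
				     (now - last_sample);
		thrashing = rate > 1000; /* placeholder threshold */
	}
	last_restore = restore;
	last_sample = now;
	return thrashing;
}

alloc_swap_folio() could consult something like this and clamp the
swap-in order to 0 while it returns true, which would cover both the
synchronous path and swapin_readahead.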
>>
>> Did you mean
>> https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@xxxxxxxxx/?
>> That won't benefit zswap, right?
>
> That's right. I assume we can also make it work with zswap?

Hopefully yes. That's mainly why I was looking at that series, to try
and find a way to do something similar for zswap.

>
>> I actually had a few questions about it. Mainly, the benefit comes if
>> the page fault happens on page 0 of the large folio. But if the page
>> fault happens on any other page, let's say page 1 of a 64K folio, then
>> it will decompress the entire 64K chunk and just copy page 1 (the
>> memcpy in zram_bvec_read_multi_pages_partial)? Could that cause a
>> regression, as you have to decompress a large chunk just to get one
>> 4K page? If we assume a uniform distribution of page faults, maybe it
>> could make things worse?
>>
>> I probably should ask all of this in that thread.
>
> With mTHP swap-in, a page fault on any page behaves the same as a
> fault on page 0. Without mTHP swap-in, there's also no difference
> between faults on page 0 and other pages.

Ah ok, it's because of the ALIGN_DOWN in
https://elixir.bootlin.com/linux/v6.12-rc5/source/mm/memory.c#L4158,
right? (See the P.S. at the end.)

> A fault on any page means that the entire block is decompressed. The
> only difference is that we don't partially copy one page when mTHP
> swap-in is present.

Ah, so zram_bvec_read_multi_pages_partial would be called only if
someone swaps out mTHP, disables it, and then tries to swap in? Thanks!

>>
>>>
>>>> If the workload owner sees a benefit, they can enable it. I can add
>>>> this when sending the next version of large folio zswapin if that
>>>> makes sense?
>>>> Longer term, I can try to have a look at whether we can do something
>>>> with workingset_restore_anon to improve things.
>>>>
>>>> Thanks,
>>>> Usama
>
> Thanks
> Barry
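P.S. For anyone following along, the ALIGN_DOWN logic referenced above
boils down to something like the simplified sketch below (condensed for
illustration with a made-up helper name, not the exact mm-unstable
code): the faulting address is rounded down to the large-folio boundary
before swap-in, so a fault on any subpage behaves like a fault on
subpage 0, provided the whole aligned range fits within the VMA.

static bool fault_addr_fits_order(struct vm_fault *vmf, int order)
{
	unsigned long nr_bytes = PAGE_SIZE << order;
	/* round the faulting address down to the large folio boundary */
	unsigned long addr = ALIGN_DOWN(vmf->address, nr_bytes);

	/* the entire aligned range must lie in the VMA to use this order */
	return addr >= vmf->vma->vm_start &&
	       addr + nr_bytes <= vmf->vma->vm_end;
}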