Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg

Usama Arif <usamaarif642@xxxxxxxxx> · Wed, 30 Oct 2024 14:51:48 +0000

On 28/10/2024 22:03, Barry Song wrote:
> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>>
>>
>> On 27/10/2024 01:14, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@xxxxxxxx>
>>>
>>> In a memcg where mTHP is always utilized, even at full capacity, it
>>> might not be the best option. Consider a system that uses only small
>>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
>>> of buffer space before it can initiate the next reclamation. However,
>>> large folios can quickly fill this space, rapidly bringing the memcg
>>> back to full capacity, even though some portions of the large folios
>>> may not be immediately needed and used by the process.
>>>
>>> Usama and Kanchana identified a regression when building the kernel in
>>> a memcg with memory.max set to a small value while enabling large
>>> folio swap-in support on zswap[1].
>>>
>>> The issue arises from an edge case where the memory cgroup remains
>>> nearly full most of the time. Consequently, bringing in mTHP can
>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
>>>
>>> We need a mechanism to stop the cup from overflowing continuously.
>>> One potential solution is to slow the filling process when we identify
>>> that the cup is nearly full.
>>>
>>> Usama reported an improvement when we mitigate mTHP swap-in as the
>>> memcg approaches full capacity[2]:
>>>
>>> int mem_cgroup_swapin_charge_folio(...)
>>> {
>>>       ...
>>>       if (folio_test_large(folio) &&
>>>           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
>>>               ret = -ENOMEM;
>>>       else
>>>               ret = charge_memcg(folio, memcg, gfp);
>>>       ...
>>> }
>>>
>>> AMD 16K+32K THP=always
>>> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
>>> real           1m23.038s        1m23.050s                                   1m22.704s
>>> user           53m57.210s       53m53.437s                                  53m52.577s
>>> sys            7m24.592s        7m48.843s                                   7m22.519s
>>> zswpin         612070           999244                                      815934
>>> zswpout        2226403          2347979                                     2054980
>>> pgfault        20667366         20481728                                    20478690
>>> pgmajfault     385887           269117                                      309702
>>>
>>> AMD 16K+32K+64K THP=always
>>> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
>>> real           1m22.975s        1m23.266s                                  1m22.549s
>>> user           53m51.302s       53m51.069s                                 53m46.471s
>>> sys            7m40.168s        7m57.104s                                  7m25.012s
>>> zswpin         676492           1258573                                    1225703
>>> zswpout        2449839          2714767                                    2899178
>>> pgfault        17540746         17296555                                   17234663
>>> pgmajfault     429629           307495                                     287859
>>>
>>> I wonder if we can extend the mitigation to do_anonymous_page() as
>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
>>> builds in a memcg with memory.max set to 1 GiB.
>>>
>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>
>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
>>>
>>>             disable-mTHP-swapin  mm-unstable  with-this-patch
>>> Real:            6m54.595s      7m4.832s       6m45.811s
>>> User:            66m42.795s     66m59.984s     67m21.150s
>>> Sys:             12m7.092s      15m18.153s     12m52.644s
>>> pswpin:          4262327        11723248       5918690
>>> pswpout:         14883774       19574347       14026942
>>> 64k-swpout:      624447         889384         480039
>>> 32k-swpout:      115473         242288         73874
>>> 16k-swpout:      158203         294672         109142
>>> 64k-swpin:       0              495869         159061
>>> 32k-swpin:       0              219977         56158
>>> 16k-swpin:       0              223501         81445
>>>
>>
> 
> Hi Usama,
> 
>> hmm, both the user and sys time are worse with the patch compared to
>> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
>> repeat the experiment the real time might be worse as well?
> 
> Well, I've improved my script to include a loop:
> 
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> 
> for ((i=1; i<=100; i++))
> do
>   echo "Executing round $i"
>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>   echo 3 > /proc/sys/vm/drop_caches
>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
>   cat /proc/vmstat | grep pswp
>   echo -n 64k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
>   echo -n 32k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
>   echo -n 16k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
>   echo -n 64k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
>   echo -n 32k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
>   echo -n 16k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
> done
> 
> I've noticed that the user/sys/real time on my i9 machine fluctuates
> constantly; individual runs can look like:
> real    6m52.087s
> user    67m12.463s
> sys     13m8.281s
> ...
> 
> real    7m42.937s
> user    66m55.250s
> sys     12m56.330s
> ...
> 
> real    6m49.374s
> user    66m37.040s
> sys     12m44.542s
> ...
> 
> real    6m54.205s
> user    65m49.732s
> sys     11m33.078s
> ...
> 
> likely due to unstable temperatures and I/O latency. As a result, my
> data doesn’t seem reference-worthy.
> 

So I had suggested retrying the experiment to see how reproducible it is,
but had not done that myself!
Thanks for sharing this. I tried many times on the AMD server and I see
varying numbers as well.

AMD 16K THP always, cgroup = 4G, large folio zswapin patches
real    1m28.351s
user    54m14.476s
sys     8m46.596s
zswpin 811693
zswpout 2137310
pgfault 27344671
pgmajfault 290510
..
real    1m24.557s
user    53m56.815s
sys     8m10.200s
zswpin 571532
zswpout 1645063
pgfault 26989075
pgmajfault 205177
..
real    1m26.083s
user    54m5.303s
sys     9m55.247s
zswpin 1176292
zswpout 2910825
pgfault 27286835
pgmajfault 419746

The sys time in particular varies a lot (from 8m10s to 9m55s across the three
runs above). I think you see the same.

> As a phone engineer, we never use phones to run kernel builds. I'm also
> quite certain that phones won't provide stable and reliable data for this
> type of workload. Without access to a Linux server to conduct the test,
> I really need your help.
> 
> I used to work on optimizing the ARM server scheduler and memory
> management, and I really miss that machine I had until three years ago :-)
> 
>>
>>> I need Usama's assistance to identify a suitable patch, as I lack
>>> access to hardware such as AMD machines and ARM servers with TLB
>>> optimization.
>>>
>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@xxxxxxxxx/
>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@xxxxxxxxx/
>>>
>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
>>> Cc: Usama Arif <usamaarif642@xxxxxxxxx>
>>> Cc: David Hildenbrand <david@xxxxxxxxxx>
>>> Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
>>> Cc: Chris Li <chrisl@xxxxxxxxxx>
>>> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
>>> Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
>>> Cc: Kairui Song <kasong@xxxxxxxxxxx>
>>> Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
>>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>>> Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
>>> Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
>>> Cc: Muchun Song <muchun.song@xxxxxxxxx>
>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
>>> ---
>>>  include/linux/memcontrol.h |  9 ++++++++
>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
>>>  mm/memory.c                | 17 ++++++++++++++
>>>  3 files changed, 71 insertions(+)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 524006313b0d..8bcc8f4af39f 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>               long nr_pages);
>>>
>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>> +                             swp_entry_t *entry);
>>> +
>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>>>                                 gfp_t gfp, swp_entry_t entry);
>>>
>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
>>>       return 0;
>>>  }
>>>
>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>> +             swp_entry_t *entry)
>>> +{
>>> +     return 0;
>>> +}
>>> +
>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>>>                       struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>>>  {
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 17af08367c68..f3d92b93ea6d 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>       return 0;
>>>  }
>>>
>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
>>> +{
>>> +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>> +             if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
>>
>> There might be 3 issues with the approach:
>>
>> It's a very big margin. Let's say you have ARM64_64K_PAGES and 256K THP
>> set to always. As HPAGE_PMD is 512M with 64K pages, you are basically
>> saying you need 512M of free memory to swap in just 256K?
> 
> Right, sorry for the noisy code. I was just thinking about 4KB pages
> and wondering if we could simplify the code.
> 
>>
>> It's an uneven margin for different folio sizes.
>> For 16K folio swapin, you are checking if there is margin for 128 folios,
>> but for 1M folio swapin, you are checking there is margin for just 2 folios.
>>
>> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
> 
> Agreed. This is similar to what we discussed regarding your zswap mTHP
> swap-in series:
> 
>  int mem_cgroup_swapin_charge_folio(...)
>  {
>        ...
>        if (folio_test_large(folio) &&
>            mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
>                                           folio_nr_pages(folio)))
>                ret = -ENOMEM;
>        else
>                ret = charge_memcg(folio, memcg, gfp);
>        ...
>  }
> 
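
To make the arithmetic behind the two margin policies concrete, here is a
small userspace sketch (not kernel code). It compares the fixed HPAGE_PMD_NR
margin with the max(MEMCG_CHARGE_BATCH, folio_nr_pages) margin. The 8-byte
page table entry size and the batch value of 64 are assumptions on my side,
and all helper names are made up for illustration.

```c
#include <assert.h>

#define PTE_SIZE_ASSUMED           8UL  /* assumption: 8-byte page table entries */
#define MEMCG_CHARGE_BATCH_ASSUMED 64UL /* assumption: batch size in pages */

/* Bytes mapped by one PMD entry: (entries per page table) * page size. */
static unsigned long pmd_bytes(unsigned long page_size)
{
	return (page_size / PTE_SIZE_ASSUMED) * page_size;
}

/*
 * Fixed policy: the margin is always one PMD worth of memory.
 * Returns how many folios of folio_bytes that margin corresponds to.
 */
static unsigned long folios_in_fixed_margin(unsigned long folio_bytes,
					    unsigned long page_size)
{
	assert(folio_bytes >= page_size && folio_bytes % page_size == 0);
	return pmd_bytes(page_size) / folio_bytes;
}

/*
 * Scaled policy: the margin is max(batch, folio_nr_pages), in base pages.
 */
static unsigned long scaled_margin_pages(unsigned long folio_bytes,
					 unsigned long page_size)
{
	unsigned long nr_pages;

	assert(folio_bytes >= page_size && folio_bytes % page_size == 0);
	nr_pages = folio_bytes / page_size;
	return nr_pages > MEMCG_CHARGE_BATCH_ASSUMED ?
	       nr_pages : MEMCG_CHARGE_BATCH_ASSUMED;
}
```

With 4K pages, folios_in_fixed_margin() gives 128 for a 16K folio and 2 for a
1M folio, which is the unevenness mentioned above. With 64K pages, pmd_bytes()
gives 512M, which is the margin a 256K swapin would be checked against.
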
> As someone focused on phones, my challenge is the absence of stable
> platforms to benchmark this type of workload. If possible, Usama, I would
> greatly appreciate it if you could take the lead on the patch.
> 
>>
>> As Johannes pointed out, the charging code already does the margin check.
>> So for 4K, the check just checks if there is 4K available, but for 16K it checks
>> if a lot more than 16K is available. Maybe there should be a similar policy for
>> all? I guess this is similar to my 2nd point, but just considers 4K folios as
>> well.
> 
> I don't think the charging code performs a margin check. It simply tries
> to charge the specified nr_pages (whether 1 or more). If nr_pages are
> available, the charge proceeds; otherwise, if GFP allows blocking, it
> triggers memory reclamation to reclaim max(SWAP_CLUSTER_MAX, nr_pages)
> base pages.
> 

So if you do not have defrag set to always, it will not trigger reclamation.
I think that is the bigger use case, i.e. defrag=madvise, defer, etc. is
probably used much more than always.

In the current code, try_charge_memcg will in that case return -ENOMEM all
the way up to mem_cgroup_swapin_charge_folio, and alloc_swap_folio will then
try the next order. So even though it might not be calling the
mem_cgroup_margin function, it is kind of doing the same thing?
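
As a rough illustration of that fallback, here is a userspace model (not the
kernel code). It assumes the charge cannot reclaim, so it fails as soon as
the folio does not fit under the limit. struct memcg_model,
try_charge_model() and swapin_pick_order() are hypothetical names for this
sketch only.

```c
#include <assert.h>

#define ENOMEM_MODEL    12 /* stand-in for the kernel's ENOMEM */
#define MAX_ORDER_MODEL 9  /* assumption: PMD order with 4K pages */

/* Minimal model of a memcg: usage and limit, both in base pages. */
struct memcg_model {
	unsigned long limit;
	unsigned long usage;
};

/* Non-reclaiming charge: succeeds only if nr_pages fit under the limit. */
static int try_charge_model(struct memcg_model *cg, unsigned long nr_pages)
{
	assert(cg && cg->usage <= cg->limit);
	if (nr_pages > cg->limit - cg->usage)
		return -ENOMEM_MODEL;
	cg->usage += nr_pages;
	return 0;
}

/*
 * Try each enabled order from largest to smallest, then order 0.
 * Returns the order that was charged, or -1 if even order 0 did not fit
 * (the real order-0 path may be able to reclaim instead of failing).
 */
static int swapin_pick_order(struct memcg_model *cg, unsigned int orders_mask)
{
	int order;

	for (order = MAX_ORDER_MODEL; order >= 0; order--) {
		if (order != 0 && !(orders_mask & (1u << order)))
			continue;
		if (try_charge_model(cg, 1UL << order) == 0)
			return order;
	}
	return -1;
}
```

With 16K, 32K and 64K enabled (orders 2, 3 and 4) and only 3 pages of room
left, every large order fails and the model ends up charging a single page,
which is effectively what a margin check for that folio size would decide.
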

> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
> large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
> immediately filling the memcg.
> 
> Shortly after, smaller folios—typically with blockable GFP—will quickly trigger
> additional reclamation. While nr_pages - 1 subpages of the large folio may not
> be immediately needed, they still occupy enough space to fill the memcg to
> capacity.
> 
> My second point about the mitigation is as follows: For a system (or
> memcg) under severe memory pressure, especially one without hardware TLB
> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> a larger granularity, some internal fragmentation is unavoidable, regardless
> of optimization. Could the mitigation code help in automatically tuning
> this fragmentation?
> 

I agree that enabling mTHP always is not the right thing to do on all
platforms. I also think that enabling mTHP can be good for some workloads
while enabling mTHP swapin along with it may not be.

As you said, when apps switch between foreground and background on Android,
large folio swapin probably makes sense, as you want to bring in all the
pages of the background app as quickly as possible. You also get the TLB
optimizations and the smaller LRU overhead once all the pages have been
brought in.
The Linux kernel build test doesn't really benefit from the TLB optimization
or the smaller LRU overhead, probably because the pages are very short-lived.
So I think it doesn't show the benefit of large folio swapin properly, and
large folio swapin should probably be disabled for this kind of workload,
even though mTHP should be enabled.

I am not sure that the approach we are trying in this patch is the right way:
- This patch makes it a memcg issue, but you could have memcg disabled, and
  then the mitigation being tried here won't apply.
- Instead of this being a large folio swapin issue, is it more of a readahead
  issue? If we use zswap (without the large folio swapin series) and change
  the window to 1 in swap_vma_readahead, we might see an improvement in Linux
  kernel build time when cgroup memory is limited, as readahead would
  probably cause swap thrashing as well.
- Instead of looking at the cgroup margin, maybe we should look at the rate
  of change of workingset_restore_anon? This might be a lot more complicated
  to do, but it is probably the right metric for detecting swap thrashing. It
  also means that this could be used in both the synchronous
  swapcache-skipping path and the swapin_readahead path.
  (Thanks Johannes for suggesting this)
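
To get a feel for the last point from userspace before touching the kernel,
something like the sketch below could sample workingset_restore_anon from
/proc/vmstat and turn two samples into a rate. The helper names and the
threshold are hypothetical; what rate actually indicates thrashing would
have to be found experimentally.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find "name value" in vmstat-formatted text (one counter per line).
 * Returns the value, or -1 if the counter is not present.
 */
static long long vmstat_counter(const char *text, const char *name)
{
	const char *line = text;
	size_t len;

	assert(text && name);
	len = strlen(name);
	while (line && *line) {
		long long val;

		/* Require a space after the name so prefixes do not match. */
		if (!strncmp(line, name, len) && line[len] == ' ')
			return sscanf(line + len + 1, "%lld", &val) == 1 ?
			       val : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Restores per second between two samples taken `seconds` apart. */
static double restore_rate(long long before, long long after, double seconds)
{
	assert(seconds > 0.0);
	return (double)(after - before) / seconds;
}

/* Hypothetical policy: flag thrashing above a caller-chosen threshold. */
static int thrashing_suspected(double rate, double threshold)
{
	return rate > threshold;
}
```

The text passed to vmstat_counter() would be the contents of /proc/vmstat
read at two points in time. An in-kernel version would use the per-memcg
counters directly instead of parsing text.
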

With large folio swapin, I do see a large improvement when considering only
swapin performance and latency, in the same way as you saw with zram.
Maybe the right short-term approach is to have
/sys/kernel/mm/transparent_hugepage/swapin
and have that disabled by default to avoid a regression.
If the workload owner sees a benefit, they can enable it.
I can add this when sending the next version of large folio zswapin, if that
makes sense?
Longer term, I can look at whether we can do something with
workingset_restore_anon to improve things.

Thanks,
Usama