Re: [PATCH 0/6] mm: split underutilized THPs

David Hildenbrand <david@xxxxxxxxxx> · Thu, 1 Aug 2024 18:27:21 +0200

On 01.08.24 18:22, Usama Arif wrote:

On 01/08/2024 07:09, Yu Zhao wrote:
On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:

The current upstream default policy for THP is always. However, Meta
uses madvise in production as the current THP=always policy vastly
overprovisions THPs in sparsely accessed memory areas, resulting in
excessive memory pressure and premature OOM killing.
Using madvise + relying on khugepaged has certain drawbacks over
THP=always. Using madvise hints mean THPs aren't "transparent" and
require userspace changes. Waiting for khugepaged to scan memory and
collapse pages into THP can be slow and unpredictable in terms of performance
(i.e. you dont know when the collapse will happen), while production
environments require predictable performance. If there is enough memory
available, its better for both performance and predictability to have
a THP from fault time, i.e. THP=always rather than wait for khugepaged
to collapse it, and deal with sparsely populated THPs when the system is
running out of memory.

This patch-series is an attempt to mitigate the issue of running out of
memory when THP is always enabled. During runtime whenever a THP is being
faulted in or collapsed by khugepaged, the THP is added to a list.
Whenever memory reclaim happens, the kernel runs the deferred_split
shrinker which goes through the list and checks if the THP was underutilized,
i.e. how many of the base 4K pages of the entire THP were zero-filled.
If this number goes above a certain threshold, the shrinker will attempt
to split that THP. Then at remap time, the pages that were zero-filled are
not remapped, hence saving memory. This method avoids the downside of
wasting memory in areas where THP is sparsely filled when THP is always
enabled, while still providing the upside THPs like reduced TLB misses without
having to use madvise.

Meta production workloads that were CPU bound (>99% CPU utilzation) were
tested with THP shrinker. The results after 2 hours are as follows:

                             | THP=madvise |  THP=always   | THP=always
                             |             |               | + shrinker series
                             |             |               | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement     |      -      |    +1.8%      |     +1.7%
(over THP=madvise)          |             |               |
-----------------------------------------------------------------------------
Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
-----------------------------------------------------------------------------
max_ptes_none=409 means that any THP that has more than 409 out of 512
(80%) zero filled filled pages will be split.

To test out the patches, the below commands without the shrinker will
invoke OOM killer immediately and kill stress, but will not fail with
the shrinker:

echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kill stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K

Patches 1-2 add back helper functions that were previously removed
to operate on page lists (needed by patch 3).
Patch 3 is an optimization to free zapped tail pages rather than
waiting for page reclaim or migration.
Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
subpages when splitting THP.
Patches 6 adds support for THP shrinker.

(This patch-series restarts the work on having a THP shrinker in kernel
originally done in
https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@xxxxxx/.
The THP shrinker in this series is significantly different than the
original one, hence its labelled v1 (although the prerequisite to not
remap clean subpages is the same).)

Alexander Zhu (1):
   mm: add selftests to split_huge_page() to verify unmap/zap of zero
     pages

Usama Arif (3):
   Revert "memcg: remove mem_cgroup_uncharge_list()"
   Revert "mm: remove free_unref_page_list()"
   mm: split underutilized THPs

Yu Zhao (2):
   mm: free zapped tail pages when splitting isolated thp
   mm: don't remap unused subpages when splitting isolated thp

  I would recommend shatter [1] instead of splitting so that
1) whoever underutilized their THPs get punished for the overhead;
2) underutilized THPs are kept intact and can be reused by others.

[1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@xxxxxxxxxx/

The objective of this series is to reduce memory usage, while trying to keep the performance benefits you get of using THP=always. Punishing any applications performance is the opposite of what I am trying to do here.
For e.g. if there is only one main application running in production, and its using majority of the THPs, then reducing its performance doesn't make sense.

I'm not sure if there would really be a performance degradation 
regarding the THP, after all we zap PTEs either way.

Shattering will take longer because real migration is involved IIUC.

Also, just going through the commit, and found the line "The advantage of shattering is that it keeps the original THP intact" a bit confusing. I am guessing the THP is freed? i.e. if a 2M THP has 10 non-zero filled base pages and the rest are zero-filled, then after shattering we will have 10*4K memory and not 2M+10*4K? Is it the case the THP is reused at next fault?

The idea is (as I understand it) to free the full THP abck to the buddy, 
replacing the individual pieces that are kept to freshly allocated 
order-0 pages from the buddy.

--
Cheers,

David / dhildenb