Re: [PATCH 0/6] mm: split underutilized THPs

Yu Zhao <yuzhao@xxxxxxxxxx> · Tue, 6 Aug 2024 14:53:33 -0600

On Tue, Aug 6, 2024 at 1:54 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Aug 06, 2024 at 12:06:20PM -0600, Yu Zhao wrote:
> > On Tue, Aug 6, 2024 at 11:38 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 01, 2024 at 12:09:16AM -0600, Yu Zhao wrote:
> > > > On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> > > > >
> > > > > The current upstream default policy for THP is always. However, Meta
> > > > > uses madvise in production as the current THP=always policy vastly
> > > > > overprovisions THPs in sparsely accessed memory areas, resulting in
> > > > > excessive memory pressure and premature OOM killing.
> > > > > Using madvise + relying on khugepaged has certain drawbacks over
> > > > > THP=always. Using madvise hints mean THPs aren't "transparent" and
> > > > > require userspace changes. Waiting for khugepaged to scan memory and
> > > > > collapse pages into THP can be slow and unpredictable in terms of performance
> > > > > (i.e. you dont know when the collapse will happen), while production
> > > > > environments require predictable performance. If there is enough memory
> > > > > available, its better for both performance and predictability to have
> > > > > a THP from fault time, i.e. THP=always rather than wait for khugepaged
> > > > > to collapse it, and deal with sparsely populated THPs when the system is
> > > > > running out of memory.
> > > > >
> > > > > This patch-series is an attempt to mitigate the issue of running out of
> > > > > memory when THP is always enabled. During runtime whenever a THP is being
> > > > > faulted in or collapsed by khugepaged, the THP is added to a list.
> > > > > Whenever memory reclaim happens, the kernel runs the deferred_split
> > > > > shrinker which goes through the list and checks if the THP was underutilized,
> > > > > i.e. how many of the base 4K pages of the entire THP were zero-filled.
> > > > > If this number goes above a certain threshold, the shrinker will attempt
> > > > > to split that THP. Then at remap time, the pages that were zero-filled are
> > > > > not remapped, hence saving memory. This method avoids the downside of
> > > > > wasting memory in areas where THP is sparsely filled when THP is always
> > > > > enabled, while still providing the upside THPs like reduced TLB misses without
> > > > > having to use madvise.
> > > > >
> > > > > Meta production workloads that were CPU bound (>99% CPU utilzation) were
> > > > > tested with THP shrinker. The results after 2 hours are as follows:
> > > > >
> > > > >                             | THP=madvise |  THP=always   | THP=always
> > > > >                             |             |               | + shrinker series
> > > > >                             |             |               | + max_ptes_none=409
> > > > > -----------------------------------------------------------------------------
> > > > > Performance improvement     |      -      |    +1.8%      |     +1.7%
> > > > > (over THP=madvise)          |             |               |
> > > > > -----------------------------------------------------------------------------
> > > > > Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
> > > > > -----------------------------------------------------------------------------
> > > > > max_ptes_none=409 means that any THP that has more than 409 out of 512
> > > > > (80%) zero filled filled pages will be split.
> > > > >
> > > > > To test out the patches, the below commands without the shrinker will
> > > > > invoke OOM killer immediately and kill stress, but will not fail with
> > > > > the shrinker:
> > > > >
> > > > > echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> > > > > mkdir /sys/fs/cgroup/test
> > > > > echo $$ > /sys/fs/cgroup/test/cgroup.procs
> > > > > echo 20M > /sys/fs/cgroup/test/memory.max
> > > > > echo 0 > /sys/fs/cgroup/test/memory.swap.max
> > > > > # allocate twice memory.max for each stress worker and touch 40/512 of
> > > > > # each THP, i.e. vm-stride 50K.
> > > > > # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> > > > > # killer.
> > > > > # Without the shrinker, OOM killer is invoked immediately irrespective
> > > > > # of max_ptes_none value and kill stress.
> > > > > stress --vm 1 --vm-bytes 40M --vm-stride 50K
> > > > >
> > > > > Patches 1-2 add back helper functions that were previously removed
> > > > > to operate on page lists (needed by patch 3).
> > > > > Patch 3 is an optimization to free zapped tail pages rather than
> > > > > waiting for page reclaim or migration.
> > > > > Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
> > > > > subpages when splitting THP.
> > > > > Patches 6 adds support for THP shrinker.
> > > > >
> > > > > (This patch-series restarts the work on having a THP shrinker in kernel
> > > > > originally done in
> > > > > https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@xxxxxx/.
> > > > > The THP shrinker in this series is significantly different than the
> > > > > original one, hence its labelled v1 (although the prerequisite to not
> > > > > remap clean subpages is the same).)
> > > > >
> > > > > Alexander Zhu (1):
> > > > >   mm: add selftests to split_huge_page() to verify unmap/zap of zero
> > > > >     pages
> > > > >
> > > > > Usama Arif (3):
> > > > >   Revert "memcg: remove mem_cgroup_uncharge_list()"
> > > > >   Revert "mm: remove free_unref_page_list()"
> > > > >   mm: split underutilized THPs
> > > > >
> > > > > Yu Zhao (2):
> > > > >   mm: free zapped tail pages when splitting isolated thp
> > > > >   mm: don't remap unused subpages when splitting isolated thp
> > > >
> > > >  I would recommend shatter [1] instead of splitting so that
> > >
> > > I agree with Rik, this seems like a possible optimization, not a
> > > pre-requisite.
>
> Let me just re-iterate, I don't think this discussion has any bearing
> on the THP shrinker.

I'm not sure why you think one of the two ways to free zero pages from
a THP has no bearing on each other.

> There is data corroborating that the shrinker
> as-is is useful today.

I had data that showed it too, before posting the two patches included
in this series over 3 years ago. I have also made additional findings
since then, as I was sharing with you.

> Shattering is an independent proposal for an optimization that should
> be discussed on its own merits.

I'm not trying to derail it. I only listed an option for you to
consider, based on my new findings. And as I said earlier to Rik, I'm
fine with whatever you prefer.

> > > > 1) whoever underutilized their THPs get punished for the overhead;
> > >
> > > Is that true?
> >
> > Yes :)
>
> It actually sounds like the answer is no.
>
> > > The downgrade is done in a shrinker.
> >
> > Ideally, should we charge for the CPU usage of the shrinker and who
> > should we charge it to?
>
> There are two contexts it runs in:
>
> 1) Task tries to allocate a THP. A physical one cannot be found, so
>    direct reclaim and compaction run. The allocating task and its
>    cgroup get charged for this effort, regardless of who underutilized
>    the page, and regardless of whether we split+compact or shatter.
>
> 2) Cgroup tries to charge a THP and hits its limit. The allocating
>    task runs limit reclaim, which invokes the shrinker. The job of
>    this context is to enforce memory quantity, not contiguity. A limit
>    can be hit when the system is 5% utilized, with an abundance of
>    free blocks. With the shrinker shattering, the cgroup would be
>    needlessly punished. Twice in fact:
>
>             a) it would perform unnecessary migration work
>
>             b) splitting retains PTE contiguity on the remaining
>                subpages, which has benefits on CPUs with TLB
>                coalescing. Shattering would disrupt that needlessly.
>
>    This is the wrong context for contiguity work.
>
> So it seems to me that the punishment of "culprits" is not natural,
> reliable, or proportional in any way.

I see why we are disagreeing with each other, because we are viewing
the problem in very different frames.

Never mind. We can find another time in another thread to sync up.

> > > With or without
> > > shattering, the compaction effort will be on the allocation side.
> >
> > If compaction is needed at all.
>
> That's not what I meant.
>
> Shattering IS compaction work. There is a migration source block, and
> it moves the individual pages to somewhere else to produce a free THP.
>
> > > The only difference is that
> > > compaction would do the work on-demand
> >
> > And can often fail to produce 2MB blocks *under memory pressure*
>
> So let's agree that shattering helps if the THP demand is constant.
>
> It doesn't help when THP demand *grows* under pressure. But fixing
> this is kind of critical for THPs to be useful.

Same here.

A quick answer is that I'm thinking (and working) in the frame of THP
memcg-prioritized (fragmentation containerized), which might add
another thing you disagree with.

So yes, that should be a different discussion somewhere else.

> And fixing the latter obviates the need for a special solution to only
> the former.
>
> > > whereas shattering would do it unconditionally
> >
> > And always produces a 2MB block
>
> Which can immediately get fragmented by a racing allocation from an
> unrelated cgroup. So the next step would be compaction capturing for
> the shattering code...
>
> I'm not really sold on this. It's re-inventing compaction in an ad-hoc
> scenario that has limited usefulness, instead of addressing the
> fundamental issues that make compaction inefficient & unreliable.
>
> Allocation context is the appropriate place to determine whether, and
> how much, defragmentation (and all its direct and indirect costs) is
> actually warranted. Not opportunistically deep in the reclaim stack.