[Epilogue] Profile-Guided Heap Optimization and THP fungibility

Yu Zhao <yuzhao@xxxxxxxxxx> · Thu, 29 Feb 2024 11:34:36 -0700

In a nutshell, Profile-Guided Heap Optimization (PGHO) [1] allows
userspace memory allocators, e.g., TCMalloc [2], to:
1. Group memory objects by hotness so that the accessed bit in the PMD
   entry mapping a THP can better reflect the overall hotness of that
   THP. A counterexample is a single hot page shielding the rest of
   the cold pages in that THP from being reclaimed.
2. Group objects by lifetime to reduce the chance of a THP being
   split. Frequent splits increase the entropy of a system, consume
   more physical contiguity and reduce overall performance (due to
   TLB misses [2]).
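
As an illustration of the grouping above, here is a minimal sketch in C.
It is not how TCMalloc or PGHO is implemented: the class names, the
one-region-per-class bump allocator and the 2 MiB region size (a
PMD-mapped THP on x86-64 with 4 KiB base pages) are all assumptions made
for the example. The point is only that objects with the same predicted
hotness and lifetime end up in the same THP-sized region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: one PMD-mapped THP covers 2 MiB. */
#define REGION_SIZE ((size_t)2 << 20)

/* Hypothetical classes: predicted hotness x predicted lifetime. */
enum { COLD_SHORT, COLD_LONG, HOT_SHORT, HOT_LONG, NR_CLASSES };

struct region {
	char *base;	/* REGION_SIZE-aligned backing memory */
	size_t used;	/* bump pointer offset */
};

static struct region regions[NR_CLASSES];

static int class_of(int hot, int long_lived)
{
	return (hot ? 2 : 0) | (long_lived ? 1 : 0);
}

/* Bump-allocate from the region that matches the profile hints. */
static void *class_alloc(int hot, int long_lived, size_t size)
{
	struct region *r = &regions[class_of(hot, long_lived)];

	assert(size > 0 && size <= REGION_SIZE);
	size = (size + 15) & ~(size_t)15;	/* keep 16-byte alignment */

	if (!r->base) {
		r->base = aligned_alloc(REGION_SIZE, REGION_SIZE);
		if (!r->base)
			return NULL;
	}
	if (size > REGION_SIZE - r->used)
		return NULL;	/* sketch only: a single region per class */

	void *p = r->base + r->used;
	r->used += size;
	return p;
}
```

A real allocator would chain many regions per class and free objects
individually; the sketch leaves both out to keep the grouping visible.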

No PGO (PGHO included) can account for every runtime behavior. For
example, an object predicted to be hot or long-lived can turn out to be
cold or short-lived. However, the kernel may not be able to reclaim
the THP containing that object, because the hot objects sharing that
THP keep the accessed bit in its PMD entry set, as described above.
Instead, userspace memory allocators can choose to MADV_COLD or
MADV_FREE that object, so that the kernel does not have to reclaim
other, hot folios or resort to OOM kills. This is part of the whole
process, called THP fungibility, and it ends with the split of the THP
containing that object.
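
For reference, a hedged sketch of what such a hint could look like from
the allocator side. MADV_COLD (Linux 5.4+) and MADV_FREE (Linux 4.5+,
private anonymous memory only) are existing madvise(2) advice values;
the helper names and the policy of trimming the range inward to whole
pages, so that neighboring live objects are not affected, are
assumptions made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef MADV_FREE
#define MADV_FREE 8
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Shrink [addr, addr + len) to the whole pages lying fully inside it.
 * Stores the first such page in *start and returns the byte length,
 * or 0 if the object does not cover any complete page.
 */
static size_t pages_inside(uintptr_t addr, size_t len, size_t page,
			   uintptr_t *start)
{
	assert(page && (page & (page - 1)) == 0);	/* power of two */

	uintptr_t lo = (addr + page - 1) & ~(uintptr_t)(page - 1);
	uintptr_t hi = (addr + len) & ~(uintptr_t)(page - 1);

	*start = lo;
	return hi > lo ? (size_t)(hi - lo) : 0;
}

/*
 * Hint that a mispredicted object is cold (contents still needed) or
 * dead (contents may be discarded). Returns 0 on success or when there
 * is nothing to hint, -1 with errno set if madvise() fails.
 */
static int hint_mispredicted(void *obj, size_t len, int contents_dead)
{
	uintptr_t start;
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t n = pages_inside((uintptr_t)obj, len, page, &start);

	if (n == 0)
		return 0;
	return madvise((void *)start, n,
		       contents_dead ? MADV_FREE : MADV_COLD);
}
```

MADV_COLD keeps the data and only deactivates the pages, so it suits
objects that are cold but still live; MADV_FREE suits objects that the
allocator has already freed.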

The full circle completes after userspace memory allocators recover
the THP from the split above. This part, called MADV_RECOVER, is done
by "collapsing" the pages of the original THP in place. Pages that
have been reused since the split are either reclaimed or migrated so
that they can become free again. Compared with MADV_COLLAPSE,
MADV_RECOVER has the following advantages:
1. It is more likely to succeed, since it does not try to allocate a
   new THP.
2. It does not copy the pages that are already in place and therefore
   has a smaller window during which the hot objects in those pages
   are inaccessible.
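
To make the comparison concrete, here is a small sketch of the
userspace call. MADV_COLLAPSE exists in mainline (Linux 6.1+).
MADV_RECOVER is what this series proposes, and its numeric value is not
given in this message, so the sketch takes the advice as a parameter
rather than guessing it. The helper names and the 2 MiB THP size are
assumptions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

/* Fallback for older libc headers; value matches the Linux UAPI. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: PMD-mapped THPs are 2 MiB (x86-64, 4 KiB base pages). */
#define THP_SIZE ((uintptr_t)2 << 20)

/* Start of the THP-sized, THP-aligned range containing addr. */
static uintptr_t thp_base(uintptr_t addr)
{
	return addr & ~(THP_SIZE - 1);
}

/*
 * Ask the kernel to rebuild the THP around addr_in_thp.
 * advice is MADV_COLLAPSE on a mainline kernel; on a kernel carrying
 * this series it would be MADV_RECOVER, taken from the patched headers.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int rebuild_thp(void *addr_in_thp, int advice)
{
	uintptr_t base = thp_base((uintptr_t)addr_in_thp);

	assert((base & (THP_SIZE - 1)) == 0);
	return madvise((void *)base, THP_SIZE, advice);
}
```

With MADV_COLLAPSE the kernel allocates a fresh THP and copies into it;
per the description above, the proposed advice would instead reuse the
original physical range, which is why the two differ in success rate
and in how long hot objects are inaccessible.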

In essence, THP fungibility is a cooperation between the userspace
memory allocator and TAO to better utilize physical contiguity in a
system. It extends the heuristics for the bin packing problem from the
allocation time (mobility and size as described in Chapter One) to the
runtime (hotness and lifetime). Machine learning is likely to become
the autotuner in the foreseeable future, just as it has with the
software-defined far memory at Google [3].

[1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html
[2] https://www.usenix.org/conference/osdi21/presentation/hunter
[3] https://research.google/pubs/software-defined-far-memory-in-warehouse-scale-computers/
-- 
2.44.0.rc1.240.g4c46232300-goog