Re: [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations

Barry Song <21cnbao@xxxxxxxxx> · Tue, 5 Mar 2024 21:37:43 +1300

> TAO is an umbrella project aiming at a better economy of physical
> contiguity viewed as a valuable resource. A few examples are:
> 1. A multi-tenant system can have guaranteed THP coverage while
>    hosting abusers/misusers of the resource.
> 2. Abusers/misusers, e.g., workloads excessively requesting and then
>    splitting THPs, should be punished if necessary.
> 3. Good citizens should be awarded with, e.g., lower allocation
>    latency and less cost of metadata (struct page).

I think TAO or similar optimization in buddy is essential to the
success of mTHP.

Ryan's recent mTHP work can bring multi-size large folios to a wide
range of products for which a full THP might be too large.

The pain point is that the buddy allocator on a real device with
limited memory can become seriously fragmented after it has been
running for some time.

We (OPPO) have actually deployed mTHP-like features on millions of
phones, even on 5.4, 5.10, 5.15 and 6.1 kernels, using 64KiB large
folios to leverage ARM64's CONT-PTE. The open source code for kernel
6.1 is available here[1]. We found that the success rate of 64KiB
allocation can be very low after running monkey[2] on a phone for
one hour.

The data below was collected during the second hour of the run
(from 60 to 120 minutes). Without a TAO-like optimization to the
existing buddy allocator, 64KiB large folio allocations fell back
to small folios 92.35% of the time in do_anonymous_page().

thp_do_anon_pages_fallback / (thp_do_anon_pages + thp_do_anon_pages_fallback)
25807330 / 27944523 =  0.9235

Here, thp_do_anon_pages_fallback counts the times do_anonymous_page()
tried to allocate a 64KiB folio, failed, and used small folios
instead; thp_do_anon_pages counts the times the 64KiB allocation
succeeded.
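
For anyone who wants to redo the arithmetic, here is a minimal sketch.
The two totals are the ones reported above; the success count is
derived by subtraction, and the function name fallback_rate is only
illustrative, not something from our kernel tree.

```python
def fallback_rate(success: int, fallback: int) -> float:
    """Fraction of 64KiB attempts that fell back to small folios."""
    total = success + fallback
    if total == 0:
        return 0.0  # no attempts recorded, so no fallbacks either
    return fallback / total


# Counter values from the second hour of the monkey run.
FALLBACK = 25_807_330          # thp_do_anon_pages_fallback
TOTAL = 27_944_523             # thp_do_anon_pages + thp_do_anon_pages_fallback
SUCCESS = TOTAL - FALLBACK     # thp_do_anon_pages, derived by subtraction

rate = fallback_rate(SUCCESS, FALLBACK)
print(f"successes: {SUCCESS}, fallback rate: {rate:.4f}")
```

In other words, only about 2.1 million of roughly 27.9 million
attempts actually got a 64KiB folio.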

So this number means mTHP loses the vast majority of its value on a
fragmented system, and fragmentation is always present on a phone.

This pushed us to implement a similar optimization that avoids
splitting 64KiB blocks and rewards 64KiB allocations with lower
latency. Our implementation differs from TAO: rather than adding
new zones, we add migration types that mark some pageblocks as
dedicated to mTHP allocation, and we avoid splitting those
pageblocks into lower orders except in some corner cases. This has
significantly improved the success rate of 64KiB large folio
allocation and decreased its latency, which finally made large
folios usable in real products.
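
To illustrate why dedicating pageblocks helps, here is a toy model.
It is not our kernel code and not the real buddy allocator: ToyMemory,
its methods and large_successes() are all hypothetical names, memory
is just a list of 16-page chunks (one 64KiB folio each, assuming 4KiB
base pages), and "reserved" chunks stand in for pageblocks marked with
the new migration type.

```python
PAGES_PER_FOLIO = 16  # 64KiB folio / 4KiB base page (assumed page size)


class ToyMemory:
    """Toy model: the first nr_reserved chunks only serve 64KiB requests."""

    def __init__(self, nr_chunks: int, nr_reserved: int):
        self.used = [[False] * PAGES_PER_FOLIO for _ in range(nr_chunks)]
        self.nr_reserved = nr_reserved

    def alloc_small(self):
        # Small pages may only come from non-reserved chunks.
        for c in range(self.nr_reserved, len(self.used)):
            for p in range(PAGES_PER_FOLIO):
                if not self.used[c][p]:
                    self.used[c][p] = True
                    return (c, p)
        return None  # a real kernel would reclaim or fall back here

    def free_small(self, handle):
        c, p = handle
        self.used[c][p] = False

    def alloc_large(self) -> bool:
        # A 64KiB folio needs a completely free, aligned chunk.
        for chunk in self.used:
            if not any(chunk):
                chunk[:] = [True] * PAGES_PER_FOLIO
                return True
        return False


def large_successes(nr_reserved: int, nr_chunks: int = 8, attempts: int = 4) -> int:
    mem = ToyMemory(nr_chunks, nr_reserved)
    # Fill memory with small pages, then free every other one: half of the
    # usable memory is free again, but it is scattered in single pages.
    wanted = nr_chunks * PAGES_PER_FOLIO
    handles = [h for h in (mem.alloc_small() for _ in range(wanted)) if h]
    for h in handles[::2]:
        mem.free_small(h)
    return sum(mem.alloc_large() for _ in range(attempts))


print("no reserved chunks :", large_successes(0))
print("2 reserved chunks  :", large_successes(2))
```

Without reserved chunks, half of the memory is free but no 64KiB
request can be served. With two reserved chunks, two requests
succeed, at the cost of small allocations running out earlier, which
is the kind of corner case where a real implementation has to decide
whether to split a dedicated pageblock.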

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/
[2] https://developer.android.com/studio/test/other-testing-tools/monkey

> 4. Better interoperability with userspace memory allocators when
>    transacting the resource.
> 
> This project puts the same emphasis on the established use case for
> servers and the emerging use case for clients so that client workloads
> like Android and ChromeOS can leverage the recent multi-sized THPs
> [1][2].

> Chapter One introduces the cornerstone of TAO: an abstraction called
> policy (virtual) zones, which are overlayed on the physical zones.
> This is in line with item 1 above.
> 
> A new door is open after Chapter One. The following two chapters
> discuss the reverse of THP collapsing, called THP shattering, and THP
> HVO, which brings the hugeTLB feature [3] to THP. They are in line
> with items 2 & 3 above.
> 
> Advanced use cases are discussed in Epilogue, since they require the
> cooperation of userspace memory allocators. This is in line with item
> 4 above.
> 
> [1] https://lwn.net/Articles/932386/
> [2] https://lwn.net/Articles/937239/
> [3] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
> 
> Yu Zhao (4):
>   THP zones: the use cases of policy zones
>   THP shattering: the reverse of collapsing
>   THP HVO: bring the hugeTLB feature to THP
>   Profile-Guided Heap Optimization and THP fungibility

Thanks
Barry