On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Sat, May 11, 2024 at 9:18 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> >
> > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > I'd like to propose a session about the allocation and reclamation
> > > of mTHP. This is related to Yu Zhao's TAO [1] but not the same.
> > >
> > > OPPO has shipped mTHP-like large folios on thousands of genuine
> > > Android devices, using ARM64 CONT-PTE. However, we've encountered
> > > challenges:
> > >
> > > - mTHP allocation isn't consistently reliable; even after prolonged
> > >   uptime, obtaining large folios remains uncertain. For instance,
> > >   after a few hours of operation, the success rate of allocating
> > >   large folios on a phone can drop to just 2%.
> > >
> > > - Mixing large and small folios in the same LRU list can lead to
> > >   mutual blocking and unpredictable latency during reclamation and
> > >   allocation.
> >
> > I'm also curious how much large folios can improve reclamation
> > efficiency. Having large folios is supposed to reduce scan time,
> > since there should be fewer folios on the LRU. But IIRC I haven't
> > seen much data or benchmarking (particularly from real-life
> > workloads) regarding this.
>
> Hi Yang,
>
> We don't have direct data on this, but Ryan's THP_SWPOUT series [1]
> provides some insight:
>
> | alloc size |                baseline | + this series |
> |            | mm-unstable (~v6.9-rc1) |               |
> |:-----------|------------------------:|--------------:|
> | 4K Page    |                    0.0% |          1.3% |
> | 64K THP    |                  -13.6% |         46.3% |
> | 2M THP     |                   91.4% |         89.6% |
>
> I suspect the -13.6% performance decrease is due to the split
> operation. Once the split is eliminated, the patchset observes a 46.3%
> improvement, presumably because reclaiming one 64K folio is cheaper
> than reclaiming 16 * 4K pages.

Thank you. Actually I care about 4K vs 64K vs 256K ... I did a simple
test: with the swap-out optimization applied, I called MADV_PAGEOUT on
4G of memory and measured the time spent in madvise. The time was
reduced by ~23% for 64K vs 4K, but there was no noticeable reduction
between 64K and larger sizes. I saw the same pattern (performance stops
scaling with folio size past 64K) in some real-life workload
benchmarks. I'm going to talk about it at today's LSF/MM.
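In case it helps others reproduce the numbers, below is a rough sketch
of that kind of test, not the exact harness I used (single pass, no
warm-up; MADV_PAGEOUT needs Linux >= 5.4, a 64-bit build is assumed,
and which folio sizes get allocated depends on the mTHP sizes enabled
under /sys/kernel/mm/transparent_hugepage/):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <time.h>

  #ifndef MADV_PAGEOUT
  #define MADV_PAGEOUT 21              /* reclaim pages, Linux >= 5.4 */
  #endif

  int main(void)
  {
          size_t len = 4UL << 30;      /* 4G, as in the test above */
          char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          struct timespec t0, t1;

          if (buf == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /* Fault the range in so there is something to swap out;
           * with 64K mTHP enabled, this populates 64K folios. */
          memset(buf, 1, len);

          clock_gettime(CLOCK_MONOTONIC, &t0);
          if (madvise(buf, len, MADV_PAGEOUT))
                  perror("madvise(MADV_PAGEOUT)");
          clock_gettime(CLOCK_MONOTONIC, &t1);

          printf("MADV_PAGEOUT: %.3f s\n",
                 (t1.tv_sec - t0.tv_sec) +
                 (t1.tv_nsec - t0.tv_nsec) / 1e9);
          return 0;
  }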
> However, at present, on actual Android devices, we observe a nearly
> 100% rate of anon_thp_swpout_fallback after the device has been
> running for several hours [2]. Hence, without measures to mitigate
> swap fragmentation, we are likely to see a regression rather than an
> improvement.
>
> [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@xxxxxxx/
> [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
>
> > > For instance, if you require large folios, the LRU list's tail
> > > could be filled with small folios (LF - large folio, SF - small
> > > folio):
> > >
> > > LF - LF - LF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF
> > >
> > > You might end up reclaiming many small folios yet still struggle
> > > to allocate large folios. Conversely, the inverse scenario can
> > > occur when the LRU list's tail is populated with large folios:
> > >
> > > SF - SF - SF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF
> > >
> > > In OPPO's products, we allocate dedicated pageblocks solely for
> > > large folio allocation, and we've modified the LRU mechanism to
> > > maintain a dual LRU: one for small folios and another for large
> > > ones. Dedicated pageblocks offer a fundamental guarantee that
> > > large folios can be allocated, while segregating small and large
> > > folios into two LRUs ensures that both can be reclaimed
> > > efficiently for their respective requests. However, the
> > > implementation lacks aesthetic appeal and is tailored primarily to
> > > product needs, so it isn't fully upstreamable.
> > >
> > > An architectural diagram of OPPO's approach is available at [2].
> > >
> > > Therefore, I plan to:
> > >
> > > - Introduce the architecture of OPPO's mTHP-like approach,
> > >   including the additional optimizations we've made to address
> > >   swap fragmentation and improve swap performance, such as
> > >   dual-zRAM and compression/decompression of large folios [3].
> > >
> > > - Present OPPO's method of using dedicated pageblocks and a
> > >   dual-LRU system for mTHP.
> > >
> > > - Share our observations from deploying Yu Zhao's TAO on Pixel 6
> > >   phones.
> > >
> > > - Discuss our future direction: are we leaning towards TAO or
> > >   dedicated pageblocks? If we opt for pageblocks, how do we plan
> > >   to resolve the LRU issue?
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@xxxxxxxxxx/
> > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@xxxxxxxxx/
>
> Thanks,
> Barry
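For readers following the thread, here is a toy userspace model of the
dual-LRU selection Barry describes above. This is my own sketch of the
concept, not OPPO's kernel code; names like lru_add and reclaim are
illustrative, and real LRU handling (refcounts, locking, active vs
inactive lists) is entirely omitted:

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* One folio on an LRU list; 'large' stands in for e.g. a 64K
   * CONT-PTE folio vs a 4K page. */
  struct folio {
          int id;
          bool large;
          struct folio *next;
  };

  /* A singly linked FIFO: 'cold' is the reclaim end (the LRU "tail"
   * in the discussion above), 'hot' is where new folios enter. */
  struct lru {
          struct folio *cold, *hot;
  };

  static void lru_add(struct lru *lru, struct folio *f)
  {
          f->next = NULL;
          if (lru->hot)
                  lru->hot->next = f;
          else
                  lru->cold = f;
          lru->hot = f;
  }

  /* With a single LRU, a large-folio request may have to churn through
   * many small folios at the cold end first. With two LRUs, each
   * request reclaims directly from the matching list. */
  static struct folio *reclaim(struct lru *small, struct lru *large,
                               bool want_large)
  {
          struct lru *lru = want_large ? large : small;
          struct folio *victim = lru->cold;

          if (victim) {
                  lru->cold = victim->next;
                  if (!lru->cold)
                          lru->hot = NULL;
          }
          return victim;
  }

  int main(void)
  {
          struct lru small = { 0 }, large = { 0 };

          /* Interleave small and large folios, as aging on a single
           * shared LRU would. */
          for (int i = 0; i < 8; i++) {
                  struct folio *f = malloc(sizeof(*f));

                  f->id = i;
                  f->large = (i % 2 == 0);
                  lru_add(f->large ? &large : &small, f);
          }

          /* A large-folio request finds a large victim immediately,
           * without reclaiming any small folios first. */
          struct folio *v = reclaim(&small, &large, true);

          printf("reclaimed %s folio %d\n",
                 v->large ? "large" : "small", v->id);
          free(v);
          return 0;
  }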