On Wed, May 15, 2024 at 1:25 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Thu, May 16, 2024 at 1:49 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> >
> > On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >
> > > On Sat, May 11, 2024 at 9:18 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'd like to propose a session about the allocation and reclamation
> > > > > of mTHP. This is related to Yu Zhao's TAO [1] but not the same.
> > > > >
> > > > > OPPO has implemented mTHP-like large folios across thousands of
> > > > > real Android devices, utilizing ARM64 CONT-PTE. However, we've
> > > > > encountered challenges:
> > > > >
> > > > > - The allocation of mTHP isn't consistently reliable; even after
> > > > >   prolonged use, obtaining large folios remains uncertain. For
> > > > >   instance, after a few hours of operation, the likelihood of
> > > > >   successfully allocating large folios on a phone may drop to
> > > > >   just 2%.
> > > > >
> > > > > - Mixing large and small folios in the same LRU list can lead to
> > > > >   mutual blocking and unpredictable latency during
> > > > >   reclamation/allocation.
> > > >
> > > > I'm also curious how much large folios can improve reclamation
> > > > efficiency. Having large folios is supposed to reduce the scan time
> > > > since there should be fewer folios on the LRU. But IIRC I haven't
> > > > seen much data or benchmarking (particularly real-life workloads)
> > > > regarding this.
> > >
> > > Hi Yang,
> > >
> > > We lack direct data on this matter, but Ryan's THP_SWPOUT series [1]
> > > provides some insight:
> > >
> > > | alloc size | baseline                | + this series |
> > > |            | mm-unstable (~v6.9-rc1) |               |
> > > |:-----------|------------------------:|--------------:|
> > > | 4K Page    |                    0.0% |          1.3% |
> > > | 64K THP    |                  -13.6% |         46.3% |
> > > | 2M THP     |                   91.4% |         89.6% |
> > >
> > > I suspect the -13.6% performance decrease is due to the split
> > > operation. Once the split is eliminated, the patchset observed a
> > > 46.3% increase. Presumably the overhead of reclaiming one 64K folio
> > > is lower than that of reclaiming 16 * 4K folios.
> >
> > Thank you. Actually I care about 4k vs 64k vs 256k ...
> >
> > I did a simple test by calling MADV_PAGEOUT on 4G of memory with the
> > swapout optimization, then measured the time spent in madvise. I saw
> > the time reduced by ~23% between 64k and 4k, but no noticeable
> > further reduction between 64k and larger sizes.
>
> If you do a perf analysis, what do you observe? I suspect that even
> with larger folios, try_to_unmap_one() continues to iterate through
> PTEs individually.

Yes, I think so.

> If we're able to batch the unmapping process for the entire folio, we
> might observe improved performance.

I did profiling on my benchmark; try_to_unmap didn't show up as a hot
spot. The time is actually spent in zram I/O. But batching
try_to_unmap() may show some improvement. Did you do it in your
kernel? It should be worth exploring.

> >
> > Actually I saw such a pattern (performance doesn't scale with page
> > size beyond 64K) with some real-life workload benchmarks. I'm going
> > to talk about it in today's LSF/MM.
> >
> > >
> > > However, at present, on actual Android devices, we are observing
> > > nearly 100% occurrence of anon_thp_swpout_fallback after the device
> > > has been in operation for several hours [2].
> > >
> > > Hence, it is likely that we will experience regression instead of
> > > improvement due to the absence of measures to mitigate swap
> > > fragmentation.
> > >
> > > [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@xxxxxxx/
> > > [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
> > >
> > > >
> > > > >
> > > > > For instance, if you require large folios, the LRU list's tail
> > > > > could be filled with small folios (LF - large folio, SF - small
> > > > > folio):
> > > > >
> > > > > LF - LF - LF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF
> > > > >
> > > > > You might end up reclaiming many small folios yet still struggle
> > > > > to allocate large folios. The inverse scenario can occur when the
> > > > > LRU list's tail is populated with large folios:
> > > > >
> > > > > SF - SF - SF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF
> > > > >
> > > > > In OPPO's products, we allocate dedicated pageblocks solely for
> > > > > large folio allocation, and we've fine-tuned the LRU mechanism to
> > > > > support a dual LRU - one for small folios and another for large
> > > > > ones. Dedicated pageblocks offer a fundamental guarantee of
> > > > > allocating large folios. Additionally, segregating small and
> > > > > large folios into two LRUs ensures that both can be efficiently
> > > > > reclaimed for their respective users' requests. However, the
> > > > > implementation lacks aesthetic appeal, is primarily tailored for
> > > > > product purposes, and isn't fully upstreamable.
> > > > >
> > > > > You can obtain the architectural diagram of OPPO's approach from
> > > > > link [2].
> > > > >
> > > > > Therefore, my plan is to present:
> > > > >
> > > > > - Introduce the architecture of OPPO's mTHP-like approach, which
> > > > >   encompasses additional optimizations we've made to address swap
> > > > >   fragmentation and improve swap performance, such as dual-zRAM
> > > > >   and compression/decompression of large folios [3].
> > > > >
> > > > > - Present OPPO's method of utilizing dedicated pageblocks and a
> > > > >   dual-LRU system for mTHP.
> > > > >
> > > > > - Share our observations from employing Yu Zhao's TAO on Pixel 6
> > > > >   phones.
> > > > >
> > > > > - Discuss our future direction: are we leaning towards TAO or
> > > > >   dedicated pageblocks? If we opt for pageblocks, how do we plan
> > > > >   to resolve the LRU issue?
> > > > >
> > > > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@xxxxxxxxxx/
> > > > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@xxxxxxxxx/
> > > > >
>
> Thanks,
> Barry