Re: [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile

Barry Song <21cnbao@xxxxxxxxx> · Wed, 3 Jul 2024 20:32:22 +1200

On Wed, Jul 3, 2024 at 7:58 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >
>
> Ying, thanks!
>
> > Barry Song <21cnbao@xxxxxxxxx> writes:
> >
> > > From: Barry Song <v-songbaohua@xxxxxxxx>
> > >
> > > In an embedded system like Android, more than half of anonymous memory is
> > > actually stored in swap devices such as zRAM. For instance, when an app
> > > is switched to the background, most of its memory might be swapped out.
> > >
> > > Currently, we have mTHP features, but unfortunately, without support
> > > for large folio swap-ins, once those large folios are swapped out,
> > > we lose them immediately because mTHP is a one-way ticket.
> >
> > Not exactly a one-way ticket, we have (or will have) khugepaged.  But I
> > admit that it may not be good enough for you.
>
> That's right. From what I understand, khugepaged so far only supports
> PMD THP?
> Moreover, I have concerns that khugepaged might not be suitable for all
> mTHPs, for the following reasons:
>
> 1. The lifecycle of an mTHP might not be that long. We pay the cost of
> the collapse, but the folio could be swapped out right after that. We
> expect a THP to be durable and not become obsolete quickly, given the
> significant cost we paid for it.
>
> 2. An mTHP might not be large enough to justify a collapse. For example,
> if we can find an effective method, such as Yu's TAO or others, we can
> achieve a high success rate in mTHP allocations at minimal cost rather
> than depending on compaction/collapse.
>
> 3. It could be a significant challenge to manage the collapse, unmap,
> and map operations with respect to the power consumption of phones,
> considering that the number of mTHPs could be much larger than the
> number of PMD-mapped THPs. These operations could happen quite often.
>
> >
> > > This is unacceptable and reduces mTHP to merely a toy on systems
> > > with significant swap utilization.
> >
> > May be true in your systems.  May be not in some other systems.
>
> I agree that this isn't a concern for systems without significant
> swapout and swapin activity.
> However, on Android, where we frequently switch between applications
> like YouTube, Chrome, Zoom, WeChat, Alipay, TikTok, and others, swapping
> could occur throughout the day :-)
>
> >
> > > This patch introduces mTHP swap-in support. For now, we limit mTHP
> > > swap-ins to contiguous swaps that were likely swapped out from mTHP as
> > > a whole.
> > >
> > > Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
> > > case. This is the simplest and most common use case, benefiting millions
> >
> > I admit that Android is an important target platform of the Linux
> > kernel.  But I will not advocate that it's the MOST common ...
>
> Okay, I understand that there are still many embedded systems similar
> to Android, even if they are not Android :-)
>
> >
> > > of Android phones and similar devices with minimal implementation
> > > cost. In this straightforward scenario, large folios are always exclusive,
> > > eliminating the need to handle complex rmap and swapcache issues.
> > >
> > > It offers several benefits:
> > > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
> > >    swap-out and swap-in.
> > > 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
> > >    without fragmentation. Based on the observed data [1] on Chris's and Ryan's
> > >    THP swap allocation optimization, aligned swap-in plays a crucial role
> > >    in the success of THP_SWPOUT.
> > > 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
> > >    and enhancing compression ratios significantly. We have another patchset
> > >    to enable mTHP compression and decompression in zsmalloc/zRAM[2].
> > >
> > > Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
> > > to be an optimal approach. There's a critical distinction between pagecache
> > > and anonymous pages: pagecache can be evicted and later retrieved from disk,
> > > potentially becoming a mTHP upon retrieval, whereas anonymous pages must
> > > always reside in memory or swapfile. If we swap in small folios and identify
> > > adjacent memory suitable for swapping in as mTHP, those pages that have been
> > > converted to small folios may never transition to mTHP. The process of
> > > converting mTHP into small folios remains irreversible. This introduces
> > > the risk of losing all mTHP through several swap-out and swap-in cycles,
> > > let alone losing the benefits of defragmentation, improved compression
> > > ratios, and reduced CPU usage based on mTHP compression/decompression.
> >
> > I understand that the optimal policy in your use cases may be
> > always swapping in mTHP at the highest order.  But it may not be in some
> > other use cases.  For example, relatively slow swap devices, non-fault
> > sub-pages swapped out again before usage, etc.
> >
> > So, IMO, the default policy should be the one that can adapt to the
> > requirements automatically.  For example, if most non-fault sub-pages
> > will be read/written before being swapped out again, we should swap-in
> > in larger order, otherwise in smaller order.  Swap readahead is one
> > possible way to do that.  But, I admit that this may not work perfectly
> > in your use cases.
> >
> > Previously I hoped that we could start with this automatic policy that
> > helps everyone, then check whether it can satisfy your requirements
> > before implementing the optimal policy for you.  But it appears that you
> > don't agree with this.
> >
> > Based on the above, IMO, we should not use your policy as the default,
> > at least for now.  A user space interface can be implemented to select
> > a swap-in order policy, similar to that of the mTHP allocation order
> > policy.  We need a different policy because the performance
> > characteristics of memory allocation are quite different from those of
> > swap-in.  For example, SSD reads could be much slower than memory
> > allocation.  With the policy selection, I think that we can implement
> > mTHP swap-in for non-SWAP_SYNCHRONOUS too.  Users need to know what they
> > are doing.
>
> Agreed. Ryan also suggested something similar before.
> Could we add this user policy by:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
> which could be 0 or 1. I assume we don't need the full set of
> "always inherit madvise never" values?

I actually meant:

First, we respect the existing THP policy, and then we apply
swapin_enabled after checking both the allowable and the suitable
orders. Pseudocode like this:

        orders = thp_vma_allowable_orders(vma, vma->vm_flags,
                        TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);

        orders = thp_swapin_allowable_order(orders);
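
To make the last step more concrete, below is a minimal userspace sketch
of how the filter might behave, assuming each per-size swapin_enabled
knob is stored as one bit in an order bitmask. Nothing here is existing
kernel code: swapin_enabled_orders, set_swapin_enabled() and
highest_order() are hypothetical names used only for illustration, the
body of thp_swapin_allowable_order() is a guess at one possible
implementation, and PMD_ORDER is assumed to be 9 (4KiB base pages).

```c
#include <assert.h>

#define PMD_ORDER 9                /* assumption: 4KiB pages, 2MiB PMD */
#define BIT(n)    (1UL << (n))

/*
 * Hypothetical backing store for the per-size swapin_enabled knobs:
 * bit N set means swap-in of order-N folios is enabled.
 */
static unsigned long swapin_enabled_orders;

/* Hypothetical helper standing in for a write of 0/1 to one sysfs knob. */
static void set_swapin_enabled(unsigned int order, int enabled)
{
	assert(order < PMD_ORDER);
	if (enabled)
		swapin_enabled_orders |= BIT(order);
	else
		swapin_enabled_orders &= ~BIT(order);
}

/*
 * Keep only the orders that already passed the allowable/suitable
 * checks AND have swap-in enabled for that size.
 */
static unsigned long thp_swapin_allowable_order(unsigned long orders)
{
	return orders & swapin_enabled_orders;
}

/* Highest remaining order, or -1 meaning "fall back to a small folio". */
static int highest_order(unsigned long orders)
{
	int order;

	for (order = PMD_ORDER - 1; order >= 0; order--)
		if (orders & BIT(order))
			return order;
	return -1;
}
```

With only the 64KiB knob (order 4) enabled, a fault whose allowable and
suitable orders are 2 and 4 would end up with order 4 only; with the
knob cleared, the mask becomes empty and the fault falls back to a
small folio.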

>
> Do you have any suggestions regarding the user interface?
>
> >
> > > Conversely, in deploying mTHP on millions of real-world products with this
> > > feature in OPPO's out-of-tree code[3], we haven't observed any significant
> > > increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@xxxxxxxxx/
> > > [2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@xxxxxxxxx/
> > > [3] OnePlusOSS / android_kernel_oneplus_sm8550
> > > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> > >
> >
> > [snip]
> >
> > --
> > Best Regards,
> > Huang, Ying
>
> Thanks
> Barry