On Wed, Jul 3, 2024 at 7:58 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >
>
> Ying, thanks!
>
> > Barry Song <21cnbao@xxxxxxxxx> writes:
> >
> > > From: Barry Song <v-songbaohua@xxxxxxxx>
> > >
> > > In an embedded system like Android, more than half of anonymous memory is
> > > actually stored in swap devices such as zRAM. For instance, when an app
> > > is switched to the background, most of its memory might be swapped out.
> > >
> > > Currently, we have mTHP features, but unfortunately, without support
> > > for large folio swap-ins, once those large folios are swapped out,
> > > we lose them immediately because mTHP is a one-way ticket.
> >
> > Not exactly a one-way ticket, we have (or will have) khugepaged. But I
> > admit that it may not be good enough for you.
>
> That's right. From what I understand, khugepaged only supports PMD THP
> so far?
> Moreover, I have concerns that khugepaged might not be suitable for
> all mTHPs, for the following reasons:
>
> 1. The lifecycle of an mTHP might not be that long. We pay the cost of
> the collapse, but the folio could be swapped out right after that. We
> expect a THP to be durable and not become obsolete quickly, given the
> significant price we paid for it.
>
> 2. An mTHP's size might not be substantial enough to justify a collapse.
> For example, if we can find an effective method, such as Yu's TAO or
> others, we can achieve a high success rate in mTHP allocations at a
> minimal cost rather than depending on compaction/collapse.
>
> 3. It could be a significant challenge to manage the collapse, unmap,
> and map processes with respect to the power consumption of phones,
> considering that the number of mTHPs could be much larger than the
> number of PMD-mapped THPs. This could happen quite often.
>
> >
> > > This is unacceptable and reduces mTHP to merely a toy on systems
> > > with significant swap utilization.
> >
> > May be true in your systems.
> > May be not in some other systems.
>
> I agree that this isn't a concern for systems without significant
> swapout and swapin activity.
> However, on Android, where we frequently switch between applications
> like YouTube, Chrome, Zoom, WeChat, Alipay, TikTok, and others,
> swapping could occur throughout the day :-)
>
> >
> > > This patch introduces mTHP swap-in support. For now, we limit mTHP
> > > swap-ins to contiguous swaps that were likely swapped out from mTHP as
> > > a whole.
> > >
> > > Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
> > > case. This is the simplest and most common use case, benefiting millions
> >
> > I admit that Android is an important target platform of Linux kernel.
> > But I will not advocate that it's MOST common ...
>
> Okay, I understand that there are still many embedded systems similar
> to Android, even if they are not Android :-)
>
> >
> > > of Android phones and similar devices with minimal implementation
> > > cost. In this straightforward scenario, large folios are always exclusive,
> > > eliminating the need to handle complex rmap and swapcache issues.
> > >
> > > It offers several benefits:
> > > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
> > > swap-out and swap-in.
> > > 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
> > > without fragmentation. Based on the observed data [1] on Chris's and Ryan's
> > > THP swap allocation optimization, aligned swap-in plays a crucial role
> > > in the success of THP_SWPOUT.
> > > 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
> > > and enhancing compression ratios significantly. We have another patchset
> > > to enable mTHP compression and decompression in zsmalloc/zRAM[2].
> > >
> > > Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
> > > to be an optimal approach.
> > > There's a critical distinction between pagecache
> > > and anonymous pages: pagecache can be evicted and later retrieved from disk,
> > > potentially becoming a mTHP upon retrieval, whereas anonymous pages must
> > > always reside in memory or swapfile. If we swap in small folios and identify
> > > adjacent memory suitable for swapping in as mTHP, those pages that have been
> > > converted to small folios may never transition to mTHP. The process of
> > > converting mTHP into small folios remains irreversible. This introduces
> > > the risk of losing all mTHP through several swap-out and swap-in cycles,
> > > let alone losing the benefits of defragmentation, improved compression
> > > ratios, and reduced CPU usage based on mTHP compression/decompression.
> >
> > I understand that the optimal policy in your use cases may be to
> > always swap in mTHP at the highest order. But it may not be in some
> > other use cases. For example, relatively slow swap devices, non-fault
> > sub-pages swapped out again before usage, etc.
> >
> > So, IMO, the default policy should be the one that can adapt to the
> > requirements automatically. For example, if most non-fault sub-pages
> > will be read/written before being swapped out again, we should swap in
> > at a larger order, otherwise at a smaller order. Swap readahead is one
> > possible way to do that. But, I admit that this may not work perfectly
> > in your use cases.
> >
> > Previously I hoped that we could start with this automatic policy that
> > helps everyone, then check whether it can satisfy your requirements
> > before implementing the optimal policy for you. But it appears that you
> > don't agree with this.
> >
> > Based on the above, IMO, we should not use your policy as default at
> > least for now. A user space interface can be implemented to select
> > a swap-in order policy, similar to that of the mTHP allocation order
> > policy.
> > We need a different policy because the performance characteristics
> > of memory allocation are quite different from those of swap-in. For
> > example, the SSD reading could be much slower than the memory
> > allocation. With the policy selection, I think that we can implement
> > mTHP swap-in for non-SWAP_SYNCHRONOUS too. Users need to know what they
> > are doing.
>
> Agreed. Ryan also suggested something similar before.
> Could we add this user policy by:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
> which could be 0 or 1, I assume we don't need so many "always inherit
> madvise never"?

I actually meant: first, we respect the existing THP policy; then we
apply swapin_enabled after checking both the allowable and the suitable
orders. Pseudo code like this:

        orders = thp_vma_allowable_orders(vma, vma->vm_flags,
                        TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
        orders = thp_swapin_allowable_order(orders);

>
> Do you have any suggestions regarding the user interface?
>
> >
> > > Conversely, in deploying mTHP on millions of real-world products with this
> > > feature in OPPO's out-of-tree code[3], we haven't observed any significant
> > > increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@xxxxxxxxx/
> > > [2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@xxxxxxxxx/
> > > [3] OnePlusOSS / android_kernel_oneplus_sm8550
> > > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> > >
> >
> > [snip]
> >
> > --
> > Best Regards,
> > Huang, Ying
>
> Thanks
> Barry
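
The three-step order filtering from the pseudo code in this reply can be modelled in user space to show how the bitmasks compose. This is only a rough sketch under stated assumptions, not kernel code: every `sketch_*` name is hypothetical, `SKETCH_PMD_ORDER` assumes 4KiB base pages with a 2MiB PMD, and the `swapin_enabled` mask stands in for the per-size sysfs knob that is merely proposed in this thread. The allowable and suitable masks are passed in as plain values rather than computed from a VMA.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: 4KiB base pages and 2MiB PMD, so PMD order is 9. */
#define SKETCH_PMD_ORDER 9

/* Return the index of the highest set bit, or -1 if no bit is set. */
static int sketch_highest_order(unsigned long orders)
{
    int order = -1;

    while (orders) {
        order++;
        orders >>= 1;
    }
    return order;
}

/*
 * Hypothetical model of the proposed filtering. Bit N set in a mask means
 * "order N is permitted" at that stage:
 *   allowable      - stand-in for the existing THP policy result
 *   suitable       - stand-in for the alignment/fit check on the fault address
 *   swapin_enabled - stand-in for the proposed per-size swapin_enabled knobs
 * Returns the largest order that survives all three filters, or 0 to mean
 * "fall back to a small folio".
 */
int sketch_pick_swapin_order(unsigned long allowable,
                             unsigned long suitable,
                             unsigned long swapin_enabled)
{
    unsigned long orders;
    int order;

    assert(SKETCH_PMD_ORDER < (int)(sizeof(unsigned long) * CHAR_BIT));

    /* Only orders below PMD order are candidates, as in BIT(PMD_ORDER) - 1. */
    orders = allowable & ((1UL << SKETCH_PMD_ORDER) - 1);
    /* Drop orders that do not fit around the faulting address. */
    orders &= suitable;
    /* Finally apply the swap-in specific policy on top. */
    orders &= swapin_enabled;

    order = sketch_highest_order(orders);
    return order < 0 ? 0 : order;
}
```

With orders 2 and 4 both allowable and suitable but only order 4 enabled for swap-in, the sketch picks order 4; with nothing enabled for swap-in it falls back to order 0, regardless of the allocation policy.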