Barry Song <21cnbao@xxxxxxxxx> writes:

> On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>
> Ying, thanks!
>
>> Barry Song <21cnbao@xxxxxxxxx> writes:

[snip]

>> > This patch introduces mTHP swap-in support. For now, we limit mTHP
>> > swap-ins to contiguous swaps that were likely swapped out from mTHP as
>> > a whole.
>> >
>> > Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
>> > case. This is the simplest and most common use case, benefiting millions
>>
>> I admit that Android is an important target platform of Linux kernel.
>> But I will not advocate that it's MOST common ...
>
> Okay, I understand that there are still many embedded systems similar
> to Android, even if they are not Android :-)
>
>>
>> > of Android phones and similar devices with minimal implementation
>> > cost. In this straightforward scenario, large folios are always exclusive,
>> > eliminating the need to handle complex rmap and swapcache issues.
>> >
>> > It offers several benefits:
>> > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
>> > swap-out and swap-in.
>> > 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
>> > without fragmentation. Based on the observed data [1] on Chris's and Ryan's
>> > THP swap allocation optimization, aligned swap-in plays a crucial role
>> > in the success of THP_SWPOUT.
>> > 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
>> > and enhancing compression ratios significantly. We have another patchset
>> > to enable mTHP compression and decompression in zsmalloc/zRAM[2].
>> >
>> > Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
>> > to be an optimal approach.
>> > There's a critical distinction between pagecache
>> > and anonymous pages: pagecache can be evicted and later retrieved from disk,
>> > potentially becoming a mTHP upon retrieval, whereas anonymous pages must
>> > always reside in memory or swapfile. If we swap in small folios and identify
>> > adjacent memory suitable for swapping in as mTHP, those pages that have been
>> > converted to small folios may never transition to mTHP. The process of
>> > converting mTHP into small folios remains irreversible. This introduces
>> > the risk of losing all mTHP through several swap-out and swap-in cycles,
>> > let alone losing the benefits of defragmentation, improved compression
>> > ratios, and reduced CPU usage based on mTHP compression/decompression.
>>
>> I understand that the most optimal policy in your use cases may be
>> always swapping-in mTHP in highest order. But, it may be not in some
>> other use cases. For example, relative slow swap devices, non-fault
>> sub-pages swapped out again before usage, etc.
>>
>> So, IMO, the default policy should be the one that can adapt to the
>> requirements automatically. For example, if most non-fault sub-pages
>> will be read/written before being swapped out again, we should swap-in
>> in larger order, otherwise in smaller order. Swap readahead is one
>> possible way to do that. But, I admit that this may not work perfectly
>> in your use cases.
>>
>> Previously I hope that we can start with this automatic policy that
>> helps everyone, then check whether it can satisfy your requirements
>> before implementing the optimal policy for you. But it appears that you
>> don't agree with this.
>>
>> Based on the above, IMO, we should not use your policy as default at
>> least for now. A user space interface can be implemented to select
>> different swap-in order policy similar as that of mTHP allocation order
>> policy.
>> We need a different policy because the performance characters
>> of the memory allocation is quite different from that of swap-in. For
>> example, the SSD reading could be much slower than the memory
>> allocation. With the policy selection, I think that we can implement
>> mTHP swap-in for non-SWAP_SYNCHRONOUS too. Users need to know what they
>> are doing.
>
> Agreed. Ryan also suggested something similar before.
> Could we add this user policy by:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
> which could be 0 or 1, I assume we don't need so many "always inherit
> madvise never"?
>
> Do you have any suggestions regarding the user interface?

/sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled

looks good to me. To be consistent with "enabled" in the same
directory, and more importantly, to be extensible, I think that it's
better to start with at least "always never". I believe that we will
add "auto" in the future to tune automatically. Which can be used as
default finally.

--
Best Regards,
Huang, Ying