From: Barry Song <v-songbaohua@xxxxxxxx>

In an embedded system like Android, more than half of anonymous memory is
actually stored in swap devices such as zRAM. For instance, when an app is
switched to the background, most of its memory might be swapped out.

Currently we have mTHP features, but unfortunately, without support for
large folio swap-in, once those large folios are swapped out we lose them
immediately, because mTHP is a one-way ticket. This is unacceptable and
reduces mTHP to merely a toy on systems with significant swap utilization.

This patch introduces mTHP swap-in support. For now, we limit mTHP
swap-ins to contiguous swap entries that were likely swapped out from an
mTHP as a whole (a brief illustrative sketch of this check follows the
references below). Additionally, the current implementation only covers
the SWAP_SYNCHRONOUS case. This is the simplest and most common use case,
benefiting millions of Android phones and similar devices with minimal
implementation cost. In this straightforward scenario, large folios are
always exclusive, eliminating the need to handle complex rmap and
swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in.
2. Eliminates fragmentation in swap slots and supports successful
   THP_SWPOUT without fragmentation. Based on the data observed in [1]
   with Chris's and Ryan's THP swap allocation optimization, aligned
   swap-in plays a crucial role in the success of THP_SWPOUT.
3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU
   usage and significantly improving compression ratios. We have another
   patchset to enable mTHP compression and decompression in
   zsmalloc/zRAM [2].

Using the readahead mechanism to decide whether to swap in mTHP does not
seem to be an optimal approach. There is a critical distinction between
pagecache and anonymous pages: pagecache can be evicted and later
retrieved from disk, potentially becoming an mTHP upon retrieval, whereas
anonymous pages must always reside either in memory or in the swapfile.
If we swap in small folios and only afterwards identify adjacent memory
suitable for swapping in as mTHP, the pages already converted to small
folios may never transition back to mTHP, because converting an mTHP into
small folios is irreversible. This introduces the risk of losing all mTHP
over several swap-out and swap-in cycles, let alone losing the benefits
of defragmentation, improved compression ratios, and reduced CPU usage
from mTHP compression/decompression.

Conversely, in deploying mTHP on millions of real-world products with this
feature in OPPO's out-of-tree code [3], we have not observed any
significant increase in memory footprint for 64KiB mTHP based on CONT-PTE
on ARM64.

[1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@xxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@xxxxxxxxx/
[3] OnePlusOSS / android_kernel_oneplus_sm8550
    https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
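To make the "contiguous swap entries" requirement above more concrete,
the sketch below shows the kind of check the synchronous swap-in path
conceptually needs before allocating a large folio: every PTE covering
the naturally aligned mTHP range must be a real swap entry, and the swap
offsets must be contiguous on the same device. The helper name and the
exact set of conditions are illustrative assumptions for this cover
letter, not code taken verbatim from the series.

/*
 * Illustrative sketch only, not the code in this series (helpers are
 * from <linux/swapops.h>). The caller is assumed to hold the PTE lock
 * and to have already verified that the faulting PTE is a genuine swap
 * entry. Check that the nr_pages PTEs covering the naturally aligned
 * mTHP range are swap entries with contiguous offsets on one device.
 */
static bool swap_range_is_contiguous(pte_t *ptep, unsigned int nr_pages)
{
	swp_entry_t first = pte_to_swp_entry(ptep_get(ptep));
	unsigned long offset = swp_offset(first);
	unsigned int i;

	for (i = 1; i < nr_pages; i++) {
		pte_t pte = ptep_get(ptep + i);
		swp_entry_t entry;

		/* must be a swap PTE, not none/present/migration etc. */
		if (!is_swap_pte(pte))
			return false;

		entry = pte_to_swp_entry(pte);
		if (non_swap_entry(entry))
			return false;

		/* same swap device, offsets strictly contiguous */
		if (swp_type(entry) != swp_type(first) ||
		    swp_offset(entry) != offset + i)
			return false;
	}

	return true;
}

In this series' terms, only the SWAP_SYNCHRONOUS, exclusive case goes
through such a check before a folio of the corresponding order is
allocated; anything else falls back to the existing small-folio swap-in
path.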
-v5:
 * Add swap-in control policy according to Ying's proposal. Right now
   only "always" and "never" are supported; later we can extend this to
   "auto";
 * Fix the comment regarding zswap_never_enabled() according to Yosry;
 * Filter out unaligned swp entries earlier;
 * Add a mem_cgroup_swapin_uncharge_swap_nr() helper.

-v4: https://lore.kernel.org/linux-mm/20240629111010.230484-1-21cnbao@xxxxxxxxx/
 Many parts of v3 have been merged into the mm tree with review help from
 Ryan, David, Ying, Chris and others. Thank you very much! This is the
 final part, which allocates large folios and maps them.
 * Use Yosry's zswap_never_enabled(); note there is a bug. I put the bug
   fix in this v4 RFC, though it should be fixed in Yosry's patch;
 * Lots of code improvements (drop the large stack, hold the ptl, etc.)
   according to Yosry's and Ryan's feedback;
 * Rebased on top of the latest mm-unstable and utilized some new helpers
   introduced recently.

-v3: https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@xxxxxxxxx/
 * Avoid overwriting err in __swap_duplicate_nr, pointed out by Yosry,
   thanks!
 * Fix the issue of the folio being charged twice in do_swap_page by
   separating alloc_anon_folio and alloc_swap_folio, as they now differ
   in many ways:
   * memcg charging
   * whether the allocated folio is cleared

-v2: https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@xxxxxxxxx/
 * Lots of code cleanup according to Chris's comments, thanks!
 * Collect Chris's ack tags, thanks!
 * Address David's comment on moving to folio_add_new_anon_rmap for
   !folio_test_anon in do_swap_page, thanks!
 * Remove the MADV_PAGEOUT patch from this series as Ryan will integrate
   it into the swap-out series;
 * Apply Kairui's work "mm/swap: fix race when skipping swapcache" to
   large folio swap-in as well;
 * Fix corrupted (zero-filled) data in two races, zswap and the case
   where part of the entries are in the swapcache while others are not,
   by checking SWAP_HAS_CACHE while swapping in a large folio.

-v1: https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@xxxxxxxxx/#t

Barry Song (3):
  mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for
    large folios swap-in
  mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large
    folios swap-in
  mm: Introduce per-thpsize swapin control policy

Chuanhua Han (1):
  mm: support large folios swapin as a whole for zRAM-like swapfile

 Documentation/admin-guide/mm/transhuge.rst |   6 +
 include/linux/huge_mm.h                    |   1 +
 include/linux/memcontrol.h                 |  12 ++
 include/linux/swap.h                       |   9 +-
 mm/huge_memory.c                           |  44 +++++
 mm/memory.c                                | 212 ++++++++++++++++++---
 mm/swap.h                                  |  10 +-
 mm/swapfile.c                              | 102 ++++++----
 8 files changed, 329 insertions(+), 67 deletions(-)

-- 
2.34.1