Hi All,

On Wed, Aug 21, 2024 at 3:46 PM <hanchuanhua@xxxxxxxx> wrote:
>
> From: Chuanhua Han <hanchuanhua@xxxxxxxx>
>
> Currently, we have mTHP features, but unfortunately, without support for
> large folio swap-ins, once these large folios are swapped out, they are
> lost because mTHP swap is a one-way process. The lack of mTHP swap-in
> functionality prevents mTHP from being used on devices like Android that
> heavily rely on swap.
>
> This patch introduces mTHP swap-in support. It starts from sync devices
> such as zRAM. This is probably the simplest and most common use case,
> benefiting billions of Android phones and similar devices with minimal
> implementation cost. In this straightforward scenario, large folios are
> always exclusive, eliminating the need to handle complex rmap and
> swapcache issues.
>
> It offers several benefits:
> 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
>    swap-out and swap-in. Large folios in the buddy system are also
>    preserved as much as possible, rather than being fragmented due
>    to swap-in.
>
> 2. Eliminates fragmentation in swap slots and supports successful
>    THP_SWPOUT.
>
>    w/o this patch (Refer to the data from Chris's and Kairui's latest
>    swap allocator optimization while running ./thp_swap_allocator_test
>    w/o "-a" option [1]):
>
>    ./thp_swap_allocator_test
>    Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
>    Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
>    Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
>    Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
>    Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
>    Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
>    Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
>    Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
>    Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
>    Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
>    Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
>    Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
>    Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
>    Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
>    Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
>    Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
>
>    w/ this patch (always 0%):
>    Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
>    ...
>
> 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
>    zsmalloc compression/decompression with larger granularity[2]. The upcoming
>    optimization in zsmalloc will significantly increase swap speed and improve
>    compression efficiency. Tested by running 100 iterations of swapping 100MiB
>    of anon memory, the swap speed improved dramatically:
>                time consumption of swapin(ms)   time consumption of swapout(ms)
>    lz4 4k                  45274                            90540
>    lz4 64k                 22942                            55667
>    zstdn 4k                85035                           186585
>    zstdn 64k               46558                           118533
>
>    The compression ratio also improved, as evaluated with 1 GiB of data:
>    granularity     orig_data_size    compr_data_size
>    4KiB-zstd       1048576000        246876055
>    64KiB-zstd      1048576000        199763892
>
>    Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
>    realized.
>
> 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
>    of nr_pages. Swapping in content filled with the same data 0x11, w/o
>    and w/ the patch for five rounds (Since the content is the same,
>    decompression will be very fast. This primarily assesses the impact of
>    reduced page faults):
>
>    swp in bandwidth(bytes/ms)    w/o         w/
>    round1                        624152      1127501
>    round2                        631672      1127501
>    round3                        620459      1139756
>    round4                        606113      1139756
>    round5                        624152      1152281
>    avg                           621310      1137359    +83%
>
> [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@xxxxxxxxxx/
> [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@xxxxxxxxx/
>
> Signed-off-by: Chuanhua Han <hanchuanhua@xxxxxxxx>
> Co-developed-by: Barry Song <v-songbaohua@xxxxxxxx>
> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> ---
>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 223 insertions(+), 27 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b9fe2f354878..7aa0358a4160 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3986,6 +3986,184 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  	return VM_FAULT_SIGBUS;
>  }
>
> +static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct folio *folio;
> +	swp_entry_t entry;
> +
> +	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
> +				vmf->address, false);
> +	if (!folio)
> +		return NULL;
> +
> +	entry = pte_to_swp_entry(vmf->orig_pte);
> +	if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +					   GFP_KERNEL, entry)) {
> +		folio_put(folio);
> +		return NULL;
> +	}
> +
> +	return folio;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Check if the PTEs within a range are contiguous swap entries
> + * and have no cache when check_no_cache is true.
> + */
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +			   int nr_pages, bool check_no_cache)
> +{
> +	struct swap_info_struct *si;
> +	unsigned long addr;
> +	swp_entry_t entry;
> +	pgoff_t offset;
> +	int idx, i;
> +	pte_t pte;
> +
> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +	idx = (vmf->address - addr) / PAGE_SIZE;
> +	pte = ptep_get(ptep);
> +
> +	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
> +		return false;
> +	entry = pte_to_swp_entry(pte);
> +	offset = swp_offset(entry);
> +	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> +		return false;
> +
> +	if (!check_no_cache)
> +		return true;
> +
> +	si = swp_swap_info(entry);
> +	/*
> +	 * While allocating a large folio and doing swap_read_folio, which is
> +	 * the case the being faulted pte doesn't have swapcache. We need to
> +	 * ensure all PTEs have no cache as well, otherwise, we might go to
> +	 * swap devices while the content is in swapcache.
> +	 */
> +	for (i = 0; i < nr_pages; i++) {
> +		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> +						     unsigned long addr,
> +						     unsigned long orders)
> +{
> +	int order, nr;
> +
> +	order = highest_order(orders);
> +
> +	/*
> +	 * To swap in a THP with nr pages, we require that its first swap_offset
> +	 * is aligned with that number, as it was when the THP was swapped out.
> +	 * This helps filter out most invalid entries.
> +	 */
> +	while (orders) {
> +		nr = 1 << order;
> +		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	return orders;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned long orders;
> +	struct folio *folio;
> +	unsigned long addr;
> +	swp_entry_t entry;
> +	spinlock_t *ptl;
> +	pte_t *pte;
> +	gfp_t gfp;
> +	int order;
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (unlikely(userfaultfd_armed(vma)))
> +		goto fallback;
> +
> +	/*
> +	 * A large swapped out folio could be partially or fully in zswap. We
> +	 * lack handling for such cases, so fallback to swapping in order-0
> +	 * folio.
> +	 */
> +	if (!zswap_never_enabled())
> +		goto fallback;
> +
> +	entry = pte_to_swp_entry(vmf->orig_pte);
> +	/*
> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +	 * and suitable for swapping THP.
> +	 */
> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +	orders = thp_swap_suitable_orders(swp_offset(entry),
> +					  vmf->address, orders);
> +
> +	if (!orders)
> +		goto fallback;
> +
> +	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> +				  vmf->address & PMD_MASK, &ptl);
> +	if (unlikely(!pte))
> +		goto fallback;
> +
> +	/*
> +	 * For do_swap_page, find the highest order where the aligned range is
> +	 * completely swap entries with contiguous swap offsets.
> +	 */
> +	order = highest_order(orders);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order, true))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	pte_unmap_unlock(pte, ptl);
> +
> +	/* Try allocating the highest of the remaining orders. */
> +	gfp = vma_thp_gfp_mask(vma);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +		if (folio) {
> +			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +							    gfp, entry))
> +				return folio;
> +			folio_put(folio);
> +		}
> +		order = next_order(&orders, order);
> +	}
> +
> +fallback:
> +	return __alloc_swap_folio(vmf);
> +}
> +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +				  int nr_pages, bool check_no_cache)
> +{
> +	return false;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	return __alloc_swap_folio(vmf);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4074,34 +4252,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!folio) {
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>  		    __swap_count(entry) == 1) {
> -			/*
> -			 * Prevent parallel swapin from proceeding with
> -			 * the cache flag. Otherwise, another thread may
> -			 * finish swapin first, free the entry, and swapout
> -			 * reusing the same entry. It's undetectable as
> -			 * pte_same() returns true due to entry reuse.
> -			 */
> -			if (swapcache_prepare(entry, 1)) {
> -				/* Relax a bit to prevent rapid repeated page faults */
> -				schedule_timeout_uninterruptible(1);
> -				goto out;
> -			}
> -			need_clear_cache = true;
> -
>  			/* skip swapcache */
> -			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -						vma, vmf->address, false);
> +			folio = alloc_swap_folio(vmf);
>  			if (folio) {
>  				__folio_set_locked(folio);
>  				__folio_set_swapbacked(folio);
>
> -				if (mem_cgroup_swapin_charge_folio(folio,
> -							vma->vm_mm, GFP_KERNEL,
> -							entry)) {
> -					ret = VM_FAULT_OOM;
> +				nr_pages = folio_nr_pages(folio);
> +				if (folio_test_large(folio))
> +					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> +				/*
> +				 * Prevent parallel swapin from proceeding with
> +				 * the cache flag. Otherwise, another thread
> +				 * may finish swapin first, free the entry, and
> +				 * swapout reusing the same entry. It's
> +				 * undetectable as pte_same() returns true due
> +				 * to entry reuse.
> +				 */
> +				if (swapcache_prepare(entry, nr_pages)) {
> +					/*
> +					 * Relax a bit to prevent rapid
> +					 * repeated page faults.
> +					 */
> +					schedule_timeout_uninterruptible(1);
>  					goto out_page;
>  				}
> -				mem_cgroup_swapin_uncharge_swap(entry, 1);
> +				need_clear_cache = true;
> +
> +				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>
>  				shadow = get_shadow_from_swap_cache(entry);
>  				if (shadow)
> @@ -4207,6 +4385,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		goto out_nomap;
>  	}
>
> +	/* allocated large folios for SWP_SYNCHRONOUS_IO */
> +	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> +		unsigned long nr = folio_nr_pages(folio);
> +		unsigned long folio_start = ALIGN_DOWN(vmf->address,
> +						       nr * PAGE_SIZE);
> +		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> +		pte_t *folio_ptep = vmf->pte - idx;
> +
> +		if (!can_swapin_thp(vmf, folio_ptep, nr, false))
> +			goto out_nomap;
> +
> +		page_idx = idx;
> +		address = folio_start;
> +		ptep = folio_ptep;
> +		goto check_folio;
> +	}
> +
>  	nr_pages = 1;
>  	page_idx = 0;
>  	address = vmf->address;
> @@ -4338,11 +4533,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		folio_add_lru_vma(folio, vma);
>  	} else if (!folio_test_anon(folio)) {
>  		/*
> -		 * We currently only expect small !anon folios, which are either
> -		 * fully exclusive or fully shared. If we ever get large folios
> -		 * here, we have to be careful.
> +		 * We currently only expect small !anon folios which are either
> +		 * fully exclusive or fully shared, or new allocated large
> +		 * folios which are fully exclusive. If we ever get large
> +		 * folios within swapcache here, we have to be careful.
>  		 */
> -		VM_WARN_ON_ONCE(folio_test_large(folio));
> +		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
>  		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>  		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>  	} else {
> @@ -4385,7 +4581,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  out:
>  	/* Clear the swap cache pin for direct swapin after PTL unlock */
>  	if (need_clear_cache)
> -		swapcache_clear(si, entry, 1);
> +		swapcache_clear(si, entry, nr_pages);
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> @@ -4401,7 +4597,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		folio_put(swapcache);
>  	}
>  	if (need_clear_cache)
> -		swapcache_clear(si, entry, 1);
> +		swapcache_clear(si, entry, nr_pages);
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> --
> 2.43.0
>

With the latest mm-unstable, I'm seeing the following WARN, followed by
user-space segfaults (multiple mTHP sizes enabled):

[ 39.145686] ------------[ cut here ]------------
[ 39.145969] WARNING: CPU: 24 PID: 11159 at mm/page_io.c:535 swap_read_folio+0x4db/0x520
[ 39.146307] Modules linked in:
[ 39.146507] CPU: 24 UID: 1000 PID: 11159 Comm: sh Kdump: loaded Not tainted 6.11.0-rc6.orig+ #131
[ 39.146887] Hardware name: Tencent Cloud CVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014
[ 39.147206] RIP: 0010:swap_read_folio+0x4db/0x520
[ 39.147430] Code: 00 e0 ff ff 09 c1 83 f8 08 0f 42 d1 e9 c4 fe ff ff 48 63 85 34 02 00 00 48 03 45 08 49 39 c4 0f 85 63 fe ff ff e9 db fe ff ff <0f> 0b e9 91 fd ff ff 31 d2 e9 9d fe ff ff 48 c7 c6 38 b6 4e 82 48
[ 39.148079] RSP: 0000:ffffc900045c3ce0 EFLAGS: 00010202
[ 39.148390] RAX: 0017ffffd0020061 RBX: ffffea00064d4c00 RCX: 03ffffffffffffff
[ 39.148737] RDX: ffffea00064d4c00 RSI: 0000000000000000 RDI: ffffea00064d4c00
[ 39.149102] RBP: 0000000000000001 R08: ffffea00064d4c00 R09: 0000000000000078
[ 39.149482] R10: 00000000000000f0 R11: 0000000000000004 R12: 0000000000001000
[ 39.149832] R13: ffff888102df5c00 R14: ffff888102df5c00 R15: 0000000000000003
[ 39.150177] FS: 00007f51a56c9540(0000) GS:ffff888fffc00000(0000) knlGS:0000000000000000
[ 39.150623] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 39.150950] CR2: 000055627b13fda0 CR3: 00000001083e2000 CR4: 00000000003506b0
[ 39.151317] Call Trace:
[ 39.151565]  <TASK>
[ 39.151778]  ? __warn+0x84/0x130
[ 39.152044]  ? swap_read_folio+0x4db/0x520
[ 39.152345]  ? report_bug+0xfc/0x1e0
[ 39.152614]  ? handle_bug+0x3f/0x70
[ 39.152891]  ? exc_invalid_op+0x17/0x70
[ 39.153178]  ? asm_exc_invalid_op+0x1a/0x20
[ 39.153467]  ? swap_read_folio+0x4db/0x520
[ 39.153753]  do_swap_page+0xc6d/0x14f0
[ 39.154054]  ? srso_return_thunk+0x5/0x5f
[ 39.154361]  __handle_mm_fault+0x758/0x850
[ 39.154645]  handle_mm_fault+0x134/0x340
[ 39.154945]  do_user_addr_fault+0x2e5/0x760
[ 39.155245]  exc_page_fault+0x6a/0x140
[ 39.155546]  asm_exc_page_fault+0x26/0x30
[ 39.155847] RIP: 0033:0x55627b071446
[ 39.156124] Code: f6 7e 19 83 e3 01 74 14 41 83 ee 01 44 89 35 25 72 0c 00 45 85 ed 0f 88 73 02 00 00 8b 05 ea 74 0c 00 85 c0 0f 85 da 03 00 00 <44> 8b 15 53 e9 0c 00 45 85 d2 74 2e 44 8b 0d 37 e3 0c 00 45 85 c9
[ 39.156944] RSP: 002b:00007ffd619d54f0 EFLAGS: 00010246
[ 39.157237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f51a44f968b
[ 39.157594] RDX: 0000000000000000 RSI: 00007ffd619d5518 RDI: 00000000ffffffff
[ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
[ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
[ 39.158998]  </TASK>
[ 39.159226] ---[ end trace 0000000000000000 ]---

After reverting either this patch or Usama's "mm: store zero pages to be
swapped out in a bitmap", the problem is gone, so I think these two
patches may have a conflict that needs to be resolved.