Re: [PATCH v7 2/2] mm: support large folios swap-in for sync io devices

Hi All,


On Wed, Aug 21, 2024 at 3:46 PM <hanchuanhua@xxxxxxxx> wrote:
>
> From: Chuanhua Han <hanchuanhua@xxxxxxxx>
>
> Currently, we have mTHP features, but unfortunately, without support for
> large folio swap-in, once these large folios are swapped out they are
> lost, because mTHP swap is a one-way process. This lack of swap-in
> functionality prevents mTHP from being used on devices like Android that
> rely heavily on swap.
>
> This patch introduces mTHP swap-in support, starting with sync devices
> such as zRAM. This is probably the simplest and most common use case,
> benefiting billions of Android phones and similar devices with minimal
> implementation cost. In this straightforward scenario, large folios are
> always exclusive, eliminating the need to handle complex rmap and
> swapcache issues.
>
> It offers several benefits:
> 1. Enables bidirectional mTHP swapping, so an mTHP that has been swapped
>    out can be brought back in as an mTHP. Large folios in the buddy
>    system are also preserved as much as possible, rather than being
>    fragmented by swap-in.
>
> 2. Eliminates fragmentation in swap slots and supports successful
>    THP_SWPOUT.
>
>    w/o this patch (measured on top of Chris's and Kairui's latest swap
>    allocator optimization [1], running ./thp_swap_allocator_test w/o
>    the "-a" option):
>
>    ./thp_swap_allocator_test
>    Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
>    Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
>    Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
>    Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
>    Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
>    Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
>    Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
>    Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
>    Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
>    Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
>    Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
>    Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
>    Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
>    Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
>    Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
>    Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
>
>    w/ this patch (always 0%):
>    Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
>    ...
>
> 3. With both mTHP swap-out and swap-in supported, we offer the option to
>    enable zsmalloc compression/decompression with larger granularity [2].
>    The upcoming optimization in zsmalloc will significantly increase swap
>    speed and improve compression efficiency. In a test of 100 iterations
>    of swapping 100 MiB of anon memory, the swap speed improved
>    dramatically:
>                 time consumption of swapin(ms)   time consumption of swapout(ms)
>      lz4 4k                  45274                    90540
>      lz4 64k                 22942                    55667
>      zstdn 4k                85035                    186585
>      zstdn 64k               46558                    118533
>
>     The compression ratio also improved, as evaluated with 1 GiB of data:
>      granularity   orig_data_size   compr_data_size
>      4KiB-zstd      1048576000       246876055
>      64KiB-zstd     1048576000       199763892
>
>    Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
>    realized.
>
> 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
>    of nr_pages. Swapping in content filled with the same byte 0x11, w/o
>    and w/ the patch, for five rounds (since the content is identical,
>    decompression is very fast; this primarily measures the impact of the
>    reduced page-fault count):
>
>   swp in bandwidth(bytes/ms)    w/o              w/
>    round1                     624152          1127501
>    round2                     631672          1127501
>    round3                     620459          1139756
>    round4                     606113          1139756
>    round5                     624152          1152281
>    avg                        621310          1137359      +83%
>
> [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@xxxxxxxxxx/
> [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@xxxxxxxxx/
>
> Signed-off-by: Chuanhua Han <hanchuanhua@xxxxxxxx>
> Co-developed-by: Barry Song <v-songbaohua@xxxxxxxx>
> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> ---
>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 223 insertions(+), 27 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b9fe2f354878..7aa0358a4160 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3986,6 +3986,184 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>         return VM_FAULT_SIGBUS;
>  }
>
> +static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +       struct folio *folio;
> +       swp_entry_t entry;
> +
> +       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
> +                               vmf->address, false);
> +       if (!folio)
> +               return NULL;
> +
> +       entry = pte_to_swp_entry(vmf->orig_pte);
> +       if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +                                          GFP_KERNEL, entry)) {
> +               folio_put(folio);
> +               return NULL;
> +       }
> +
> +       return folio;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Check if the PTEs within a range are contiguous swap entries and,
> + * when check_no_cache is true, that none of them is in the swapcache.
> + */
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +                          int nr_pages, bool check_no_cache)
> +{
> +       struct swap_info_struct *si;
> +       unsigned long addr;
> +       swp_entry_t entry;
> +       pgoff_t offset;
> +       int idx, i;
> +       pte_t pte;
> +
> +       addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +       idx = (vmf->address - addr) / PAGE_SIZE;
> +       pte = ptep_get(ptep);
> +
> +       if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
> +               return false;
> +       entry = pte_to_swp_entry(pte);
> +       offset = swp_offset(entry);
> +       if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> +               return false;
> +
> +       if (!check_no_cache)
> +               return true;
> +
> +       si = swp_swap_info(entry);
> +       /*
> +        * We are allocating a large folio and reading it in directly with
> +        * swap_read_folio(), i.e. the faulting PTE has no swapcache entry.
> +        * We need to ensure none of the other PTEs has one either;
> +        * otherwise, we might read from the swap device while the content
> +        * is actually in the swapcache.
> +        */
> +       for (i = 0; i < nr_pages; i++) {
> +               if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +
> +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> +                                                    unsigned long addr,
> +                                                    unsigned long orders)
> +{
> +       int order, nr;
> +
> +       order = highest_order(orders);
> +
> +       /*
> +        * To swap in a THP with nr pages, we require that its first swap_offset
> +        * is aligned with that number, as it was when the THP was swapped out.
> +        * This helps filter out most invalid entries.
> +        */
> +       while (orders) {
> +               nr = 1 << order;
> +               if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
> +                       break;
> +               order = next_order(&orders, order);
> +       }
> +
> +       return orders;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +       unsigned long orders;
> +       struct folio *folio;
> +       unsigned long addr;
> +       swp_entry_t entry;
> +       spinlock_t *ptl;
> +       pte_t *pte;
> +       gfp_t gfp;
> +       int order;
> +
> +       /*
> +        * If uffd is active for the vma we need per-page fault fidelity to
> +        * maintain the uffd semantics.
> +        */
> +       if (unlikely(userfaultfd_armed(vma)))
> +               goto fallback;
> +
> +       /*
> +        * A large swapped out folio could be partially or fully in zswap. We
> +        * lack handling for such cases, so fallback to swapping in order-0
> +        * folio.
> +        */
> +       if (!zswap_never_enabled())
> +               goto fallback;
> +
> +       entry = pte_to_swp_entry(vmf->orig_pte);
> +       /*
> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +        * and suitable for swapping THP.
> +        */
> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +       orders = thp_swap_suitable_orders(swp_offset(entry),
> +                                         vmf->address, orders);
> +
> +       if (!orders)
> +               goto fallback;
> +
> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> +                                 vmf->address & PMD_MASK, &ptl);
> +       if (unlikely(!pte))
> +               goto fallback;
> +
> +       /*
> +        * For do_swap_page, find the highest order where the aligned range is
> +        * completely swap entries with contiguous swap offsets.
> +        */
> +       order = highest_order(orders);
> +       while (orders) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order, true))
> +                       break;
> +               order = next_order(&orders, order);
> +       }
> +
> +       pte_unmap_unlock(pte, ptl);
> +
> +       /* Try allocating the highest of the remaining orders. */
> +       gfp = vma_thp_gfp_mask(vma);
> +       while (orders) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +               if (folio) {
> +                       if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +                                                           gfp, entry))
> +                               return folio;
> +                       folio_put(folio);
> +               }
> +               order = next_order(&orders, order);
> +       }
> +
> +fallback:
> +       return __alloc_swap_folio(vmf);
> +}
> +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +                                 int nr_pages, bool check_no_cache)
> +{
> +       return false;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +       return __alloc_swap_folio(vmf);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4074,34 +4252,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
> -                       /*
> -                        * Prevent parallel swapin from proceeding with
> -                        * the cache flag. Otherwise, another thread may
> -                        * finish swapin first, free the entry, and swapout
> -                        * reusing the same entry. It's undetectable as
> -                        * pte_same() returns true due to entry reuse.
> -                        */
> -                       if (swapcache_prepare(entry, 1)) {
> -                               /* Relax a bit to prevent rapid repeated page faults */
> -                               schedule_timeout_uninterruptible(1);
> -                               goto out;
> -                       }
> -                       need_clear_cache = true;
> -
>                         /* skip swapcache */
> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -                                               vma, vmf->address, false);
> +                       folio = alloc_swap_folio(vmf);
>                         if (folio) {
>                                 __folio_set_locked(folio);
>                                 __folio_set_swapbacked(folio);
>
> -                               if (mem_cgroup_swapin_charge_folio(folio,
> -                                                       vma->vm_mm, GFP_KERNEL,
> -                                                       entry)) {
> -                                       ret = VM_FAULT_OOM;
> +                               nr_pages = folio_nr_pages(folio);
> +                               if (folio_test_large(folio))
> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
> +                               /*
> +                                * Prevent parallel swapin from proceeding with
> +                                * the cache flag. Otherwise, another thread
> +                                * may finish swapin first, free the entry, and
> +                                * swapout reusing the same entry. It's
> +                                * undetectable as pte_same() returns true due
> +                                * to entry reuse.
> +                                */
> +                               if (swapcache_prepare(entry, nr_pages)) {
> +                                       /*
> +                                        * Relax a bit to prevent rapid
> +                                        * repeated page faults.
> +                                        */
> +                                       schedule_timeout_uninterruptible(1);
>                                         goto out_page;
>                                 }
> -                               mem_cgroup_swapin_uncharge_swap(entry, 1);
> +                               need_clear_cache = true;
> +
> +                               mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>
>                                 shadow = get_shadow_from_swap_cache(entry);
>                                 if (shadow)
> @@ -4207,6 +4385,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 goto out_nomap;
>         }
>
> +       /* allocated large folios for SWP_SYNCHRONOUS_IO */
> +       if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> +               unsigned long nr = folio_nr_pages(folio);
> +               unsigned long folio_start = ALIGN_DOWN(vmf->address,
> +                                                      nr * PAGE_SIZE);
> +               unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> +               pte_t *folio_ptep = vmf->pte - idx;
> +
> +               if (!can_swapin_thp(vmf, folio_ptep, nr, false))
> +                       goto out_nomap;
> +
> +               page_idx = idx;
> +               address = folio_start;
> +               ptep = folio_ptep;
> +               goto check_folio;
> +       }
> +
>         nr_pages = 1;
>         page_idx = 0;
>         address = vmf->address;
> @@ -4338,11 +4533,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 folio_add_lru_vma(folio, vma);
>         } else if (!folio_test_anon(folio)) {
>                 /*
> -                * We currently only expect small !anon folios, which are either
> -                * fully exclusive or fully shared. If we ever get large folios
> -                * here, we have to be careful.
> +                * We currently only expect small !anon folios, which are either
> +                * fully exclusive or fully shared, or newly allocated large
> +                * folios, which are fully exclusive. If we ever get large
> +                * folios within swapcache here, we have to be careful.
>                  */
> -               VM_WARN_ON_ONCE(folio_test_large(folio));
> +               VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
>                 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>                 folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>         } else {
> @@ -4385,7 +4581,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  out:
>         /* Clear the swap cache pin for direct swapin after PTL unlock */
>         if (need_clear_cache)
> -               swapcache_clear(si, entry, 1);
> +               swapcache_clear(si, entry, nr_pages);
>         if (si)
>                 put_swap_device(si);
>         return ret;
> @@ -4401,7 +4597,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 folio_put(swapcache);
>         }
>         if (need_clear_cache)
> -               swapcache_clear(si, entry, 1);
> +               swapcache_clear(si, entry, nr_pages);
>         if (si)
>                 put_swap_device(si);
>         return ret;
> --
> 2.43.0
>
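
(Aside, for anyone skimming the quoted patch: the filter in
thp_swap_suitable_orders() boils down to a modulo comparison between the
faulting virtual page number and the swap offset, relying on the fact
that a large folio's first swap slot was naturally aligned to its page
count when it was swapped out. Below is a stand-alone user-space sketch
of that check with made-up values -- not kernel code:

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12 /* assumes 4 KiB base pages, as on x86_64 */

    /* mirrors the per-order check in thp_swap_suitable_orders() */
    static bool order_suitable(unsigned long addr, unsigned long swp_offset,
                               int order)
    {
            unsigned long nr = 1UL << order;

            return ((addr >> PAGE_SHIFT) % nr) == (swp_offset % nr);
    }

    int main(void)
    {
            /* hypothetical fault address and swap offset */
            unsigned long addr = 0x7f1234560000UL;
            unsigned long offset = 0x30;
            int order;

            for (order = 4; order >= 0; order--)
                    printf("order %d: %s\n", order,
                           order_suitable(addr, offset, order) ?
                           "candidate" : "rejected");
            return 0;
    }

Orders that fail the check are dropped from the mask; alloc_swap_folio()
then tries to allocate the highest order that passes and ultimately falls
back to an order-0 folio.)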

With the latest mm-unstable, I'm seeing the following WARN followed by
user-space segfaults (with multiple mTHP sizes enabled):

[   39.145686] ------------[ cut here ]------------
[   39.145969] WARNING: CPU: 24 PID: 11159 at mm/page_io.c:535
swap_read_folio+0x4db/0x520
[   39.146307] Modules linked in:
[   39.146507] CPU: 24 UID: 1000 PID: 11159 Comm: sh Kdump: loaded Not
tainted 6.11.0-rc6.orig+ #131
[   39.146887] Hardware name: Tencent Cloud CVM, BIOS
seabios-1.9.1-qemu-project.org 04/01/2014
[   39.147206] RIP: 0010:swap_read_folio+0x4db/0x520
[   39.147430] Code: 00 e0 ff ff 09 c1 83 f8 08 0f 42 d1 e9 c4 fe ff
ff 48 63 85 34 02 00 00 48 03 45 08 49 39 c4 0f 85 63 fe ff ff e9 db
fe ff ff <0f> 0b e9 91 fd ff ff 31 d2 e9 9d fe ff ff 48 c7 c6 38 b6 4e
82 48
[   39.148079] RSP: 0000:ffffc900045c3ce0 EFLAGS: 00010202
[   39.148390] RAX: 0017ffffd0020061 RBX: ffffea00064d4c00 RCX: 03ffffffffffffff
[   39.148737] RDX: ffffea00064d4c00 RSI: 0000000000000000 RDI: ffffea00064d4c00
[   39.149102] RBP: 0000000000000001 R08: ffffea00064d4c00 R09: 0000000000000078
[   39.149482] R10: 00000000000000f0 R11: 0000000000000004 R12: 0000000000001000
[   39.149832] R13: ffff888102df5c00 R14: ffff888102df5c00 R15: 0000000000000003
[   39.150177] FS:  00007f51a56c9540(0000) GS:ffff888fffc00000(0000)
knlGS:0000000000000000
[   39.150623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   39.150950] CR2: 000055627b13fda0 CR3: 00000001083e2000 CR4: 00000000003506b0
[   39.151317] Call Trace:
[   39.151565]  <TASK>
[   39.151778]  ? __warn+0x84/0x130
[   39.152044]  ? swap_read_folio+0x4db/0x520
[   39.152345]  ? report_bug+0xfc/0x1e0
[   39.152614]  ? handle_bug+0x3f/0x70
[   39.152891]  ? exc_invalid_op+0x17/0x70
[   39.153178]  ? asm_exc_invalid_op+0x1a/0x20
[   39.153467]  ? swap_read_folio+0x4db/0x520
[   39.153753]  do_swap_page+0xc6d/0x14f0
[   39.154054]  ? srso_return_thunk+0x5/0x5f
[   39.154361]  __handle_mm_fault+0x758/0x850
[   39.154645]  handle_mm_fault+0x134/0x340
[   39.154945]  do_user_addr_fault+0x2e5/0x760
[   39.155245]  exc_page_fault+0x6a/0x140
[   39.155546]  asm_exc_page_fault+0x26/0x30
[   39.155847] RIP: 0033:0x55627b071446
[   39.156124] Code: f6 7e 19 83 e3 01 74 14 41 83 ee 01 44 89 35 25
72 0c 00 45 85 ed 0f 88 73 02 00 00 8b 05 ea 74 0c 00 85 c0 0f 85 da
03 00 00 <44> 8b 15 53 e9 0c 00 45 85 d2 74 2e 44 8b 0d 37 e3 0c 00 45
85 c9
[   39.156944] RSP: 002b:00007ffd619d54f0 EFLAGS: 00010246
[   39.157237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f51a44f968b
[   39.157594] RDX: 0000000000000000 RSI: 00007ffd619d5518 RDI: 00000000ffffffff
[   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
[   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
[   39.158998]  </TASK>
[   39.159226] ---[ end trace 0000000000000000 ]---

After reverting either this patch or Usama's "mm: store zero pages to be
swapped out in a bitmap", the problem is gone. I think these two patches
may conflict in a way that needs to be resolved.




