On Sat, Nov 23, 2024 at 3:54 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 21/11/2024 22:25, Barry Song wrote:
> > From: Barry Song <v-songbaohua@xxxxxxxx>
> >
> > The swapfile can compress/decompress at 4 * PAGES granularity, reducing
> > CPU usage and improving the compression ratio. However, if allocating an
> > mTHP fails and we fall back to a single small folio, the entire large
> > block must still be decompressed. This results in a 16 KiB area requiring
> > 4 page faults, where each fault decompresses 16 KiB but retrieves only
> > 4 KiB of data from the block. To address this inefficiency, we instead
> > fall back to 4 small folios, ensuring that each decompression occurs
> > only once.
> >
> > Allowing swap_read_folio() to decompress and read into an array of
> > 4 folios would be extremely complex, requiring extensive changes
> > throughout the stack, including swap_read_folio, zeromap,
> > zswap, and final swap implementations like zRAM. In contrast,
> > having these components fill a large folio with 4 subpages is much
> > simpler.
> >
> > To avoid a full-stack modification, we introduce a per-CPU order-2
> > large folio as a buffer. This buffer is used for swap_read_folio(),
> > after which the data is copied into the 4 small folios. Finally, in
> > do_swap_page(), all these small folios are mapped.
> >
> > Co-developed-by: Chuanhua Han <chuanhuahan@xxxxxxxxx>
> > Signed-off-by: Chuanhua Han <chuanhuahan@xxxxxxxxx>
> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > ---
> >  mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 192 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 209885a4134f..e551570c1425 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> >  	return folio;
> >  }
> >
> > +#define BATCH_SWPIN_ORDER	2
>
> Hi Barry,
>
> Thanks for the series and the numbers in the cover letter.
>
> Just a few things.
>
> Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?

Technically, yes. I'm also considering removing ZSMALLOC_MULTI_PAGES_ORDER
and always setting it to 2, which is the minimum anonymous mTHP order. The
main reason is that it may be difficult for users to select an appropriate
value for that Kconfig option. On the other hand, 16KB already captures most
of the benefit zstd gets from compressing and decompressing larger blocks:
going from 16KB to 32KB or 64KB still helps, but the improvement is not as
significant as the jump from 4KB to 16KB. For example, using zstd to
compress and decompress the 'Beyond Compare' software package:

root@barry-desktop:~# ./zstd
File size: 182502912 bytes
4KB Block: Compression time = 0.765915 seconds, Decompression time = 0.203366 seconds
Original size: 182502912 bytes
Compressed size: 66089193 bytes
Compression ratio: 36.21%
16KB Block: Compression time = 0.558595 seconds, Decompression time = 0.153837 seconds
Original size: 182502912 bytes
Compressed size: 59159073 bytes
Compression ratio: 32.42%
32KB Block: Compression time = 0.538106 seconds, Decompression time = 0.137768 seconds
Original size: 182502912 bytes
Compressed size: 57958701 bytes
Compression ratio: 31.76%
64KB Block: Compression time = 0.532212 seconds, Decompression time = 0.127592 seconds
Original size: 182502912 bytes
Compressed size: 56700795 bytes
Compression ratio: 31.07%

In that case, would we no longer need to rely on ZSMALLOC_MULTI_PAGES_ORDER?
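By the way, in case it helps to reproduce the block-size comparison above,
below is a rough sketch of the kind of per-block benchmark I mean. It is
not the exact ./zstd tool used above: it only sums compressed sizes and
skips the timing, the "testfile" input path is just a placeholder, and it
assumes zstd's single-shot ZSTD_compress() API with the default level 3.
Compressing each block independently is roughly what zram does per
multi-page block:

/* build with: gcc -O2 block_bench.c -lzstd */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static void bench_block_size(const unsigned char *src, size_t file_size,
			     size_t block_size, int level)
{
	size_t bound = ZSTD_compressBound(block_size);
	void *dst = malloc(bound);
	size_t off, total = 0;

	/* compress each fixed-size block independently */
	for (off = 0; off < file_size; off += block_size) {
		size_t n = file_size - off < block_size ?
			   file_size - off : block_size;
		size_t c = ZSTD_compress(dst, bound, src + off, n, level);

		if (ZSTD_isError(c)) {
			fprintf(stderr, "compress error: %s\n",
				ZSTD_getErrorName(c));
			break;
		}
		total += c;
	}
	printf("%zuKB Block: %zu -> %zu bytes, ratio %.2f%%\n",
	       block_size / 1024, file_size, total,
	       100.0 * total / file_size);
	free(dst);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile"; /* placeholder */
	size_t sizes[] = { 4096, 16384, 32768, 65536 };
	unsigned char *buf;
	long file_size;
	FILE *f = fopen(path, "rb");
	unsigned int i;

	if (!f)
		return 1;
	fseek(f, 0, SEEK_END);
	file_size = ftell(f);
	rewind(f);
	buf = malloc(file_size);
	if (!buf || fread(buf, 1, file_size, f) != (size_t)file_size)
		return 1;
	fclose(f);

	printf("File size: %ld bytes\n", file_size);
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		bench_block_size(buf, file_size, sizes[i], 3);
	free(buf);
	return 0;
}

This obviously ignores zsmalloc overhead and per-object metadata, so the
absolute numbers will differ from zram, but the relative trend across
block sizes should match.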
>
> Did you check the performance difference with and without patch 4?

I retested after reverting patch 4, and the sys time increased to over 40
minutes again, though it was still slightly better than without the entire
series.

*** Executing round 1 ***
real	7m49.342s
user	80m53.675s
sys	42m28.393s
pswpin: 29965548
pswpout: 51127359
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11347712
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6641230
pgpgin: 147376000
pgpgout: 213343124

*** Executing round 2 ***
real	7m41.331s
user	81m16.631s
sys	41m39.845s
pswpin: 29208867
pswpout: 50006026
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11104912
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6483827
pgpgin: 144057340
pgpgout: 208887688

*** Executing round 3 ***
real	7m47.280s
user	78m36.767s
sys	37m32.210s
pswpin: 26426526
pswpout: 45420734
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10104304
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 5884839
pgpgin: 132013648
pgpgout: 190537264

*** Executing round 4 ***
real	7m56.723s
user	80m36.837s
sys	41m35.979s
pswpin: 29367639
pswpout: 50059254
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11116176
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6514064
pgpgin: 144593828
pgpgout: 209080468

*** Executing round 5 ***
real	7m53.806s
user	80m30.953s
sys	40m14.870s
pswpin: 28091760
pswpout: 48495748
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10779720
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6244819
pgpgin: 138813124
pgpgout: 202885480

I suspect this is due to the large number of partial reads (about 10%,
3505537/35159852):

root@barry-desktop:~# cat /sys/block/zram0/multi_pages_debug_stat
zram_bio write/read multi_pages count:54452828 35159852
zram_bio failed write/read multi_pages count 0 0
zram_bio partial write/read multi_pages count 4 3505537
multi_pages_miss_free 0

This workload doesn't cause fragmentation in the buddy allocator, so the
fallbacks are likely due to MEMCG_CHARGE failures.

>
> I know that it wont help if you have a lot of unmovable pages
> scattered everywhere, but were you able to compare the performance
> of defrag=always vs patch 4? I feel like if you have space for 4 folios
> then hopefully compaction should be able to do its job and you can
> directly fill the large folio if the unmovable pages are better placed.
> Johannes' series on preventing type mixing [1] would help.
>
> [1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@xxxxxxxxxxx/

I believe this could help, but defragmentation is a complex issue,
especially on phones, where various components such as drivers, DMA-BUF,
multimedia, and graphics allocate memory. We observed that a fresh system
could initially provide mTHP, but after a few hours, obtaining mTHP became
very challenging. I'm happy to arrange a test of Johannes' series on phones
(though it is sometimes quite hard to backport to the Android kernel) to
see if it brings any improvements.
>
> Thanks,
> Usama

> > +#define BATCH_SWPIN_COUNT	(1 << BATCH_SWPIN_ORDER)
> > +#define BATCH_SWPIN_SIZE	(PAGE_SIZE << BATCH_SWPIN_ORDER)
> > +
> > +struct batch_swpin_buffer {
> > +	struct folio *folio;
> > +	struct mutex mutex;
> > +};
> > +
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> >  {
> > @@ -4120,7 +4129,101 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> >  	return orders;
> >  }
> >
> > -static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +static DEFINE_PER_CPU(struct batch_swpin_buffer, swp_buf);
> > +
> > +static int __init batch_swpin_buffer_init(void)
> > +{
> > +	int ret, cpu;
> > +	struct batch_swpin_buffer *buf;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		buf = per_cpu_ptr(&swp_buf, cpu);
> > +		buf->folio = (struct folio *)alloc_pages_node(cpu_to_node(cpu),
> > +				GFP_KERNEL | __GFP_COMP, BATCH_SWPIN_ORDER);
> > +		if (!buf->folio) {
> > +			ret = -ENOMEM;
> > +			goto err;
> > +		}
> > +		mutex_init(&buf->mutex);
> > +	}
> > +	return 0;
> > +
> > +err:
> > +	for_each_possible_cpu(cpu) {
> > +		buf = per_cpu_ptr(&swp_buf, cpu);
> > +		if (buf->folio) {
> > +			folio_put(buf->folio);
> > +			buf->folio = NULL;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +core_initcall(batch_swpin_buffer_init);
> > +
> > +static struct folio *alloc_batched_swap_folios(struct vm_fault *vmf,
> > +		struct batch_swpin_buffer **buf, struct folio **folios,
> > +		swp_entry_t entry)
> > +{
> > +	unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
> > +	struct batch_swpin_buffer *sbuf = raw_cpu_ptr(&swp_buf);
> > +	struct folio *folio = sbuf->folio;
> > +	unsigned long addr;
> > +	int i;
> > +
> > +	if (unlikely(!folio))
> > +		return NULL;
> > +
> > +	for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
> > +		addr = haddr + i * PAGE_SIZE;
> > +		folios[i] = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma, addr);
> > +		if (!folios[i])
> > +			goto err;
> > +		if (mem_cgroup_swapin_charge_folio(folios[i], vmf->vma->vm_mm,
> > +					GFP_KERNEL, entry))
> > +			goto err;
> > +	}
> > +
> > +	mutex_lock(&sbuf->mutex);
> > +	*buf = sbuf;
> > +#ifdef CONFIG_MEMCG
> > +	folio->memcg_data = (*folios)->memcg_data;
> > +#endif
> > +	return folio;
> > +
> > +err:
> > +	for (i--; i >= 0; i--)
> > +		folio_put(folios[i]);
> > +	return NULL;
> > +}
> > +
> > +static void fill_batched_swap_folios(struct vm_fault *vmf,
> > +		void *shadow, struct batch_swpin_buffer *buf,
> > +		struct folio *folio, struct folio **folios)
> > +{
> > +	unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
> > +	unsigned long addr;
> > +	int i;
> > +
> > +	for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
> > +		addr = haddr + i * PAGE_SIZE;
> > +		__folio_set_locked(folios[i]);
> > +		__folio_set_swapbacked(folios[i]);
> > +		if (shadow)
> > +			workingset_refault(folios[i], shadow);
> > +		folio_add_lru(folios[i]);
> > +		copy_user_highpage(&folios[i]->page, folio_page(folio, i),
> > +				addr, vmf->vma);
> > +		if (folio_test_uptodate(folio))
> > +			folio_mark_uptodate(folios[i]);
> > +	}
> > +
> > +	folio->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~(1UL << PG_head));
> > +	mutex_unlock(&buf->mutex);
> > +}
> > +
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf,
> > +		struct batch_swpin_buffer **buf,
> > +		struct folio **folios)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	unsigned long orders;
> > @@ -4180,6 +4283,9 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >
> >  	pte_unmap_unlock(pte, ptl);
> >
> > +	if (!orders)
> > +		goto fallback;
> > +
> >  	/* Try allocating the highest of the remaining orders. */
> >  	gfp = vma_thp_gfp_mask(vma);
> >  	while (orders) {
> > @@ -4194,14 +4300,29 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  		order = next_order(&orders, order);
> >  	}
> >
> > +	/*
> > +	 * During swap-out, a THP might have been compressed into multiple
> > +	 * order-2 blocks to optimize CPU usage and compression ratio.
> > +	 * Attempt to batch swap-in 4 smaller folios to ensure they are
> > +	 * decompressed together as a single unit only once.
> > +	 */
> > +	return alloc_batched_swap_folios(vmf, buf, folios, entry);
> > +
> >  fallback:
> >  	return __alloc_swap_folio(vmf);
> >  }
> >  #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> > -static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf,
> > +		struct batch_swpin_buffer **buf,
> > +		struct folio **folios)
> >  {
> >  	return __alloc_swap_folio(vmf);
> >  }
> > +static inline void fill_batched_swap_folios(struct vm_fault *vmf,
> > +		void *shadow, struct batch_swpin_buffer *buf,
> > +		struct folio *folio, struct folio **folios)
> > +{
> > +}
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> >  static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > @@ -4216,6 +4337,8 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> >   */
> >  vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> > +	struct folio *folios[BATCH_SWPIN_COUNT] = { NULL };
> > +	struct batch_swpin_buffer *buf = NULL;
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct folio *swapcache, *folio = NULL;
> >  	DECLARE_WAITQUEUE(wait, current);
> > @@ -4228,7 +4351,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	pte_t pte;
> >  	vm_fault_t ret = 0;
> >  	void *shadow = NULL;
> > -	int nr_pages;
> > +	int nr_pages, i;
> >  	unsigned long page_idx;
> >  	unsigned long address;
> >  	pte_t *ptep;
> > @@ -4296,7 +4419,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >  	    __swap_count(entry) == 1) {
> >  		/* skip swapcache */
> > -		folio = alloc_swap_folio(vmf);
> > +		folio = alloc_swap_folio(vmf, &buf, folios);
> >  		if (folio) {
> >  			__folio_set_locked(folio);
> >  			__folio_set_swapbacked(folio);
> > @@ -4327,10 +4450,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  			mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> >
> >  			shadow = get_shadow_from_swap_cache(entry);
> > -			if (shadow)
> > +			if (shadow && !buf)
> >  				workingset_refault(folio, shadow);
> > -
> > -			folio_add_lru(folio);
> > +			if (!buf)
> > +				folio_add_lru(folio);
> >
> >  			/* To provide entry to swap_read_folio() */
> >  			folio->swap = entry;
> > @@ -4361,6 +4484,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  			count_vm_event(PGMAJFAULT);
> >  			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> >  			page = folio_file_page(folio, swp_offset(entry));
> > +			/*
> > +			 * Copy data into batched small folios from the large
> > +			 * folio buffer
> > +			 */
> > +			if (buf) {
> > +				fill_batched_swap_folios(vmf, shadow, buf, folio, folios);
> > +				folio = folios[0];
> > +				page = &folios[0]->page;
> > +				goto do_map;
> > +			}
> >  	} else if (PageHWPoison(page)) {
> >  		/*
> >  		 * hwpoisoned dirty swapcache pages are kept for killing
> > @@ -4415,6 +4548,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		lru_add_drain();
> >  	}
> >
> > +do_map:
> >  	folio_throttle_swaprate(folio, GFP_KERNEL);
> >
> >  	/*
> > @@ -4431,8 +4565,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	}
> >
> >  	/* allocated large folios for SWP_SYNCHRONOUS_IO */
> > -	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> > -		unsigned long nr = folio_nr_pages(folio);
> > +	if ((folio_test_large(folio) || buf) && !folio_test_swapcache(folio)) {
> > +		unsigned long nr = buf ? BATCH_SWPIN_COUNT : folio_nr_pages(folio);
> >  		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> >  		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> >  		pte_t *folio_ptep = vmf->pte - idx;
> > @@ -4527,6 +4661,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		}
> >  	}
> >
> > +	/* Batched mapping of allocated small folios for SWP_SYNCHRONOUS_IO */
> > +	if (buf) {
> > +		for (i = 0; i < nr_pages; i++)
> > +			arch_swap_restore(swp_entry(swp_type(entry),
> > +					swp_offset(entry) + i), folios[i]);
> > +		swap_free_nr(entry, nr_pages);
> > +		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +		add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +		rmap_flags |= RMAP_EXCLUSIVE;
> > +		for (i = 0; i < nr_pages; i++) {
> > +			unsigned long addr = address + i * PAGE_SIZE;
> > +
> > +			pte = mk_pte(&folios[i]->page, vma->vm_page_prot);
> > +			if (pte_swp_soft_dirty(vmf->orig_pte))
> > +				pte = pte_mksoft_dirty(pte);
> > +			if (pte_swp_uffd_wp(vmf->orig_pte))
> > +				pte = pte_mkuffd_wp(pte);
> > +			if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
> > +			    !pte_needs_soft_dirty_wp(vma, pte)) {
> > +				pte = pte_mkwrite(pte, vma);
> > +				if ((vmf->flags & FAULT_FLAG_WRITE) && (i == page_idx)) {
> > +					pte = pte_mkdirty(pte);
> > +					vmf->flags &= ~FAULT_FLAG_WRITE;
> > +				}
> > +			}
> > +			flush_icache_page(vma, &folios[i]->page);
> > +			folio_add_new_anon_rmap(folios[i], vma, addr, rmap_flags);
> > +			set_pte_at(vma->vm_mm, addr, ptep + i, pte);
> > +			arch_do_swap_page_nr(vma->vm_mm, vma, addr, pte, pte, 1);
> > +			if (i == page_idx)
> > +				vmf->orig_pte = pte;
> > +			folio_unlock(folios[i]);
> > +		}
> > +		goto wp_page;
> > +	}
> > +
> >  	/*
> >  	 * Some architectures may have to restore extra metadata to the page
> >  	 * when reading from swap. This metadata may be indexed by swap entry
> > @@ -4612,6 +4782,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		folio_put(swapcache);
> >  	}
> >
> > +wp_page:
> >  	if (vmf->flags & FAULT_FLAG_WRITE) {
> >  		ret |= do_wp_page(vmf);
> >  		if (ret & VM_FAULT_ERROR)
> > @@ -4638,9 +4809,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	if (vmf->pte)
> >  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out_page:
> > -	folio_unlock(folio);
> > +	if (!buf) {
> > +		folio_unlock(folio);
> > +	} else {
> > +		for (i = 0; i < BATCH_SWPIN_COUNT; i++)
> > +			folio_unlock(folios[i]);
> > +	}
> >  out_release:
> > -	folio_put(folio);
> > +	if (!buf) {
> > +		folio_put(folio);
> > +	} else {
> > +		for (i = 0; i < BATCH_SWPIN_COUNT; i++)
> > +			folio_put(folios[i]);
> > +	}
> >  	if (folio != swapcache && swapcache) {
> >  		folio_unlock(swapcache);
> >  		folio_put(swapcache);
>

Thanks
Barry