On Sat, Nov 23, 2024 at 3:54 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 21/11/2024 22:25, Barry Song wrote:
> > From: Barry Song <v-songbaohua@xxxxxxxx>
> >
> > The swapfile can compress/decompress at 4 * PAGES granularity, reducing
> > CPU usage and improving the compression ratio. However, if allocating an
> > mTHP fails and we fall back to a single small folio, the entire large
> > block must still be decompressed. This results in a 16 KiB area requiring
> > 4 page faults, where each fault decompresses 16 KiB but retrieves only
> > 4 KiB of data from the block. To address this inefficiency, we instead
> > fall back to 4 small folios, ensuring that each decompression occurs
> > only once.
> >
> > Allowing swap_read_folio() to decompress and read into an array of
> > 4 folios would be extremely complex, requiring extensive changes
> > throughout the stack, including swap_read_folio, zeromap,
> > zswap, and final swap implementations like zRAM. In contrast,
> > having these components fill a large folio with 4 subpages is much
> > simpler.
> >
> > To avoid a full-stack modification, we introduce a per-CPU order-2
> > large folio as a buffer. This buffer is used for swap_read_folio(),
> > after which the data is copied into the 4 small folios. Finally, in
> > do_swap_page(), all these small folios are mapped.
> >
> > Co-developed-by: Chuanhua Han <chuanhuahan@xxxxxxxxx>
> > Signed-off-by: Chuanhua Han <chuanhuahan@xxxxxxxxx>
> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > ---
> >  mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 192 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 209885a4134f..e551570c1425 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> >  	return folio;
> >  }
> >
> > +#define BATCH_SWPIN_ORDER	2
>
> Hi Barry,
>
> Thanks for the series and the numbers in the cover letter.
>
> Just a few things.
>
> Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?

Technically, yes. I'm also considering removing ZSMALLOC_MULTI_PAGES_ORDER
and always setting it to 2, which is the minimum anonymous mTHP order. The
main reason is that it may be difficult for users to select an appropriate
value for that Kconfig option. On the other hand, 16KB already captures most
of the benefit zstd gets from compressing and decompressing larger blocks:
going from 16KB to 32KB or 64KB still helps, but the improvement is not as
significant as the jump from 4KB to 16KB. For example, using zstd to
compress and decompress the 'Beyond Compare' software package:

root@barry-desktop:~# ./zstd
File size: 182502912 bytes
4KB Block: Compression time = 0.765915 seconds, Decompression time = 0.203366 seconds
Original size: 182502912 bytes
Compressed size: 66089193 bytes
Compression ratio: 36.21%
16KB Block: Compression time = 0.558595 seconds, Decompression time = 0.153837 seconds
Original size: 182502912 bytes
Compressed size: 59159073 bytes
Compression ratio: 32.42%
32KB Block: Compression time = 0.538106 seconds, Decompression time = 0.137768 seconds
Original size: 182502912 bytes
Compressed size: 57958701 bytes
Compression ratio: 31.76%
64KB Block: Compression time = 0.532212 seconds, Decompression time = 0.127592 seconds
Original size: 182502912 bytes
Compressed size: 56700795 bytes
Compression ratio: 31.07%

In that case, would we no longer need to rely on ZSMALLOC_MULTI_PAGES_ORDER?
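By the way, in case it helps to reproduce the block-size comparison above,
below is a rough sketch of the kind of per-block benchmark I mean. It is
not the exact ./zstd tool used above: it only sums compressed sizes and
skips the timing, the "testfile" input path is just a placeholder, and it
assumes zstd's single-shot ZSTD_compress() API with the default level 3.
Compressing each block independently is roughly what zram does per
multi-page block:

/* build with: gcc -O2 block_bench.c -lzstd */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static void bench_block_size(const unsigned char *src, size_t file_size,
			     size_t block_size, int level)
{
	size_t bound = ZSTD_compressBound(block_size);
	void *dst = malloc(bound);
	size_t off, total = 0;

	/* compress each fixed-size block independently */
	for (off = 0; off < file_size; off += block_size) {
		size_t n = file_size - off < block_size ?
			   file_size - off : block_size;
		size_t c = ZSTD_compress(dst, bound, src + off, n, level);

		if (ZSTD_isError(c)) {
			fprintf(stderr, "compress error: %s\n",
				ZSTD_getErrorName(c));
			break;
		}
		total += c;
	}
	printf("%zuKB Block: %zu -> %zu bytes, ratio %.2f%%\n",
	       block_size / 1024, file_size, total,
	       100.0 * total / file_size);
	free(dst);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile"; /* placeholder */
	size_t sizes[] = { 4096, 16384, 32768, 65536 };
	unsigned char *buf;
	long file_size;
	FILE *f = fopen(path, "rb");
	unsigned int i;

	if (!f)
		return 1;
	fseek(f, 0, SEEK_END);
	file_size = ftell(f);
	rewind(f);
	buf = malloc(file_size);
	if (!buf || fread(buf, 1, file_size, f) != (size_t)file_size)
		return 1;
	fclose(f);

	printf("File size: %ld bytes\n", file_size);
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		bench_block_size(buf, file_size, sizes[i], 3);
	free(buf);
	return 0;
}

This obviously ignores zsmalloc overhead and per-object metadata, so the
absolute numbers will differ from zram, but the relative trend across
block sizes should match.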
>
> Did you check the performance difference with and without patch 4?

I retested after reverting patch 4, and the sys time increased to over 40
minutes again, though it was still slightly better than without the entire
series.

*** Executing round 1 ***
real	7m49.342s
user	80m53.675s
sys	42m28.393s
pswpin: 29965548
pswpout: 51127359
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11347712
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6641230
pgpgin: 147376000
pgpgout: 213343124

*** Executing round 2 ***
real	7m41.331s
user	81m16.631s
sys	41m39.845s
pswpin: 29208867
pswpout: 50006026
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11104912
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6483827
pgpgin: 144057340
pgpgout: 208887688

*** Executing round 3 ***
real	7m47.280s
user	78m36.767s
sys	37m32.210s
pswpin: 26426526
pswpout: 45420734
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10104304
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 5884839
pgpgin: 132013648
pgpgout: 190537264

*** Executing round 4 ***
real	7m56.723s
user	80m36.837s
sys	41m35.979s
pswpin: 29367639
pswpout: 50059254
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11116176
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6514064
pgpgin: 144593828
pgpgout: 209080468

*** Executing round 5 ***
real	7m53.806s
user	80m30.953s
sys	40m14.870s
pswpin: 28091760
pswpout: 48495748
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10779720
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6244819
pgpgin: 138813124
pgpgout: 202885480

I suspect this is due to the large number of partial reads (about 10%,
3505537/35159852):

root@barry-desktop:~# cat /sys/block/zram0/multi_pages_debug_stat
zram_bio write/read multi_pages count:54452828 35159852
zram_bio failed write/read multi_pages count 0 0
zram_bio partial write/read multi_pages count 4 3505537
multi_pages_miss_free 0

This workload doesn't cause fragmentation in the buddy allocator, so the
fallbacks are likely due to MEMCG_CHARGE failures.

>
> I know that it wont help if you have a lot of unmovable pages
> scattered everywhere, but were you able to compare the performance
> of defrag=always vs patch 4? I feel like if you have space for 4 folios
> then hopefully compaction should be able to do its job and you can
> directly fill the large folio if the unmovable pages are better placed.
> Johannes' series on preventing type mixing [1] would help.
>
> [1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@xxxxxxxxxxx/

I believe this could help, but defragmentation is a complex issue,
especially on phones, where various components such as drivers, DMA-BUF,
multimedia, and graphics allocate memory. We observed that a fresh system
could initially provide mTHP, but after a few hours, obtaining mTHP became
very challenging. I'm happy to arrange a test of Johannes' series on phones
(though it is sometimes quite hard to backport to the Android kernel) to
see if it brings any improvements.
>
> Thanks,
> Usama

> > +#define BATCH_SWPIN_COUNT	(1 << BATCH_SWPIN_ORDER)
> > +#define BATCH_SWPIN_SIZE	(PAGE_SIZE << BATCH_SWPIN_ORDER)
> > +
> > +struct batch_swpin_buffer {
> > +	struct folio *folio;
> > +	struct mutex mutex;
> > +};
> > +
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> >  {
> > @@ -4120,7 +4129,101 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> >  	return orders;
> >  }
> >
> > -static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +static DEFINE_PER_CPU(struct batch_swpin_buffer, swp_buf);
> > +
> > +static int __init batch_swpin_buffer_init(void)
> > +{
> > +	int ret, cpu;
> > +	struct batch_swpin_buffer *buf;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		buf = per_cpu_ptr(&swp_buf, cpu);
> > +		buf->folio = (struct folio *)alloc_pages_node(cpu_to_node(cpu),
> > +				GFP_KERNEL | __GFP_COMP, BATCH_SWPIN_ORDER);
> > +		if (!buf->folio) {
> > +			ret = -ENOMEM;
> > +			goto err;
> > +		}
> > +		mutex_init(&buf->mutex);
> > +	}
> > +	return 0;
> > +
> > +err:
> > +	for_each_possible_cpu(cpu) {
> > +		buf = per_cpu_ptr(&swp_buf, cpu);
> > +		if (buf->folio) {
> > +			folio_put(buf->folio);
> > +			buf->folio = NULL;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +core_initcall(batch_swpin_buffer_init);
> > +
> > +static struct folio *alloc_batched_swap_folios(struct vm_fault *vmf,
> > +		struct batch_swpin_buffer **buf, struct folio **folios,
> > +		swp_entry_t entry)
> > +{
> > +	unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
> > +	struct batch_swpin_buffer *sbuf = raw_cpu_ptr(&swp_buf);
> > +	struct folio *folio = sbuf->folio;
> > +	unsigned long addr;
> > +	int i;
> > +
> > +	if (unlikely(!folio))
> > +		return NULL;
> > +
> > +	for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
> > +		addr = haddr + i * PAGE_SIZE;
> > +		folios[i] = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma, addr);
> > +		if (!folios[i])
> > +			goto err;
> > +		if (mem_cgroup_swapin_charge_folio(folios[i], vmf->vma->vm_mm,
> > +					GFP_KERNEL, entry))
> > +			goto err;
> > +	}
> > +
> > +	mutex_lock(&sbuf->mutex);
> > +	*buf = sbuf;
> > +#ifdef CONFIG_MEMCG
> > +	folio->memcg_data = (*folios)->memcg_data;
> > +#endif
> > +	return folio;
> > +
> > +err:
> > +	for (i--; i >= 0; i--)
> > +		folio_put(folios[i]);
> > +	return NULL;
> > +}
> > +
> > +static void fill_batched_swap_folios(struct vm_fault *vmf,
> > +		void *shadow, struct batch_swpin_buffer *buf,
> > +		struct folio *folio, struct folio **folios)
> > +{
> > +	unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
> > +	unsigned long addr;
> > +	int i;
> > +
> > +	for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
> > +		addr = haddr + i * PAGE_SIZE;
> > +		__folio_set_locked(folios[i]);
> > +		__folio_set_swapbacked(folios[i]);
> > +		if (shadow)
> > +			workingset_refault(folios[i], shadow);
> > +		folio_add_lru(folios[i]);
> > +		copy_user_highpage(&folios[i]->page, folio_page(folio, i),
> > +				addr, vmf->vma);
> > +		if (folio_test_uptodate(folio))
> > +			folio_mark_uptodate(folios[i]);
> > +	}
> > +
> > +	folio->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~(1UL << PG_head));
> > +	mutex_unlock(&buf->mutex);
> > +}
> > +
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf,
> > +		struct batch_swpin_buffer **buf,
> > +		struct folio **folios)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	unsigned long orders;
> > @@ -4180,6 +4283,9 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >
> >  	pte_unmap_unlock(pte, ptl);
> >
> > +	if (!orders)
> > +		goto fallback;
> > +
> >  	/* Try allocating the highest of the remaining orders. */
> >  	gfp = vma_thp_gfp_mask(vma);
> >  	while (orders) {
> > @@ -4194,14 +4300,29 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  		order = next_order(&orders, order);
> >  	}
> >
> > +	/*
> > +	 * During swap-out, a THP might have been compressed into multiple
> > +	 * order-2 blocks to optimize CPU usage and compression ratio.
> > +	 * Attempt to batch swap-in 4 smaller folios to ensure they are
> > +	 * decompressed together as a single unit only once.
> > +	 */
> > +	return alloc_batched_swap_folios(vmf, buf, folios, entry);
> > +
> >  fallback:
> >  	return __alloc_swap_folio(vmf);
> >  }
> >  #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> > -static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf,
> > +		struct batch_swpin_buffer **buf,
> > +		struct folio **folios)
> >  {
> >  	return __alloc_swap_folio(vmf);
> >  }
> > +static inline void fill_batched_swap_folios(struct vm_fault *vmf,
> > +		void *shadow, struct batch_swpin_buffer *buf,
> > +		struct folio *folio, struct folio **folios)
> > +{
> > +}
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> >  static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > @@ -4216,6 +4337,8 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> >   */
> >  vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> > +	struct folio *folios[BATCH_SWPIN_COUNT] = { NULL };
> > +	struct batch_swpin_buffer *buf = NULL;
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct folio *swapcache, *folio = NULL;
> >  	DECLARE_WAITQUEUE(wait, current);
> > @@ -4228,7 +4351,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	pte_t pte;
> >  	vm_fault_t ret = 0;
> >  	void *shadow = NULL;
> > -	int nr_pages;
> > +	int nr_pages, i;
> >  	unsigned long page_idx;
> >  	unsigned long address;
> >  	pte_t *ptep;
> > @@ -4296,7 +4419,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >  	    __swap_count(entry) == 1) {
> >  		/* skip swapcache */
> > -		folio = alloc_swap_folio(vmf);
> > +		folio = alloc_swap_folio(vmf, &buf, folios);
> >  		if (folio) {
> >  			__folio_set_locked(folio);
> >  			__folio_set_swapbacked(folio);
> > @@ -4327,10 +4450,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  			mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> >
> >  			shadow = get_shadow_from_swap_cache(entry);
> > -			if (shadow)
> > +			if (shadow && !buf)
> >  				workingset_refault(folio, shadow);
> > -
> > -			folio_add_lru(folio);
> > +			if (!buf)
> > +				folio_add_lru(folio);
> >
> >  			/* To provide entry to swap_read_folio() */
> >  			folio->swap = entry;
> > @@ -4361,6 +4484,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  			count_vm_event(PGMAJFAULT);
> >  			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> >  			page = folio_file_page(folio, swp_offset(entry));
> > +			/*
> > +			 * Copy data into batched small folios from the large
> > +			 * folio buffer
> > +			 */
> > +			if (buf) {
> > +				fill_batched_swap_folios(vmf, shadow, buf, folio, folios);
> > +				folio = folios[0];
> > +				page = &folios[0]->page;
> > +				goto do_map;
> > +			}
> >  	} else if (PageHWPoison(page)) {
> >  		/*
> >  		 * hwpoisoned dirty swapcache pages are kept for killing
> > @@ -4415,6 +4548,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		lru_add_drain();
> >  	}
> >
> > +do_map:
> >  	folio_throttle_swaprate(folio, GFP_KERNEL);
> >
> >  	/*
> > @@ -4431,8 +4565,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	}
> >
> >  	/* allocated large folios for SWP_SYNCHRONOUS_IO */
> > -	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> > -		unsigned long nr = folio_nr_pages(folio);
> > +	if ((folio_test_large(folio) || buf) && !folio_test_swapcache(folio)) {
> > +		unsigned long nr = buf ? BATCH_SWPIN_COUNT : folio_nr_pages(folio);
> >  		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> >  		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> >  		pte_t *folio_ptep = vmf->pte - idx;
> > @@ -4527,6 +4661,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		}
> >  	}
> >
> > +	/* Batched mapping of allocated small folios for SWP_SYNCHRONOUS_IO */
> > +	if (buf) {
> > +		for (i = 0; i < nr_pages; i++)
> > +			arch_swap_restore(swp_entry(swp_type(entry),
> > +					swp_offset(entry) + i), folios[i]);
> > +		swap_free_nr(entry, nr_pages);
> > +		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +		add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +		rmap_flags |= RMAP_EXCLUSIVE;
> > +		for (i = 0; i < nr_pages; i++) {
> > +			unsigned long addr = address + i * PAGE_SIZE;
> > +
> > +			pte = mk_pte(&folios[i]->page, vma->vm_page_prot);
> > +			if (pte_swp_soft_dirty(vmf->orig_pte))
> > +				pte = pte_mksoft_dirty(pte);
> > +			if (pte_swp_uffd_wp(vmf->orig_pte))
> > +				pte = pte_mkuffd_wp(pte);
> > +			if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
> > +			    !pte_needs_soft_dirty_wp(vma, pte)) {
> > +				pte = pte_mkwrite(pte, vma);
> > +				if ((vmf->flags & FAULT_FLAG_WRITE) && (i == page_idx)) {
> > +					pte = pte_mkdirty(pte);
> > +					vmf->flags &= ~FAULT_FLAG_WRITE;
> > +				}
> > +			}
> > +			flush_icache_page(vma, &folios[i]->page);
> > +			folio_add_new_anon_rmap(folios[i], vma, addr, rmap_flags);
> > +			set_pte_at(vma->vm_mm, addr, ptep + i, pte);
> > +			arch_do_swap_page_nr(vma->vm_mm, vma, addr, pte, pte, 1);
> > +			if (i == page_idx)
> > +				vmf->orig_pte = pte;
> > +			folio_unlock(folios[i]);
> > +		}
> > +		goto wp_page;
> > +	}
> > +
> >  	/*
> >  	 * Some architectures may have to restore extra metadata to the page
> >  	 * when reading from swap. This metadata may be indexed by swap entry
> > @@ -4612,6 +4782,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		folio_put(swapcache);
> >  	}
> >
> > +wp_page:
> >  	if (vmf->flags & FAULT_FLAG_WRITE) {
> >  		ret |= do_wp_page(vmf);
> >  		if (ret & VM_FAULT_ERROR)
> > @@ -4638,9 +4809,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	if (vmf->pte)
> >  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out_page:
> > -	folio_unlock(folio);
> > +	if (!buf) {
> > +		folio_unlock(folio);
> > +	} else {
> > +		for (i = 0; i < BATCH_SWPIN_COUNT; i++)
> > +			folio_unlock(folios[i]);
> > +	}
> >  out_release:
> > -	folio_put(folio);
> > +	if (!buf) {
> > +		folio_put(folio);
> > +	} else {
> > +		for (i = 0; i < BATCH_SWPIN_COUNT; i++)
> > +			folio_put(folios[i]);
> > +	}
> >  	if (folio != swapcache && swapcache) {
> >  		folio_unlock(swapcache);
> >  		folio_put(swapcache);
>

Thanks
Barry