On Fri, Aug 16, 2024 at 7:06 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
> >
> >
> >
> > On 2024/8/15 17:47, Kairui Song wrote:
> > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >>
> > >> From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> > >
> > > Hi Chuanhua,
> > >
> > >>
> > ...
> >
> > >> +
> > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >> +{
> > >> +        struct vm_area_struct *vma = vmf->vma;
> > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >> +        unsigned long orders;
> > >> +        struct folio *folio;
> > >> +        unsigned long addr;
> > >> +        swp_entry_t entry;
> > >> +        spinlock_t *ptl;
> > >> +        pte_t *pte;
> > >> +        gfp_t gfp;
> > >> +        int order;
> > >> +
> > >> +        /*
> > >> +         * If uffd is active for the vma we need per-page fault fidelity to
> > >> +         * maintain the uffd semantics.
> > >> +         */
> > >> +        if (unlikely(userfaultfd_armed(vma)))
> > >> +                goto fallback;
> > >> +
> > >> +        /*
> > >> +         * A large swapped out folio could be partially or fully in zswap. We
> > >> +         * lack handling for such cases, so fallback to swapping in order-0
> > >> +         * folio.
> > >> +         */
> > >> +        if (!zswap_never_enabled())
> > >> +                goto fallback;
> > >> +
> > >> +        entry = pte_to_swp_entry(vmf->orig_pte);
> > >> +        /*
> > >> +         * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > >> +         * and suitable for swapping THP.
> > >> +         */
> > >> +        orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > >> +                        TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> > >> +        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > >> +        orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> > >> +
> > >> +        if (!orders)
> > >> +                goto fallback;
> > >> +
> > >> +        pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
> > >> +        if (unlikely(!pte))
> > >> +                goto fallback;
> > >> +
> > >> +        /*
> > >> +         * For do_swap_page, find the highest order where the aligned range is
> > >> +         * completely swap entries with contiguous swap offsets.
> > >> +         */
> > >> +        order = highest_order(orders);
> > >> +        while (orders) {
> > >> +                addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> +                if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> > >> +                        break;
> > >> +                order = next_order(&orders, order);
> > >> +        }
> > >> +
> > >> +        pte_unmap_unlock(pte, ptl);
> > >> +
> > >> +        /* Try allocating the highest of the remaining orders. */
> > >> +        gfp = vma_thp_gfp_mask(vma);
> > >> +        while (orders) {
> > >> +                addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> +                folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > >> +                if (folio)
> > >> +                        return folio;
> > >> +                order = next_order(&orders, order);
> > >> +        }
> > >> +
> > >> +fallback:
> > >> +#endif
> > >> +        return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> > >> +}
> > >> +
> > >> +
> > >>  /*
> > >>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > >>   * but allow concurrent faults), and pte mapped but not yet locked.
> > >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>         if (!folio) {
> > >>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >>                     __swap_count(entry) == 1) {
> > >> -                        /*
> > >> -                         * Prevent parallel swapin from proceeding with
> > >> -                         * the cache flag. Otherwise, another thread may
> > >> -                         * finish swapin first, free the entry, and swapout
> > >> -                         * reusing the same entry. It's undetectable as
> > >> -                         * pte_same() returns true due to entry reuse.
> > >> -                         */
> > >> -                        if (swapcache_prepare(entry, 1)) {
> > >> -                                /* Relax a bit to prevent rapid repeated page faults */
> > >> -                                schedule_timeout_uninterruptible(1);
> > >> -                                goto out;
> > >> -                        }
> > >> -                        need_clear_cache = true;
> > >> -
> > >>                          /* skip swapcache */
> > >> -                        folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > >> -                                                vma, vmf->address, false);
> > >> +                        folio = alloc_swap_folio(vmf);
> > >>                          page = &folio->page;
> > >>                          if (folio) {
> > >>                                  __folio_set_locked(folio);
> > >>                                  __folio_set_swapbacked(folio);
> > >>
> > >> +                                nr_pages = folio_nr_pages(folio);
> > >> +                                if (folio_test_large(folio))
> > >> +                                        entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > >> +                                /*
> > >> +                                 * Prevent parallel swapin from proceeding with
> > >> +                                 * the cache flag. Otherwise, another thread may
> > >> +                                 * finish swapin first, free the entry, and swapout
> > >> +                                 * reusing the same entry. It's undetectable as
> > >> +                                 * pte_same() returns true due to entry reuse.
> > >> +                                 */
> > >> +                                if (swapcache_prepare(entry, nr_pages)) {
> > >> +                                        /* Relax a bit to prevent rapid repeated page faults */
> > >> +                                        schedule_timeout_uninterruptible(1);
> > >> +                                        goto out_page;
> > >> +                                }
> > >> +                                need_clear_cache = true;
> > >> +
> > >>                                  if (mem_cgroup_swapin_charge_folio(folio,
> > >>                                                          vma->vm_mm, GFP_KERNEL,
> > >>                                                          entry)) {
> > >>                                          ret = VM_FAULT_OOM;
> > >>                                          goto out_page;
> > >>                                  }
> > >
> > > After your patch, with a kernel build test, I'm seeing kernel log
> > > spam like this:
> > > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> > > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > ............
> > >
> > > And heavy performance loss with memcg-limited workloads when mTHP
> > > is enabled.
> > >
> > > After some debugging, the problematic part is the
> > > mem_cgroup_swapin_charge_folio() call above.
> > > Under pressure, the cgroup charge fails easily for mTHP; one 64K
> > > swapin requires much more aggressive reclaim to succeed.
> > >
> > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the log spam is gone
> > > and mTHP swapin has a much higher success rate, but this might not
> > > be the right way.
> > >
> > > For this particular issue, maybe you can change the charge order:
> > > try charging first; if it succeeds, use mTHP, and if it fails, fall
> > > back to 4K?
> >
> > This is what we did in alloc_anon_folio(), see 085ff35e7636
> > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"):
> > 1) fall back earlier
> > 2) use the same GFP flags for allocation and charge
> >
> > but it seems a little more complicated for the swapin charge
>
> Kefeng, thanks! I guess we can continue using the same approach and
> it's not too complicated.
>
> Kairui, sorry for the trouble and thanks for the report! Could you
> check whether the solution below resolves the issue? On phones, we
> don't encounter the scenarios you're facing.
>
> From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
> From: Barry Song <v-songbaohua@xxxxxxxx>
> Date: Fri, 16 Aug 2024 10:51:48 +1200
> Subject: [PATCH] mm: fall back to next_order if charging mTHP fails
>
> When memcg approaches its limit, charging an mTHP becomes difficult.
> At this point, when the charge fails, we fall back to the next order
> to avoid repeatedly retrying larger orders.
>
> Reported-by: Kairui Song <ryncsn@xxxxxxxxx>
> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> ---
>  mm/memory.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ed3603aaf31..6cba28ef91e7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>          while (orders) {
>                  addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>                  folio = vma_alloc_folio(gfp, order, vma, addr, true);
> -                if (folio)
> -                        return folio;
> +                if (folio) {
> +                        if (!mem_cgroup_swapin_charge_folio(folio,
> +                                                vma->vm_mm, gfp, entry))
> +                                return folio;
> +                        folio_put(folio);
> +                }
>                  order = next_order(&orders, order);
>          }
>
> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                  }
>                                  need_clear_cache = true;
>
> -                                if (mem_cgroup_swapin_charge_folio(folio,
> +                                if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
>                                                          vma->vm_mm, GFP_KERNEL,
>                                                          entry)) {
>                                          ret = VM_FAULT_OOM;
> --
> 2.34.1
>

Hi Barry,

After the fix, the spamming log is gone. Thanks for the fix!

>
> Thanks
> Barry
>
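
For readers following the discussion: the core of the fix above is that a failed
memcg charge for a large order no longer leaks VM_FAULT_OOM out of the swapin
path; the allocation loop simply drops to the next enabled order and retries,
with order-0 as the final fallback. The self-contained user-space C sketch below
only models that control flow under made-up names (toy_memcg, toy_charge,
toy_highest_order, toy_next_order, alloc_and_charge); it is not kernel code and
none of these helpers are kernel APIs.

#include <stddef.h>
#include <stdio.h>

/* Toy memcg: a plain byte budget that a "charge" may not exceed. */
struct toy_memcg {
        size_t limit;
        size_t usage;
};

/* Stand-in for the memcg charge: returns 0 on success, -1 if over the limit. */
static int toy_charge(struct toy_memcg *memcg, size_t bytes)
{
        if (memcg->usage + bytes > memcg->limit)
                return -1;
        memcg->usage += bytes;
        return 0;
}

/* Stand-ins for highest_order()/next_order(): orders is a bitmask of enabled orders. */
static int toy_highest_order(unsigned long orders)
{
        return (int)(sizeof(orders) * 8 - 1) - __builtin_clzl(orders);
}

static int toy_next_order(unsigned long *orders, int order)
{
        *orders &= ~(1UL << order);
        return *orders ? toy_highest_order(*orders) : -1;
}

/*
 * Model of the fixed allocation loop: try the largest enabled order first;
 * on charge failure, retry the next lower enabled order instead of
 * reporting OOM to the fault handler.
 */
static int alloc_and_charge(struct toy_memcg *memcg, unsigned long orders,
                            size_t page_size)
{
        int order = toy_highest_order(orders);

        while (orders) {
                size_t bytes = page_size << order;

                /* Allocation always succeeds in this toy; only the charge can fail. */
                if (toy_charge(memcg, bytes) == 0)
                        return order;   /* charged: swap in with this folio order */

                /* Charge failed: drop this order and fall back to the next one. */
                order = toy_next_order(&orders, order);
        }
        return -1;      /* even order-0 could not be charged */
}

int main(void)
{
        struct toy_memcg memcg = { .limit = 32 * 1024, .usage = 0 };
        /* Pretend orders 4 (64K), 2 (16K) and 0 (4K) are enabled. */
        unsigned long orders = (1UL << 4) | (1UL << 2) | (1UL << 0);
        int order = alloc_and_charge(&memcg, orders, 4096);

        if (order >= 0)
                printf("swapped in with an order-%d folio (%zu KiB charged)\n",
                       order, (size_t)(4096UL << order) / 1024);
        else
                printf("could not charge even an order-0 folio\n");
        return 0;
}

With the 32 KiB budget above, the 64 KiB attempt fails to charge and the loop
falls back to an order-2 folio instead of failing the fault, which mirrors the
behaviour the patch gives alloc_swap_folio() while do_swap_page() keeps its
VM_FAULT_OOM path only for the order-0 case.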