April 3, 2024 at 8:30 AM, "Zi Yan" <ziy@xxxxxxxxxx> wrote: > > On 27 Mar 2024, at 10:45, Ryan Roberts wrote: > > > > > Now that we no longer have a convenient flag in the cluster to determine > > > > if a folio is large, free_swap_and_cache() will take a reference and > > > > lock a large folio much more often, which could lead to contention and > > > > (e.g.) failure to split large folios, etc. > > > > Let's solve that problem by batch freeing swap and cache with a new > > > > function, free_swap_and_cache_nr(), to free a contiguous range of swap > > > > entries together. This allows us to first drop a reference to each swap > > > > slot before we try to release the cache folio. This means we only try to > > > > release the folio once, only taking the reference and lock once - much > > > > better than the previous 512 times for the 2M THP case. > > > > Contiguous swap entries are gathered in zap_pte_range() and > > > > madvise_free_pte_range() in a similar way to how present ptes are > > > > already gathered in zap_pte_range(). > > > > While we are at it, let's simplify by converting the return type of both > > > > functions to void. The return value was used only by zap_pte_range() to > > > > print a bad pte, and was ignored by everyone else, so the extra > > > > reporting wasn't exactly guaranteed. We will still get the warning with > > > > most of the information from get_swap_device(). With the batch version, > > > > we wouldn't know which pte was bad anyway so could print the wrong one. > > > > Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx> > > > > --- > > > > include/linux/pgtable.h | 28 +++++++++++++++ > > > > include/linux/swap.h | 12 +++++-- > > > > mm/internal.h | 48 +++++++++++++++++++++++++ > > > > mm/madvise.c | 12 ++++--- > > > > mm/memory.c | 13 +++---- > > > > mm/swapfile.c | 78 ++++++++++++++++++++++++++++++----------- > > > > 6 files changed, 157 insertions(+), 34 deletions(-) > > > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > > > > index 09c85c7bf9c2..8185939df1e8 100644 > > > > --- a/include/linux/pgtable.h > > > > +++ b/include/linux/pgtable.h > > > > @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm, > > > > } > > > > #endif > > > > +#ifndef clear_not_present_full_ptes > > > > +/** > > > > + * clear_not_present_full_ptes - Clear consecutive not present PTEs. > > > > + * @mm: Address space the ptes represent. > > > > + * @addr: Address of the first pte. > > > > + * @ptep: Page table pointer for the first entry. > > > > + * @nr: Number of entries to clear. > > > > + * @full: Whether we are clearing a full mm. > > > > + * > > > > + * May be overridden by the architecture; otherwise, implemented as a simple > > > > + * loop over pte_clear_not_present_full(). > > > > + * > > > > + * Context: The caller holds the page table lock. The PTEs are all not present. > > > > + * The PTEs are all in the same PMD. > > > > + */ > > > > +static inline void clear_not_present_full_ptes(struct mm_struct *mm, > > > > + unsigned long addr, pte_t *ptep, unsigned int nr, int full) > > > > +{ > > > > + for (;;) { > > > > + pte_clear_not_present_full(mm, addr, ptep, full); > > > > + if (--nr == 0) > > > > + break; > > > > + ptep++; > > > > + addr += PAGE_SIZE; > > > > + } > > > > +} > > > > +#endif > > > > + > > > > Would the code below be better? > > for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) > FWIW for (; nr-- > 0; ptep++, addr += PAGE_SIZE) pte_clear_not_present_full(mm, addr, ptep, full); Thanks, Lance > pte_clear_not_present_full(mm, addr, ptep, full); > > -- > > Best Regards, > > Yan, Zi >