On 03/04/2024 01:30, Zi Yan wrote:
> On 27 Mar 2024, at 10:45, Ryan Roberts wrote:
>
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>>  include/linux/pgtable.h | 28 +++++++++++++++
>>  include/linux/swap.h    | 12 +++++--
>>  mm/internal.h           | 48 +++++++++++++++++++++++++
>>  mm/madvise.c            | 12 ++++---
>>  mm/memory.c             | 13 +++----
>>  mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>>  6 files changed, 157 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 09c85c7bf9c2..8185939df1e8 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>>  }
>>  #endif
>>
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock. The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +	for (;;) {
>> +		pte_clear_not_present_full(mm, addr, ptep, full);
>> +		if (--nr == 0)
>> +			break;
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +	}
>> +}
>> +#endif
>> +
>
> Would the code below be better?
>
> for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> 	pte_clear_not_present_full(mm, addr, ptep, full);

I certainly agree that this is cleaner and more standard. But I'm copying
the pattern used by the other batch helpers. I believe this pattern was
first done by Willy for set_ptes(), then continued by DavidH for
wrprotect_ptes() and clear_full_ptes().

I guess the benefit is that ptep and addr are only incremented if we are
going around the loop again.
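To spell that out (just a sketch for illustration, not code from the patch):
with the conventional form, the increments in the third clause still run once
more after the final clear, just before the i < nr test fails:

	unsigned int i;

	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
		pte_clear_not_present_full(mm, addr, ptep, full);
	/* ptep and addr have now been advanced nr times, although only
	 * nr - 1 of those advances fed another call. */

The for (;;) form only advances ptep and addr when another iteration will
actually run, so there are no trailing increments.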
I'd rather continue to be consistent with those other helpers.

>
> --
> Best Regards,
> Yan, Zi