Andrew,

Looks like there are a few niggles for me to address below. So please don't let
v6 (currently in mm-unstable) progress to mm-stable. I'll aim to get a new
version out (or a fix patch on top of this one, depending on how the discussion
lands) in the next couple of days.

On 05/04/2024 11:13, David Hildenbrand wrote:
> On 03.04.24 13:40, Ryan Roberts wrote:
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>> include/linux/pgtable.h | 28 ++++++++++++++
>> include/linux/swap.h | 12 ++++--
>> mm/internal.h | 48 +++++++++++++++++++++++
>> mm/madvise.c | 12 ++++--
>> mm/memory.c | 13 ++++---
>> mm/swapfile.c | 86 ++++++++++++++++++++++++++++++++---------
>> 6 files changed, 167 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a3fc8150b047..0278259f7078 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct
>> mm_struct *mm,
>> }
>> #endif
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>
>
> Consecutive only in the page table or also in some other sense?
>
> I suspect: just unrelated non-present entries of any kind (swp, nonswp) and any
> offset/pfn.

Yes.

>
> Consider documenting that.

How about: "Clear multiple not present PTEs which are consecutive in the
pgtable"?

>
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock. The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> + for (;;) {
>> + pte_clear_not_present_full(mm, addr, ptep, full);
>> + if (--nr == 0)
>> + break;
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + }
>> +}
>> +#endif
>> +
>> #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>> extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>> unsigned long address,
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f6f78198f000..5737236dc3ce 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>> extern int swapcache_prepare(swp_entry_t);
>> extern void swap_free(swp_entry_t);
>> extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> -extern int free_swap_and_cache(swp_entry_t);
>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> int swap_type_of(dev_t device, sector_t offset);
>> int find_first_swap(dev_t *device);
>> extern unsigned int count_swap_pages(int, int);
>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct
>> *si)
>> #define free_pages_and_swap_cache(pages, nr) \
>> release_pages((pages), (nr));
>>
>
>
> [...]
>
>> +
>> +/**
>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @entry: Swap entry recovered from the first table entry.
>> + *
>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>> + * containing swap entries all with consecutive offsets and targeting the same
>> + * swap type.
>> + *
>
> Likely you should document that any swp pte bits are ignored?

Sorry, I don't understand this comment. I thought any non-none, non-present PTE
was always considered to contain only a "swap entry", and a swap entry consists
of a "type" and an "offset" only. (And it's a special "non-swap" swap entry if
type > SOME_CONSTANT.) Are you saying there are additional fields in the PTE
that are not part of the swap entry?

>
>> + * max_nr must be at least one and must be limited by the caller so scanning
>> + * cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> + */
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>> + swp_entry_t entry)
>> +{
>> + const pte_t *end_ptep = start_ptep + max_nr;
>> + unsigned long expected_offset = swp_offset(entry) + 1;
>> + unsigned int expected_type = swp_type(entry);
>> + pte_t *ptep = start_ptep + 1;
>> +
>> + VM_WARN_ON(max_nr < 1);
>> + VM_WARN_ON(non_swap_entry(entry));
>> +
>> + while (ptep < end_ptep) {
>> + pte_t pte = ptep_get(ptep);
>> +
>> + if (pte_none(pte) || pte_present(pte))
>> + break;
>> +
>> + entry = pte_to_swp_entry(pte);
>> +
>> + if (non_swap_entry(entry) ||
>> + swp_type(entry) != expected_type ||
>> + swp_offset(entry) != expected_offset)
>> + break;
>> +
>> + expected_offset++;
>> + ptep++;
>> + }
>> +
>> + return ptep - start_ptep;
>> +}
>
> Looks very clean :)
>
> I was wondering whether we could similarly construct the expected swp PTE and
> only check pte_same.
>
> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>
> ... or have a variant to increase only the swp offset for an existing pte. But
> non-trivial due to the arch-dependent format.
>
> But then, we'd fail on mismatch of other swp pte bits.
Hmm, perhaps I have a misunderstanding regarding "swp pte bits"...

>
>
> On swapin, when reusing this function (likely!), we'll need to make sure that
> the PTE bits match as well.
>
> See below regarding uffd-wp.
>
>
>> #endif /* CONFIG_MMU */
>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 1f77a51baaac..070bedb4996e 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>> struct folio *folio;
>> int nr_swap = 0;
>> unsigned long next;
>> + int nr, max_nr;
>> next = pmd_addr_end(addr, end);
>> if (pmd_trans_huge(*pmd))
>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>> return 0;
>> flush_tlb_batched_pending(mm);
>> arch_enter_lazy_mmu_mode();
>> - for (; addr != end; pte++, addr += PAGE_SIZE) {
>> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>> + nr = 1;
>> ptent = ptep_get(pte);
>> if (pte_none(ptent))
>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>> entry = pte_to_swp_entry(ptent);
>> if (!non_swap_entry(entry)) {
>> - nr_swap--;
>> - free_swap_and_cache(entry);
>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> + max_nr = (end - addr) / PAGE_SIZE;
>> + nr = swap_pte_batch(pte, max_nr, entry);
>> + nr_swap -= nr;
>> + free_swap_and_cache_nr(entry, nr);
>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>> } else if (is_hwpoison_entry(entry) ||
>> is_poisoned_swp_entry(entry)) {
>> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7dc6c3d9fa83..ef2968894718 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>> *tlb,
>> folio_remove_rmap_pte(folio, page, vma);
>> folio_put(folio);
>> } else if (!non_swap_entry(entry)) {
>> - /* Genuine swap entry, hence a private anon page */
>> + max_nr = (end - addr) / PAGE_SIZE;
>> + nr = swap_pte_batch(pte, max_nr, entry);
>> + /* Genuine swap entries, hence a private anon pages */
>> if (!should_zap_cows(details))
>> continue;
>> - rss[MM_SWAPENTS]--;
>> - if (unlikely(!free_swap_and_cache(entry)))
>> - print_bad_pte(vma, addr, ptent, NULL);
>> + rss[MM_SWAPENTS] -= nr;
>> + free_swap_and_cache_nr(entry, nr);
>> } else if (is_migration_entry(entry)) {
>> folio = pfn_swap_entry_folio(entry);
>> if (!should_zap_folio(details, folio))
>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>> WARN_ON_ONCE(1);
>> }
>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>
> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>
> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>
> On mixture, you either lose some or place too many markers.

What path are you concerned about here? I don't see how what you describe can
happen. swap_pte_batch() will only give me a batch of actual swap entries, and
actual swap entries don't contain uffd-wp info, IIUC. If the function gets to a
"non-swap" swap entry, it bails. I thought the uffd-wp info was populated based
on the VMA state at swap-in?
I think you are telling me that it's persisted per-pte across swap?

>
> A simple workaround would be to disable any such batching if the VMA does have
> uffd-wp enabled.
>
>> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>> add_mm_rss_vec(mm, rss);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0d44ee2b4f9c..d059de6896c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>> /* Reclaim the swap entry if swap is getting full*/
>> #define TTRS_FULL 0x4
>> -/* returns 1 if swap entry is freed */
>> +/*
>> + * returns number of pages in the folio that backs the swap entry. If positive,
>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>> + * folio was associated with the swap entry.
>> + */
>> static int __try_to_reclaim_swap(struct swap_info_struct *si,
>> unsigned long offset, unsigned long flags)
>> {
>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>> ret = folio_free_swap(folio);
>> folio_unlock(folio);
>> }
>> + ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>> folio_put(folio);
>> return ret;
>> }
>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>> spin_lock(&si->lock);
>> /* entry was freed successfully, try to use this again */
>> - if (swap_was_freed)
>> + if (swap_was_freed > 0)
>> goto checks;
>> goto scan; /* check next one */
>> }
>> @@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio)
>> return true;
>> }
>> -/*
>> - * Free the swap entry like above, but also try to
>> - * free the page cache entry if it is the last user.
>> - */
>> -int free_swap_and_cache(swp_entry_t entry)
>
> Can we have some documentation on what this function expects? How does nr relate to
> entry?
>
> i.e., offset range defined by [entry.offset, entry.offset + nr).

Yep, will add something along these lines.

>
>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>> {
>> - struct swap_info_struct *p;
>
> It might be easier to get if you do
>
> const unsigned long start_offset = swp_offset(entry);
> const unsigned long end_offset = start_offset + nr;

OK, will do this.

>
>> + unsigned long end = swp_offset(entry) + nr;
>> + unsigned int type = swp_type(entry);
>> + struct swap_info_struct *si;
>> + bool any_only_cache = false;
>> + unsigned long offset;
>> unsigned char count;
>> if (non_swap_entry(entry))
>> - return 1;
>> + return;
>> - p = get_swap_device(entry);
>> - if (p) {
>> - if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>> - put_swap_device(p);
>> - return 0;
>> + si = get_swap_device(entry);
>> + if (!si)
>> + return;
>> +
>> + if (WARN_ON(end > si->max))
>> + goto out;
>> +
>> + /*
>> + * First free all entries in the range.
>> + */
>> + for (offset = swp_offset(entry); offset < end; offset++) {
>> + if (!WARN_ON(data_race(!si->swap_map[offset]))) {
>
> Ouch "!(!))". Confusing.
>
> I'm sure there is a better way to write that, maybe using more lines
>
> if (data_race(si->swap_map[offset])) {
> ...
> } else {
> WARN_ON_ONCE(1);
> }

OK, I thought it was kinda neat myself. I'll change it.
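Probably something like this (untested sketch, folding in the
start_offset/end_offset naming you suggested above):

	for (offset = start_offset; offset < end_offset; offset++) {
		if (data_race(si->swap_map[offset])) {
			/*
			 * Drop our reference to the swap slot; remember if
			 * only the swap cache still holds it so the reclaim
			 * loop below knows whether it has any work to do.
			 */
			count = __swap_entry_free(si, swp_entry(type, offset));
			if (count == SWAP_HAS_CACHE)
				any_only_cache = true;
		} else {
			WARN_ON_ONCE(1);
		}
	}

That also gets rid of the double negation while still warning for the
should-never-happen case.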
>
>
>> + count = __swap_entry_free(si, swp_entry(type, offset));
>> + if (count == SWAP_HAS_CACHE)
>> + any_only_cache = true;
>> }
>> + }
>> +
>> + /*
>> + * Short-circuit the below loop if none of the entries had their
>> + * reference drop to zero.
>> + */
>> + if (!any_only_cache)
>> + goto out;
>> - count = __swap_entry_free(p, entry);
>> - if (count == SWAP_HAS_CACHE)
>> - __try_to_reclaim_swap(p, swp_offset(entry),
>> + /*
>> + * Now go back over the range trying to reclaim the swap cache. This is
>> + * more efficient for large folios because we will only try to reclaim
>> + * the swap once per folio in the common case. If we do
>> + * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>> + * latter will get a reference and lock the folio for every individual
>> + * page but will only succeed once the swap slot for every subpage is
>> + * zero.
>> + */
>> + for (offset = swp_offset(entry); offset < end; offset += nr) {
>> + nr = 1;
>> + if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>
> Here we use READ_ONCE() only, above data_race(). Hmmm.

Yes. I think this is correct. READ_ONCE() is a "marked access" which KCSAN
understands, so it won't complain about it. So data_race() isn't required when
READ_ONCE() (or WRITE_ONCE()) is used. I believe READ_ONCE() is required here
because we don't have a lock and we want to make sure we read it in a
non-tearing manner.

We don't need the READ_ONCE() above since we don't care about the exact value -
only that it's not 0 (because we should be holding a ref). So do a plain access
to give the compiler a bit more freedom. But we need to mark that with
data_race() to stop KCSAN from complaining.

>
>> + /*
>> + * Folios are always naturally aligned in swap so
>> + * advance forward to the next boundary. Zero means no
>> + * folio was found for the swap entry, so advance by 1
>> + * in this case. Negative value means folio was found
>> + * but could not be reclaimed. Here we can still advance
>> + * to the next boundary.
>> + */
>> + nr = __try_to_reclaim_swap(si, offset,
>> TTRS_UNMAPPED | TTRS_FULL);
>> - put_swap_device(p);
>> + if (nr == 0)
>> + nr = 1;
>> + else if (nr < 0)
>> + nr = -nr;
>> + nr = ALIGN(offset + 1, nr) - offset;
>> + }
>
> Apart from that nothing jumped at me.

Thanks for the review!
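PS: for the free_swap_and_cache_nr() documentation you asked for, I'm currently
thinking of something roughly along these lines (wording not final):

/**
 * free_swap_and_cache_nr() - Release reference on range of swap entries and
 *                            reclaim their cache if no more references remain.
 * @entry: First entry of range.
 * @nr: Number of entries in range.
 *
 * For each swap entry in the contiguous range, release a reference. If any
 * swap entries become free, try to reclaim their underlying folios, if
 * present. The offset range is defined by [entry.offset, entry.offset + nr).
 */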