Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

Barry Song <21cnbao@xxxxxxxxx> · Thu, 30 Nov 2023 02:00:48 +1300

On Thu, Nov 30, 2023 at 1:43 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 28/11/2023 20:23, Barry Song wrote:
> > On Wed, Nov 29, 2023 at 12:49 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>
> >> On 28/11/2023 08:17, Barry Song wrote:
> >>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> >>>> +                                    unsigned long addr, pte_t *ptep)
> >>>> +{
> >>>> +    /*
> >>>> +     * When doing a full address space teardown, we can avoid unfolding the
> >>>> +     * contiguous range, and therefore avoid the associated tlbi. Instead,
> >>>> +     * just get and clear the pte. The caller is promising to call us for
> >>>> +     * every pte, so every pte in the range will be cleared by the time the
> >>>> +     * tlbi is issued.
> >>>> +     *
> >>>> +     * This approach is not perfect though, as for the duration between
> >>>> +     * returning from the first call to ptep_get_and_clear_full() and making
> >>>> +     * the final call, the contpte block in an intermediate state, where
> >>>> +     * some ptes are cleared and others are still set with the PTE_CONT bit.
> >>>> +     * If any other APIs are called for the ptes in the contpte block during
> >>>> +     * that time, we have to be very careful. The core code currently
> >>>> +     * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> >>>> +     * ptep_get() must be careful to ignore the cleared entries when
> >>>> +     * accumulating the access and dirty bits - the same goes for
> >>>> +     * ptep_get_lockless(). The only other calls we might resonably expect
> >>>> +     * are to set markers in the previously cleared ptes. (We shouldn't see
> >>>> +     * valid entries being set until after the tlbi, at which point we are
> >>>> +     * no longer in the intermediate state). Since markers are not valid,
> >>>> +     * this is safe; set_ptes() will see the old, invalid entry and will not
> >>>> +     * attempt to unfold. And the new pte is also invalid so it won't
> >>>> +     * attempt to fold. We shouldn't see this for the 'full' case anyway.
> >>>> +     *
> >>>> +     * The last remaining issue is returning the access/dirty bits. That
> >>>> +     * info could be present in any of the ptes in the contpte block.
> >>>> +     * ptep_get() will gather those bits from across the contpte block. We
> >>>> +     * don't bother doing that here, because we know that the information is
> >>>> +     * used by the core-mm to mark the underlying folio as accessed/dirty.
> >>>> +     * And since the same folio must be underpinning the whole block (that
> >>>> +     * was a requirement for folding in the first place), that information
> >>>> +     * will make it to the folio eventually once all the ptes have been
> >>>> +     * cleared. This approach means we don't have to play games with
> >>>> +     * accumulating and storing the bits. It does mean that any interleaved
> >>>> +     * calls to ptep_get() may lack correct access/dirty information if we
> >>>> +     * have already cleared the pte that happened to store it. The core code
> >>>> +     * does not rely on this though.
> >>>
> >>> even without any other threads running and touching those PTEs, this won't survive
> >>> on some hardware. we expose inconsistent CONTPTEs to hardware, this might result
> >>
> >> No that's not the case; if you read the Arm ARM, the page table is only
> >> considered "misgrogrammed" when *valid* entries within the same contpte block
> >> have different values for the contiguous bit. We are clearing the ptes to zero
> >> here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
> >> (either due to explicit tlbi as you point out below, or due to a concurrent TLB
> >> miss which selects our entry for removal to make space for the new incomming
> >> entry), then it gets an access request for an address in our partially cleared
> >> contpte block the address will either be:
> >>
> >> A) an address for a pte entry we have already cleared, so its invalid and it
> >> will fault (and get serialized behind the PTL).
> >>
> >> or
> >>
> >> B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
> >> entry for the contpte block. But that's ok because the memory still exists
> >> because we haven't yet finished clearing the page table and have not yet issued
> >> the final tlbi.
> >>
> >>
> >>> in crashed firmware even in trustzone, strange&unknown faults to trustzone we have
> >>> seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a part of PTEs
> >>> with dropped CONT but still some other PTEs have CONT, we make hardware totally
> >>> confused.
> >>
> >> I suspect this is because in your case you are "misprogramming" the contpte
> >> block; there are *valid* pte entries within the block that disagree about the
> >> contiguous bit or about various other fields. In this case some HW TLB designs
> >> can do weird things. I suspect in your case, that's resulting in accessing bad
> >> memory space and causing an SError, which is trapped by EL3, and the FW is
> >> probably just panicking at that point.
> >
> > you are probably right. as we met the SError, we became very very
> > cautious. so anytime
> > when we flush tlb for a CONTPTE, we strictly do it by
> > 1. set all 16 ptes to zero
> > 2. flush the whole 16 ptes
>
> But my point is that this sequence doesn't guarrantee that the TLB doesn't read
> the page table half way through the SW clearing the 16 entries; a TLB entry can
> be ejected for other reasons than just issuing a TLBI. So in that case these 2
> flows can be equivalent. Its the fact that we are unsetting the valid bit when
> clearing each pte that guarantees this to be safe.
>
> >
> > in your case, it can be:
> > 1. set pte0 to zero
> > 2. flush pte0
> >
> > TBH, i have never tried this. but it might be safe according to your
> > description.
> >
> >>
> >>>
> >>> zap_pte_range() has a force_flush when tlbbatch is full:
> >>>
> >>>                         if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> >>>                                 force_flush = 1;
> >>>                                 addr += PAGE_SIZE;
> >>>                                 break;
> >>>                         }
> >>>
> >>> this means you can expose partial tlbi/flush directly to hardware while some
> >>> other PTEs are still CONT.
> >>
> >> Yes, but that's also possible even if we have a tight loop that clears down the
> >> contpte block; there could still be another core that issues a tlbi while you're
> >> halfway through that loop, or the HW could happen to evict due to TLB pressure
> >> at any time. The point is, it's safe if you are clearing the pte to an *invalid*
> >> entry.
> >>
> >>>
> >>> on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> >>> on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> >>> those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> >>> than clearing one PTE.
> >>>
> >>> Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> >>> of the large folio:
> >>>
> >>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>                       if (pte_cont(ptent)) {
> >>>                               unsigned long next = pte_cont_addr_end(addr, end);
> >>>
> >>>                               if (next - addr != HPAGE_CONT_PTE_SIZE) {
> >>>                                       __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> >>>                                       /*
> >>>                                        * After splitting cont-pte
> >>>                                        * we need to process pte again.
> >>>                                        */
> >>>                                       goto again_pte;
> >>>                               } else {
> >>>                                       cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
> >>>
> >>>                                       tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> >>>                                       if (unlikely(!page))
> >>>                                               continue;
> >>>
> >>>                                       if (is_huge_zero_page(page)) {
> >>>                                               tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >>>                                               goto cont_next;
> >>>                                       }
> >>>
> >>>                                       rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> >>>                                       page_remove_rmap(page, true);
> >>>                                       if (unlikely(page_mapcount(page) < 0))
> >>>                                               print_bad_pte(vma, addr, ptent, page);
> >>>
> >>>                                       tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >>>                               }
> >>> cont_next:
> >>>                               /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> >>>                               pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> >>>                               addr = next - PAGE_SIZE;
> >>>                               continue;
> >>>                       }
> >>> #endif
> >>>
> >>> this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> >>> it never requires tlb->fullmm at all.
> >>
> >> Yes, but you are benefitting from the fact that contpte is exposed to core-mm
> >> and it is special-casing them at this level. I'm trying to avoid that.
> >
> > I am thinking we can even do this while we don't expose CONTPTE.
> > if zap_pte_range meets a large folio and the zap_range covers the whole
> > folio, we can flush all ptes in this folio and jump to the end of this folio?
> > i mean
> >
> > if (folio head && range_end > folio_end) {
> >          nr = folio_nr_page(folio);
> >          full_flush_nr_ptes()
> >          pte += nr -1;
> >          addr += (nr - 1) * basepage size
> > }
>
> Just because you found a pte that maps a page from a large folio, that doesn't
> mean that all pages from the folio are mapped, and it doesn't mean they are
> mapped contiguously. We have to deal with partial munmap(), partial mremap()
> etc. We could split in these cases (and in future it might be sensible to try),
> but that can fail (due to GUP). So we still have to handle the corner case.

maybe we can check ptes and find they are all mapped (pte_present),
then do a flush_range.
This is actually the most common case. the majority of madv(DONTNEED)
will benefit from
it. if the condition is false, we fallback to your current code.
zap_pte_range always sets present ptes to 0, but swap entry is really
quite different.

>
> But I can imagine doing a batched version of ptep_get_and_clear(), like I did
> for ptep_set_wrprotects(). And I think this would be an improvement.
>
> The reason I haven't done that so far, is because ptep_get_and_clear() returns
> the pte value when it was cleared and that's hard to do if batching due to the
> storage requirement. But perhaps you could just return the logical OR of the
> dirty and young bits across all ptes in the batch. The caller should be able to
> reconstitute the rest if it needs it?
>
> What do you think?
>
> >
> > zap_pte_range is the most frequent behaviour from userspace libc heap
> > as i explained
> > before. libc can call madvise(DONTNEED) the most often. It is crucial
> > to performance.
> >
> > and this way can also help drop your full version by moving to full
> > flushing the whole
> > large folios? and we don't need to depend on fullmm any more?
> >
> >>
> >> I don't think there is any correctness issue here. But there is a problem with
> >> fragility, as raised by Alistair. I have some ideas on potentially how to solve
> >> that. I'm going to try to work on it this afternoon and will post if I get some
> >> confidence that it is a real solution.
> >>
> >> Thanks,
> >> Ryan
> >>
> >>>
> >>> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> >>>                                      unsigned long addr,
> >>>                                      pte_t *ptep,
> >>>                                      bool flush)
> >>> {
> >>>       pte_t orig_pte = ptep_get(ptep);
> >>>
> >>>       CHP_BUG_ON(!pte_cont(orig_pte));
> >>>       CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> >>>       CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> >>>
> >>>       return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> >>> }
> >>>
> >>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> >>>
> >>>> +     */
> >>>> +
> >>>> +    return __ptep_get_and_clear(mm, addr, ptep);
> >>>> +}
> >>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> >>>> +
> >>>
> >
Thanks
Barry