Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

Ryan Roberts <ryan.roberts@xxxxxxx> · Thu, 11 Apr 2024 10:45:38 +0100

On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
> 
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
> 
> This stuff is killing me, sorry for the lengthy reply ...

No problem - thanks for taking the time to think it through and reply with such
clarity. :)

> 
>>
>> So I've been looking at all this again, and getting myself even more confused.
>>
>> I believe there are 2 classes of ptep_get_lockless() caller:
>>
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
> 
> Likely only completely lockless page table walkers where we *cannot* recheck
> under PTL is special. Essentially where we disable interrupts and rely on TLB
> flushes to sync against concurrent changes.

Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is
arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
you make that clear further down.

> 
> Let's take a look where ptep_get_lockless() comes from:
> 
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Date:   Fri Nov 13 11:41:40 2020 +0100
> 
>     mm/gup: Provide gup_get_pte() more generic
> 
>     In order to write another lockless page-table walker, we need
>     gup_get_pte() exposed. While doing that, rename it to
>     ptep_get_lockless() to match the existing ptep_get() naming.
> 
> 
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
> 
> "With get_user_pages_fast(), we walk down the pagetables without taking any
> locks.  For this we would like to load the pointers atomically, but sometimes
> that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What we
> do have is the guarantee that a PTE will only either go from not present to
> present, or present to not present or both -- it will not switch to a completely
> different present page without a TLB flush in between; something hat we are
> blocking by holding interrupts off."
> 
> Later, we added support for GUP-fast that introduced the !TLB variant:
> 
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <steve.capper@xxxxxxxxxx>
> Date:   Thu Oct 9 15:29:14 2014 -0700
> 
>     mm: introduce a general RCU get_user_pages_fast()
> 
> With the pattern
> 
> /*
>  * In the line below we are assuming that the pte can be read
>  * atomically. If this is not the case for your architecture,
>  * please wrap this in a helper function!
>  *
>  * for an example see gup_get_pte in arch/x86/mm/gup.c
>  */
> pte_t pte = ACCESS_ONCE(*ptep);
> ...
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> ...
> 
> 
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
> gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was
> required to be sane, this the lengthy comment above ptep_get_lockless() that
> talks about TLB flushes.
> 
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
> full of details about TLB flushes syncing against GUP-fast. But as you note, we
> use it even in contexts where we don't disable interrupts and the TLB flush
> can't help.
> 
> We don't disable interrupts during page faults ... so most of the things
> described in ptep_get_lockless() don't really apply.
> 
> That's also the reason why ...
>>
>>                  vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> -               vmf->orig_pte = *vmf->pte;
>> +               vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>                  vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>> -               /*
>> -                * some architectures can have larger ptes than wordsize,
>> -                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> -                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> -                * accesses.  The code below just needs a consistent view
>> -                * for the ifs and we later double check anyway with the
>> -                * ptl lock held. So here a barrier will do.
>> -                */
>> -               barrier();
>>                  if (pte_none(vmf->orig_pte)) {
> 
> ... that code was in place. We would possibly read garbage PTEs, but would
> recheck *under PTL* (where the PTE can no longer change) that what we read
> wasn't garbage and didn't change.

Agreed.

> 
>>
>> -- 
>>
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty, uffd-wp or
>> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
>> just by calling ptep_get().
> 
> Let's focus on access+dirty for now ;)
> 
>>
>> -- 
>>
>> So, I'm making the bold claim that it was never neccessary to specialize
>> pte_get_lockless() on arm64, and I *think* we could just delete it so that
>> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
>> aim without needing to introduce "norecency" variants.
>>
>> Additionally I propose documenting ptep_get_lockless() to describe the set of
>> fields that are guarranteed to be self-consistent and the remaining fields which
>> are self-consistent only with best-effort.
>>
>> Could it be this easy? My head is hurting...
> 
> I think what has to happen is:
> 
> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
> are no races. No removal/addition of access/dirty bits etc.

Today's arm64 ptep_get() guarantees this.

> 
> (2) Lockless page table walkers that later verify under the PTL can handle
> serious "garbage PTEs". This is our page fault handler.

This isn't really a property of a ptep_get_lockless(); its a statement about a
class of users. I agree with the statement.

> 
> (3) Lockless page table walkers that cannot verify under PTL cannot handle
> arbitrary garbage PTEs. This is GUP-fast. Two options:
> 
> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
> required. This is the common case. HW might concurrently set access/dirty bits,
> so we can race with that. But we don't read garbage.

Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.

But the point I was trying to make is that GUP-fast does not actually care about
*all* the fields being consistent (e.g. access/dirty). So we could spec
pte_get_lockless() to say that "all fields in the returned pte are guarranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occured".

This could mean that the access/dirty state *does* change for a given page while
GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
that failing to detect this is benign.

Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
to (1), so either it returns the same pte or it returns a different pte or
garbage. But that garbage could just happen to be the same as the originally
obtained pte. So in that case, it would have a false match. I think this needs
to be changed to ptep_get_lockless()?

> 
> (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to
> read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious
> PTE changes (e.g., present -> present). This is weird x86-PAE.
> 
> 
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
> 
> My head is hurting ...
>