Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

Ryan Roberts <ryan.roberts@xxxxxxx> · Mon, 15 Apr 2024 14:30:48 +0100

On 15/04/2024 11:57, David Hildenbrand wrote:
> On 15.04.24 11:28, Ryan Roberts wrote:
>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>
>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>> "lockless walkers that never take the PTL".
>>>>
>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>>>> you make that clear further down.
>>>
>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables
>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>> [...]
>>>
>>>>>>
>>>>>> Could it be this easy? My head is hurting...
>>>>>
>>>>> I think what has to happen is:
>>>>>
>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>> there
>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>
>>>> Today's arm64 ptep_get() guarantees this.
>>>>
>>>>>
>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>
>>>> This isn't really a property of a ptep_get_lockless(); its a statement about a
>>>> class of users. I agree with the statement.
>>>
>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>> fault handlers. Well, mostly "not GUP".
>>>
>>>>
>>>>>
>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>
>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>> bits,
>>>>> so we can race with that. But we don't read garbage.
>>>>
>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>> ptep_get_lockless() implementation.
>>>>
>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>> about
>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>> pte_get_lockless() to say that "all fields in the returned pte are guarranteed
>>>> to be self-consistent except for access and dirty information, which may be
>>>> inconsistent if a racing modification occured".
>>>
>>> We *might* have KVM in the future want to check that a PTE is dirty, such that
>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not there
>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>
>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@xxxxxxxxxx
>>>
>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>> properly notify us when the PTE would lose its dirty bits.
>>>
>>> But getting false-positive dirty bits would be problematic.
>>>
>>>>
>>>> This could mean that the access/dirty state *does* change for a given page
>>>> while
>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>>>> that failing to detect this is benign.
>>>
>>> I mean, HW could just set the dirty/access bit immediately after the check. So
>>> if HW concurrently sets the bit and we don't observe that change when we
>>> recheck, I think that would be perfectly fine.
>>
>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>> soft-dirty or uffd-wp).
>>
>> But if you don't want to change the ptep_get_lockless() spec to explicitly allow
>> this (because you have the KVM use case where false-positive dirty is
>> problematic), then I think we are stuck with ptep_get_lockless() as implemented
>> for arm64 today.
> 
> At least regarding the dirty bit, we'd have to guarantee that if
> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
> would be able to catch that.
> 
> Would that be possible?

Hmm maybe. My head hurts. Let me try to work through some examples...

Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in
PTE0 for the contpte block, and it is dirty.

Now let's say there are 2 racing threads:

  - ThreadA is doing a GUP-fast for PTE3
  - ThreadB is remapping order-0 FolioB at PTE0

(ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
example - today's arm64 ptep_get_lockless() can handle the below correctly).

ThreadA					ThreadB
=======					=======

gup_pte_range()
  pte1 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
					mmap(PTE0)
					  clear_pte(PTE0)
					    unfold(PTE0 - PTE3)
					      WRITE_ONCE(PTE0, 0)
					      WRITE_ONCE(PTE1, 0)
					      WRITE_ONCE(PTE2, 0)
    READ_ONCE(PTE0) (for a/d) << CLEAN!!
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  <do speculative work with pte1 content>
  pte2 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
    READ_ONCE(PTE0) (for a/d)
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  true = pte_same(pte1, pte2)
					      WRITE_ONCE(PTE3, 0)
					      TLBI
					      WRITE_ONCE(PTE0, <orig & ~CONT>)
					      WRITE_ONCE(PTE1, <orig & ~CONT>)
					      WRITE_ONCE(PTE2, <orig & ~CONT>)
					      WRITE_ONCE(PTE3, <orig & ~CONT>)
					    WRITE_ONCE(PTE0, 0)
					  set_pte_at(PTE0, <new>)

This example shows how a *false-negative* can be returned for the dirty state,
which isn't detected by the check.

I've been unable to come up with an example where a *false-positive* can be
returned for dirty state without the second ptep_get_lockless() noticing. In
this second example, let's assume everything is the same execpt FolioA is
initially clean:

ThreadA					ThreadB
=======					=======

gup_pte_range()
  pte1 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
					mmap(PTE0)
					  clear_pte(PTE0)
					    unfold(PTE0 - PTE3)
					      WRITE_ONCE(PTE0, 0)
					      WRITE_ONCE(PTE1, 0)
					      WRITE_ONCE(PTE2, 0)
					      WRITE_ONCE(PTE3, 0)
					      TLBI
					      WRITE_ONCE(PTE0, <orig & ~CONT>)
					      WRITE_ONCE(PTE1, <orig & ~CONT>)
					      WRITE_ONCE(PTE2, <orig & ~CONT>)
					      WRITE_ONCE(PTE3, <orig & ~CONT>)
					    WRITE_ONCE(PTE0, 0)
					  set_pte_at(PTE0, <new>)
					write to FolioB - HW sets PTE0's dirty
    READ_ONCE(PTE0) (for a/d) << DIRTY!!
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  <do speculative work with pte1 content>
  pte2 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
    READ_ONCE(PTE0) (for a/d)
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  false = pte_same(pte1, pte2) << So this fails

The only way I can see false-positive not being caught in the second example is
if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
But these lockless table walkers are already suseptible to that.

I think all the same arguments can be extended to the access bit.

For me this is all getting rather subtle and difficult to reason about and even
harder to spec in a comprehensible way. The best I could come up with is:

"All fields in the returned pte are guarranteed to be self-consistent except for
access and dirty information, which may be inconsistent if a racing modification
occured. Additionally it is guranteed that false-positive access and/or dirty
information is not possible if 2 calls are made and both ptes are the same. Only
false-negative access and/or dirty information is possible in this scenario."

which is starting to sound bonkers. Personally I think we are better off at this
point, just keeping today's arm64 ptep_get_lockless().

Thanks,
Ryan