Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

David Hildenbrand <david@xxxxxxxxxx> · Mon, 15 Apr 2024 17:22:30 +0200

On 15.04.24 17:17, Ryan Roberts wrote:
On 15/04/2024 15:58, David Hildenbrand wrote:
On 15.04.24 16:34, Ryan Roberts wrote:
On 15/04/2024 15:23, David Hildenbrand wrote:
On 15.04.24 15:30, Ryan Roberts wrote:
On 15/04/2024 11:57, David Hildenbrand wrote:
On 15.04.24 11:28, Ryan Roberts wrote:
On 12/04/2024 21:16, David Hildenbrand wrote:

Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is
arch-specifc. That's not how arm64 does it (the hw broadcasts the
TLBIs). But
you make that clear further down.

Yes, but disabling interrupts is also required for RCU-freeing of page
tables
such that they can be walked safely. The TLB flush IPI is arch-specific and
indeed to sync against PTE invalidation (before generic GUP-fast).
[...]

Could it be this easy? My head is hurting...

I think what has to happen is:

(1) pte_get_lockless() must return the same value as ptep_get() as long as
there
are no races. No removal/addition of access/dirty bits etc.

Today's arm64 ptep_get() guarantees this.

(2) Lockless page table walkers that later verify under the PTL can handle
serious "garbage PTEs". This is our page fault handler.

This isn't really a property of a ptep_get_lockless(); its a statement
about a
class of users. I agree with the statement.

Yes. That's a requirement for the user of ptep_get_lockless(), such as page
fault handlers. Well, mostly "not GUP".

(3) Lockless page table walkers that cannot verify under PTL cannot handle
arbitrary garbage PTEs. This is GUP-fast. Two options:

(3a) pte_get_lockless() can atomically read the PTE: We re-check later if
the
atomically-read PTE is still unchanged (without PTL). No IPI for TLB
flushes
required. This is the common case. HW might concurrently set access/dirty
bits,
so we can race with that. But we don't read garbage.

Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.

But the point I was trying to make is that GUP-fast does not actually care
about
*all* the fields being consistent (e.g. access/dirty). So we could spec
pte_get_lockless() to say that "all fields in the returned pte are
guarranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occured".

We *might* have KVM in the future want to check that a PTE is dirty, such
that
we can only allow dirty PTEs to be writable in a secondary MMU. That's not
there
yet, but one thing I was discussing on the list recently. Burried in:

https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@xxxxxxxxxx

We wouldn't care about racing modifications, as long as MMU notifiers will
properly notify us when the PTE would lose its dirty bits.

But getting false-positive dirty bits would be problematic.

This could mean that the access/dirty state *does* change for a given page
while
GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
*think*
that failing to detect this is benign.

I mean, HW could just set the dirty/access bit immediately after the
check. So
if HW concurrently sets the bit and we don't observe that change when we
recheck, I think that would be perfectly fine.

Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
soft-dirty or uffd-wp).

But if you don't want to change the ptep_get_lockless() spec to explicitly
allow
this (because you have the KVM use case where false-positive dirty is
problematic), then I think we are stuck with ptep_get_lockless() as
implemented
for arm64 today.

At least regarding the dirty bit, we'd have to guarantee that if
ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
would be able to catch that.

Would that be possible?

Hmm maybe. My head hurts. Let me try to work through some examples...

Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
stored in
PTE0 for the contpte block, and it is dirty.

Now let's say there are 2 racing threads:

     - ThreadA is doing a GUP-fast for PTE3
     - ThreadB is remapping order-0 FolioB at PTE0

(ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
example - today's arm64 ptep_get_lockless() can handle the below correctly).

ThreadA                    ThreadB
=======                    =======

gup_pte_range()
     pte1 = ptep_get_lockless(PTE3)
       READ_ONCE(PTE3)
                      mmap(PTE0)
                        clear_pte(PTE0)
                          unfold(PTE0 - PTE3)
                            WRITE_ONCE(PTE0, 0)
                            WRITE_ONCE(PTE1, 0)
                            WRITE_ONCE(PTE2, 0)
       READ_ONCE(PTE0) (for a/d) << CLEAN!!
       READ_ONCE(PTE1) (for a/d)
       READ_ONCE(PTE2) (for a/d)
       READ_ONCE(PTE3) (for a/d)
     <do speculative work with pte1 content>
     pte2 = ptep_get_lockless(PTE3)
       READ_ONCE(PTE3)
       READ_ONCE(PTE0) (for a/d)
       READ_ONCE(PTE1) (for a/d)
       READ_ONCE(PTE2) (for a/d)
       READ_ONCE(PTE3) (for a/d)
     true = pte_same(pte1, pte2)
                            WRITE_ONCE(PTE3, 0)
                            TLBI
                            WRITE_ONCE(PTE0, <orig & ~CONT>)
                            WRITE_ONCE(PTE1, <orig & ~CONT>)
                            WRITE_ONCE(PTE2, <orig & ~CONT>)
                            WRITE_ONCE(PTE3, <orig & ~CONT>)
                          WRITE_ONCE(PTE0, 0)
                        set_pte_at(PTE0, <new>)

This example shows how a *false-negative* can be returned for the dirty state,
which isn't detected by the check.

I've been unable to come up with an example where a *false-positive* can be
returned for dirty state without the second ptep_get_lockless() noticing. In
this second example, let's assume everything is the same execpt FolioA is
initially clean:

ThreadA                    ThreadB
=======                    =======

gup_pte_range()
     pte1 = ptep_get_lockless(PTE3)
       READ_ONCE(PTE3)
                      mmap(PTE0)
                        clear_pte(PTE0)
                          unfold(PTE0 - PTE3)
                            WRITE_ONCE(PTE0, 0)
                            WRITE_ONCE(PTE1, 0)
                            WRITE_ONCE(PTE2, 0)
                            WRITE_ONCE(PTE3, 0)
                            TLBI
                            WRITE_ONCE(PTE0, <orig & ~CONT>)
                            WRITE_ONCE(PTE1, <orig & ~CONT>)
                            WRITE_ONCE(PTE2, <orig & ~CONT>)
                            WRITE_ONCE(PTE3, <orig & ~CONT>)
                          WRITE_ONCE(PTE0, 0)
                        set_pte_at(PTE0, <new>)
                      write to FolioB - HW sets PTE0's dirty
       READ_ONCE(PTE0) (for a/d) << DIRTY!!
       READ_ONCE(PTE1) (for a/d)
       READ_ONCE(PTE2) (for a/d)
       READ_ONCE(PTE3) (for a/d)
     <do speculative work with pte1 content>
     pte2 = ptep_get_lockless(PTE3)
       READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
       READ_ONCE(PTE0) (for a/d)
       READ_ONCE(PTE1) (for a/d)
       READ_ONCE(PTE2) (for a/d)
       READ_ONCE(PTE3) (for a/d)
     false = pte_same(pte1, pte2) << So this fails

The only way I can see false-positive not being caught in the second example is
if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
But these lockless table walkers are already suseptible to that.

I think all the same arguments can be extended to the access bit.

For me this is all getting rather subtle and difficult to reason about and even
harder to spec in a comprehensible way. The best I could come up with is:

"All fields in the returned pte are guarranteed to be self-consistent except
for
access and dirty information, which may be inconsistent if a racing
modification
occured. Additionally it is guranteed that false-positive access and/or dirty
information is not possible if 2 calls are made and both ptes are the same.
Only
false-negative access and/or dirty information is possible in this scenario."

which is starting to sound bonkers. Personally I think we are better off at
this
point, just keeping today's arm64 ptep_get_lockless().

Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
would sync against GUP-fast?

No, the broadcast is in HW. There is no IPI.

Okay ...

I agree that the semantics are a bit weird, but if we could get rid of
ptep_get_lockless() on arm64, that would also be nice.

Something I've been thinking of ... just to share what I've had in mind. The two
types of users we currently have are:

(1) ptep_get_lockless() followed by ptep_get() check under PTL.

(2) ptep_get_lockless() followed by ptep_get() check without PTL.

What if we had the following instead:

(1) ptep_get_lockless() followed by ptep_get() check under PTL.

(2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
     PTL.

And on arm64 let

(1) ptep_get_lockless() be ptep_get()

(2) ptep_get_gup_fast() be __ptep_get().

That would mean, that (2) would not care if another cont-pte is dirty, because
we don't collect access+dirty bits. That way, we avoid any races with concurrent
unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes()
would have to set all cont-PTEs dirty, even if any of these already is dirty.

I don't think the "problematic" thing is actually a problem; set_ptes() will
always set the dirty bit to the same value for all ptes it covers (and if you do
set_ptes() on a partial contpte block, it will be unfolded first). Although I
suspect I've misunderstood what you meant there...

It's more code like that following that I am concerned about.

if (pte_dirty()) {
	/* Great, nothing to do */
} else
	mte_mkdirty();
	set_ptes();
	...
}

The potential problem I see with this is that the Arm ARM doesn't specify which
PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
randomly and this could spuriously increase your check failure rate. In reality
I believe most implementations will update the PTE for the address that caused
the TLB to be populated. But in some cases, you could have eviction (due to
pressure or explicit invalidation) followed by re-population due to faulting on
a different page of the contpte block. In this case you would see this type of
problem too.

But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
potentially false-negatives for access and dirty? Just with a much higher chance
of getting a false-negative. How is this helping?

You are performing an atomic read like GUP-fast wants you to. So there 
are no races to worry about like on other architectures: HW might *set* 
the dirty bit concurrently, but that's just fine.

The whole races you describe with concurrent folding/unfolding/ ... are 
irrelevant.

To me that sounds ... much simpler ;) But again, just something I've 
been thinking about.

The reuse of pte_get_lockless() outside GUP code might not have been the 
wisest choice.

--
Cheers,

David / dhildenb