Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

David Hildenbrand <david@xxxxxxxxxx> · Mon, 15 Apr 2024 18:02:26 +0200

The potential problem I see with this is that the Arm ARM doesn't specify which
PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
randomly and this could spuriously increase your check failure rate. In reality
I believe most implementations will update the PTE for the address that caused
the TLB to be populated. But in some cases, you could have eviction (due to
pressure or explicit invalidation) followed by re-population due to faulting on
a different page of the contpte block. In this case you would see this type of
problem too.

But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
potentially false-negatives for access and dirty? Just with a much higher chance
of getting a false-negative. How is this helping?

You are performing an atomic read like GUP-fast wants you to. So there are no
races to worry about like on other architectures: HW might *set* the dirty bit
concurrently, but that's just fine.

But you can still see false-negatives for access and dirty...

Yes.

The whole races you describe with concurrent folding/unfolding/ ... are irrelevant.

And I think I convinced myself that you will only see false-negatives with
today's arm64 ptep_get(). But an order or magnitude fewer than with your
proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of those).

To me that sounds ... much simpler ;) But again, just something I've been
thinking about.

OK so this approach upgrades my "I'm fairly sure we never see false-positives"
to "we definitely never see false-positives". But it certainly increases the
quantity of false-negatives.

Yes.

The reuse of pte_get_lockless() outside GUP code might not have been the wisest
choice.

If you want to go down the ptep_get_gup_fast() route, you've still got to be
able to spec it, and I think it will land pretty close to my most recent stab at
respec'ing ptep_get_lockless() a couple of replies up on this thread.

Where would your proposal leave the KVM use case? If you call it
ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
be left with ptep_get()...

It's using GUP-fast.

Sorry this thread is getting so long. Just to summarise, I think there are
currently 3 solutions on the table:

   - ptep_get_lockless() remains as is
   - ptep_get_lockless() wraps ptep_get()
   - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)

Based on discussion so far, that's also the order of my preference.

(1) seems like the easiest thing to do.

Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()?

Well, you sent that patch series with "that aims to reduce the cost and 
complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve 
that. :)

--
Cheers,

David / dhildenb