Hi David, Sorry for the slow reply on this; its was due to a combination of thinking a bit more about the options here and being out on holiday. On 15/04/2024 17:02, David Hildenbrand wrote: >>>> The potential problem I see with this is that the Arm ARM doesn't specify which >>>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them >>>> randomly and this could spuriously increase your check failure rate. In reality >>>> I believe most implementations will update the PTE for the address that caused >>>> the TLB to be populated. But in some cases, you could have eviction (due to >>>> pressure or explicit invalidation) followed by re-population due to faulting on >>>> a different page of the contpte block. In this case you would see this type of >>>> problem too. >>>> >>>> But ultimately, isn't this basically equivalent to ptep_get_lockless() >>>> returning >>>> potentially false-negatives for access and dirty? Just with a much higher >>>> chance >>>> of getting a false-negative. How is this helping? >>> >>> You are performing an atomic read like GUP-fast wants you to. So there are no >>> races to worry about like on other architectures: HW might *set* the dirty bit >>> concurrently, but that's just fine. >> >> But you can still see false-negatives for access and dirty... > > Yes. > >> >>> >>> The whole races you describe with concurrent folding/unfolding/ ... are >>> irrelevant. >> >> And I think I convinced myself that you will only see false-negatives with >> today's arm64 ptep_get(). But an order or magnitude fewer than with your >> proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of >> those). >> >>> >>> To me that sounds ... much simpler ;) But again, just something I've been >>> thinking about. >> >> OK so this approach upgrades my "I'm fairly sure we never see false-positives" >> to "we definitely never see false-positives". But it certainly increases the >> quantity of false-negatives. > > Yes. > >> >>> >>> The reuse of pte_get_lockless() outside GUP code might not have been the wisest >>> choice. >>> >> >> If you want to go down the ptep_get_gup_fast() route, you've still got to be >> able to spec it, and I think it will land pretty close to my most recent stab at >> respec'ing ptep_get_lockless() a couple of replies up on this thread. >> >> Where would your proposal leave the KVM use case? If you call it >> ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would >> be left with ptep_get()... > > It's using GUP-fast. > >> >> Sorry this thread is getting so long. Just to summarise, I think there are >> currently 3 solutions on the table: >> >> - ptep_get_lockless() remains as is >> - ptep_get_lockless() wraps ptep_get() >> - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename) >> >> Based on discussion so far, that's also the order of my preference. > > (1) seems like the easiest thing to do. Yes, I'm very much in favour of easy. > >> >> Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()? > > Well, you sent that patch series with "that aims to reduce the cost and > complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :) Touche! I'd half forgotten that we were having this conversation in the context of this series! I guess your ptep_get_gup_fast() approach is very similar to ptep_get_lockless_norecency()... So we are back to the beginning :) But ultimately I've come to the conclusion that it is easy to reason about the current arm64 ptep_get_lockless() implementation and see that its correct. The other options both have their drawbacks. Yes, there is a loop in the current implementation that would be nice to get rid of, but I don't think it is really any worse than the cmpxchg loops we already have in other helpers. I'm not planning to persue this any further. Thanks for the useful discussion (as always).