On 15/04/2024 16:22, David Hildenbrand wrote:
> On 15.04.24 17:17, Ryan Roberts wrote:
>> On 15/04/2024 15:58, David Hildenbrand wrote:
>>> On 15.04.24 16:34, Ryan Roberts wrote:
>>>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> Yes agreed - 2 types: "lockless walkers that later recheck under
>>>>>>>>>> the PTL" and "lockless walkers that never take the PTL".
>>>>>>>>>>
>>>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing
>>>>>>>>>> is arch-specific. That's not how arm64 does it (the HW broadcasts
>>>>>>>>>> the TLBIs). But you make that clear further down.
>>>>>>>>>
>>>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of
>>>>>>>>> page tables such that they can be walked safely. The TLB flush IPI
>>>>>>>>> is arch-specific and indeed to sync against PTE invalidation
>>>>>>>>> (before generic GUP-fast).
>>>>>>>>> [...]
>>>>>>>>>>>>
>>>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>>>
>>>>>>>>>>> I think what has to happen is:
>>>>>>>>>>>
>>>>>>>>>>> (1) ptep_get_lockless() must return the same value as ptep_get()
>>>>>>>>>>> as long as there are no races. No removal/addition of
>>>>>>>>>>> access/dirty bits etc.
>>>>>>>>>>
>>>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>>>
>>>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL
>>>>>>>>>>> can handle serious "garbage PTEs". This is our page fault
>>>>>>>>>>> handler.
>>>>>>>>>>
>>>>>>>>>> This isn't really a property of ptep_get_lockless(); it's a
>>>>>>>>>> statement about a class of users. I agree with the statement.
>>>>>>>>>
>>>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(),
>>>>>>>>> such as page fault handlers. Well, mostly "not GUP".
>>>>>>>>>
>>>>>>>>>>> (3) Lockless page table walkers that cannot verify under the PTL
>>>>>>>>>>> cannot handle arbitrary garbage PTEs. This is GUP-fast. Two
>>>>>>>>>>> options:
>>>>>>>>>>>
>>>>>>>>>>> (3a) ptep_get_lockless() can atomically read the PTE: we
>>>>>>>>>>> re-check later if the atomically-read PTE is still unchanged
>>>>>>>>>>> (without the PTL). No IPI for TLB flushes required. This is the
>>>>>>>>>>> common case. HW might concurrently set access/dirty bits, so we
>>>>>>>>>>> can race with that. But we don't read garbage.
>>>>>>>>>>
>>>>>>>>>> Today's arm64 ptep_get() cannot guarantee that the access/dirty
>>>>>>>>>> bits are consistent for contpte ptes. That's the bit that
>>>>>>>>>> complicates the current ptep_get_lockless() implementation.
>>>>>>>>>>
>>>>>>>>>> But the point I was trying to make is that GUP-fast does not
>>>>>>>>>> actually care about *all* the fields being consistent (e.g.
>>>>>>>>>> access/dirty). So we could spec ptep_get_lockless() to say that
>>>>>>>>>> "all fields in the returned pte are guaranteed to be
>>>>>>>>>> self-consistent except for access and dirty information, which
>>>>>>>>>> may be inconsistent if a racing modification occurred".
>>>>>>>>>
>>>>>>>>> We *might* have KVM in the future want to check that a PTE is
>>>>>>>>> dirty, such that we can only allow dirty PTEs to be writable in a
>>>>>>>>> secondary MMU.
>>>>>>>>> That's not there yet, but it's something I was discussing on the
>>>>>>>>> list recently. Buried in:
>>>>>>>>>
>>>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@xxxxxxxxxx
>>>>>>>>>
>>>>>>>>> We wouldn't care about racing modifications, as long as MMU
>>>>>>>>> notifiers will properly notify us when the PTE would lose its
>>>>>>>>> dirty bits.
>>>>>>>>>
>>>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>>>
>>>>>>>>>> This could mean that the access/dirty state *does* change for a
>>>>>>>>>> given page while GUP-fast is walking it, but GUP-fast *doesn't*
>>>>>>>>>> detect that change. I *think* that failing to detect this is
>>>>>>>>>> benign.
>>>>>>>>>
>>>>>>>>> I mean, HW could just set the dirty/access bit immediately after
>>>>>>>>> the check. So if HW concurrently sets the bit and we don't observe
>>>>>>>>> that change when we recheck, I think that would be perfectly fine.
>>>>>>>>
>>>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about
>>>>>>>> access/dirty (or soft-dirty or uffd-wp).
>>>>>>>>
>>>>>>>> But if you don't want to change the ptep_get_lockless() spec to
>>>>>>>> explicitly allow this (because you have the KVM use case where a
>>>>>>>> false-positive dirty is problematic), then I think we are stuck
>>>>>>>> with ptep_get_lockless() as implemented for arm64 today.
>>>>>>>
>>>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>>>> ptep_get_lockless() returns a false-positive dirty bit, the PTE
>>>>>>> recheck would be able to catch that.
>>>>>>>
>>>>>>> Would that be possible?
>>>>>>
>>>>>> Hmm, maybe. My head hurts. Let me try to work through some
>>>>>> examples...
>>>>>>
>>>>>> Let's imagine for this example that a contpte block is 4 PTEs. Let's
>>>>>> say PTEs 0, 1, 2 and 3 initially contpte-map an order-2 mTHP, FolioA.
>>>>>> The dirty state is stored in PTE0 for the contpte block, and it is
>>>>>> dirty.
>>>>>>
>>>>>> Now let's say there are 2 racing threads:
>>>>>>
>>>>>>   - ThreadA is doing a GUP-fast for PTE3
>>>>>>   - ThreadB is remapping order-0 FolioB at PTE0
>>>>>>
>>>>>> (ptep_get_lockless() below is actually arm64's ptep_get() for the
>>>>>> sake of the example - today's arm64 ptep_get_lockless() can handle
>>>>>> the below correctly.)
>>>>>>
>>>>>> ThreadA                                ThreadB
>>>>>> =======                                =======
>>>>>>
>>>>>> gup_pte_range()
>>>>>>   pte1 = ptep_get_lockless(PTE3)
>>>>>>     READ_ONCE(PTE3)
>>>>>>                                        mmap(PTE0)
>>>>>>                                          clear_pte(PTE0)
>>>>>>                                            unfold(PTE0 - PTE3)
>>>>>>                                              WRITE_ONCE(PTE0, 0)
>>>>>>                                              WRITE_ONCE(PTE1, 0)
>>>>>>                                              WRITE_ONCE(PTE2, 0)
>>>>>>     READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>>>>     READ_ONCE(PTE1) (for a/d)
>>>>>>     READ_ONCE(PTE2) (for a/d)
>>>>>>     READ_ONCE(PTE3) (for a/d)
>>>>>>   <do speculative work with pte1 content>
>>>>>>   pte2 = ptep_get_lockless(PTE3)
>>>>>>     READ_ONCE(PTE3)
>>>>>>     READ_ONCE(PTE0) (for a/d)
>>>>>>     READ_ONCE(PTE1) (for a/d)
>>>>>>     READ_ONCE(PTE2) (for a/d)
>>>>>>     READ_ONCE(PTE3) (for a/d)
>>>>>>   true = pte_same(pte1, pte2)
>>>>>>                                              WRITE_ONCE(PTE3, 0)
>>>>>>                                              TLBI
>>>>>>                                              WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>>                                              WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>>                                              WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>>                                              WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>>                                            WRITE_ONCE(PTE0, 0)
>>>>>>                                          set_pte_at(PTE0, <new>)
>>>>>>
>>>>>> This example shows how a *false-negative* can be returned for the
>>>>>> dirty state, which isn't detected by the check.
>>>>>>
>>>>>> I've been unable to come up with an example where a *false-positive*
>>>>>> can be returned for the dirty state without the second
>>>>>> ptep_get_lockless() noticing. In this second example, let's assume
>>>>>> everything is the same except FolioA is initially clean:
>>>>>>
>>>>>> ThreadA                                ThreadB
>>>>>> =======                                =======
>>>>>>
>>>>>> gup_pte_range()
>>>>>>   pte1 = ptep_get_lockless(PTE3)
>>>>>>     READ_ONCE(PTE3)
>>>>>>                                        mmap(PTE0)
>>>>>>                                          clear_pte(PTE0)
>>>>>>                                            unfold(PTE0 - PTE3)
>>>>>>                                              WRITE_ONCE(PTE0, 0)
>>>>>>                                              WRITE_ONCE(PTE1, 0)
>>>>>>                                              WRITE_ONCE(PTE2, 0)
>>>>>>                                              WRITE_ONCE(PTE3, 0)
>>>>>>                                              TLBI
>>>>>>                                              WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>>                                              WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>>                                              WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>>                                              WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>>                                            WRITE_ONCE(PTE0, 0)
>>>>>>                                          set_pte_at(PTE0, <new>)
>>>>>>                                        write to FolioB - HW sets PTE0's dirty
>>>>>>     READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>>>>     READ_ONCE(PTE1) (for a/d)
>>>>>>     READ_ONCE(PTE2) (for a/d)
>>>>>>     READ_ONCE(PTE3) (for a/d)
>>>>>>   <do speculative work with pte1 content>
>>>>>>   pte2 = ptep_get_lockless(PTE3)
>>>>>>     READ_ONCE(PTE3) << BUT THIS IS FOR FolioB
>>>>>>     READ_ONCE(PTE0) (for a/d)
>>>>>>     READ_ONCE(PTE1) (for a/d)
>>>>>>     READ_ONCE(PTE2) (for a/d)
>>>>>>     READ_ONCE(PTE3) (for a/d)
>>>>>>   false = pte_same(pte1, pte2) << So this fails
>>>>>>
>>>>>> The only way I can see the false-positive not being caught in the
>>>>>> second example is if ThreadB subsequently remaps the original folio,
>>>>>> so you have an ABA scenario. But these lockless table walkers are
>>>>>> already susceptible to that.
>>>>>>
>>>>>> I think all the same arguments can be extended to the access bit.
>>>>>>
>>>>>> For me this is all getting rather subtle and difficult to reason
>>>>>> about, and even harder to spec in a comprehensible way. The best I
>>>>>> could come up with is:
>>>>>>
>>>>>> "All fields in the returned pte are guaranteed to be self-consistent
>>>>>> except for access and dirty information, which may be inconsistent
>>>>>> if a racing modification occurred. Additionally it is guaranteed
>>>>>> that false-positive access and/or dirty information is not possible
>>>>>> if 2 calls are made and both ptes are the same. Only false-negative
>>>>>> access and/or dirty information is possible in this scenario."
>>>>>>
>>>>>> which is starting to sound bonkers. Personally I think we are better
>>>>>> off at this point just keeping today's arm64 ptep_get_lockless().
>>>>>
>>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB
>>>>> flush that would sync against GUP-fast?
>>>>
>>>> No, the broadcast is in HW. There is no IPI.
>>>
>>> Okay ...
>>>
>>> I agree that the semantics are a bit weird, but if we could get rid of
>>> ptep_get_lockless() on arm64, that would also be nice.
>>>
>>> Something I've been thinking of ... just to share what I've had in
>>> mind. The two types of users we currently have are:
>>>
>>> (1) ptep_get_lockless() followed by a ptep_get() check under the PTL.
>>>
>>> (2) ptep_get_lockless() followed by a ptep_get() check without the PTL.
>>>
>>> What if we had the following instead:
>>>
>>> (1) ptep_get_lockless() followed by a ptep_get() check under the PTL.
>>>
>>> (2) ptep_get_gup_fast() followed by a ptep_get_gup_fast() check without
>>> the PTL.
>>>
>>> And on arm64 let
>>>
>>> (1) ptep_get_lockless() be ptep_get()
>>>
>>> (2) ptep_get_gup_fast() be __ptep_get().
>>>
>>> That would mean that (2) would not care if another cont-pte is dirty,
>>> because we don't collect access+dirty bits. That way, we avoid any
>>> races with concurrent unfolding etc. The only "problematic" thing is
>>> that pte_mkdirty() -> set_ptes() would have to set all cont-PTEs dirty,
>>> even if any of these already is dirty.
>>
>> I don't think the "problematic" thing is actually a problem; set_ptes()
>> will always set the dirty bit to the same value for all ptes it covers
>> (and if you do set_ptes() on a partial contpte block, it will be
>> unfolded first). Although I suspect I've misunderstood what you meant
>> there...
>
> It's more code like the following that I am concerned about:
>
> 	if (pte_dirty()) {
> 		/* Great, nothing to do */
> 	} else {
> 		pte_mkdirty();
> 		set_ptes();
> 		...
> 	}

OK I see, so you're worried about unnecessary unfolding that the
false-negative dirty reporting could cause? I think the best solution
there would be for the core to use the clear_young_dirty_ptes(CYDP_CLEAR_DIRTY)
API that Lance adds in his series at [1]. That would avoid any unfolding
and just dirty all contpte block(s) touched by the range.

[1] https://lore.kernel.org/linux-mm/20240413002219.71246-1-ioworker0@xxxxxxxxx/

>> The potential problem I see with this is that the Arm ARM doesn't
>> specify which PTE of a contpte block the HW stores a/d in. So the HW
>> _could_ update them randomly, and this could spuriously increase your
>> check failure rate. In reality I believe most implementations will
>> update the PTE for the address that caused the TLB entry to be
>> populated. But in some cases you could have eviction (due to pressure
>> or explicit invalidation) followed by re-population due to faulting on
>> a different page of the contpte block. In this case you would see this
>> type of problem too.
>>
>> But ultimately, isn't this basically equivalent to ptep_get_lockless()
>> returning potential false-negatives for access and dirty? Just with a
>> much higher chance of getting a false-negative. How is this helping?
>
> You are performing an atomic read like GUP-fast wants you to. So there
> are no races to worry about like on other architectures: HW might *set*
> the dirty bit concurrently, but that's just fine.

But you can still see false-negatives for access and dirty...

> The whole races you describe with concurrent folding/unfolding/... are
> irrelevant.

And I think I convinced myself that you will only see false-negatives with
today's arm64 ptep_get(). But an order of magnitude fewer than with your
proposal (assuming 16 ptes per contpte block, with the a/d bits living in
one of those).

> To me that sounds ... much simpler ;) But again, just something I've
> been thinking about.

OK, so this approach upgrades my "I'm fairly sure we never see
false-positives" to "we definitely never see false-positives". But it
certainly increases the quantity of false-negatives.

> The reuse of ptep_get_lockless() outside GUP code might not have been
> the wisest choice.

If you want to go down the ptep_get_gup_fast() route, you've still got to
be able to spec it, and I think it will land pretty close to my most
recent stab at re-spec'ing ptep_get_lockless() a couple of replies up in
this thread.

Where would your proposal leave the KVM use case? If you call it
ptep_get_gup_fast(), presumably you wouldn't want to use it for KVM? So
KVM would be left with ptep_get()...
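To make sure we're debating the same thing, this is roughly how I'd
imagine the arm64 side of your proposal looking. A sketch only, using the
ptep_get_gup_fast() name proposed in this thread; __ptep_get() is today's
raw single-PTE read and ptep_get() is today's contpte-aware getter:

	/* Full semantics, for walkers that recheck under the PTL. */
	static inline pte_t ptep_get_lockless(pte_t *ptep)
	{
		/* Gathers access/dirty from the whole contpte block. */
		return ptep_get(ptep);
	}

	/* GUP-fast variant: sketch, not a patch. */
	static inline pte_t ptep_get_gup_fast(pte_t *ptep)
	{
		/*
		 * Single atomic read of one PTE; never looks at the other
		 * PTEs in the block, so concurrent fold/unfold can't make
		 * us loop or mix fields from different PTEs. Access/dirty
		 * may be false-negative for contpte mappings.
		 */
		return __ptep_get(ptep);
	}

And GUP-fast would pair it with itself for the recheck, something like:

	pte = ptep_get_gup_fast(ptep);
	/* ... speculatively grab a folio reference ... */
	if (unlikely(!pte_same(pte, ptep_get_gup_fast(ptep)))) {
		/* Bail: drop the ref and fall back to the slow path. */
		goto pte_unmap;
	}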
Sorry this thread is getting so long.

Just to summarise, I think there are currently 3 solutions on the table:

  - ptep_get_lockless() remains as is
  - ptep_get_lockless() wraps ptep_get()
  - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)

Based on the discussion so far, that's also the order of my preference.

Perhaps it's useful to enumerate why we dislike the current
ptep_get_lockless()? I think it's just the potential for looping in the
face of concurrent modifications?
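In case it helps, the looping I mean has roughly this shape; a simplified
paraphrase of arm64's contpte_ptep_get_lockless(), from memory rather than
a copy of the exact code:

	pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
	{
		unsigned long pfn;
		pgprot_t orig_prot, prot;
		pte_t orig_pte, pte, *ptep;
		int i;

	retry:
		orig_pte = __ptep_get(orig_ptep);
		if (!pte_valid_cont(orig_pte))
			return orig_pte;

		orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
		ptep = contpte_align_down(orig_ptep);
		pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);

		/* Gather access/dirty from every PTE in the block. */
		for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
			pte = __ptep_get(ptep);
			prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));

			/*
			 * A concurrent fold/unfold/remap changed the block
			 * under our feet: start again. This is the
			 * unbounded loop.
			 */
			if (!pte_valid(pte) || pte_pfn(pte) != pfn ||
			    pgprot_val(prot) != pgprot_val(orig_prot))
				goto retry;

			if (pte_dirty(pte))
				orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(pte))
				orig_pte = pte_mkyoung(orig_pte);
		}

		return orig_pte;
	}

So a stream of concurrent folds/unfolds can, in principle, keep bouncing
us back to the retry label; in practice each window is tiny, but the loop
isn't bounded.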