On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
> 
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
> 
> This stuff is killing me, sorry for the lengthy reply ...

No problem - thanks for taking the time to think it through and reply
with such clarity. :)

> 
>> 
>> So I've been looking at all this again, and getting myself even more
>> confused.
>> 
>> I believe there are 2 classes of ptep_get_lockless() caller:
>> 
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
> 
> Likely only completely lockless page table walkers where we *cannot*
> recheck under PTL is special. Essentially where we disable interrupts
> and rely on TLB flushes to sync against concurrent changes.

Yes agreed - 2 types: "lockless walkers that later recheck under PTL"
and "lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is
arch-specific. That's not how arm64 does it (the hw broadcasts the
TLBIs). But you make that clear further down.

> 
> Let's take a look where ptep_get_lockless() comes from:
> 
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Date:   Fri Nov 13 11:41:40 2020 +0100
> 
>     mm/gup: Provide gup_get_pte() more generic
> 
>     In order to write another lockless page-table walker, we need
>     gup_get_pte() exposed. While doing that, rename it to
>     ptep_get_lockless() to match the existing ptep_get() naming.
> 
> GUP-fast, when we were still relying on TLB flushes to sync against
> GUP-fast.
> 
> "With get_user_pages_fast(), we walk down the pagetables without taking
> any locks. For this we would like to load the pointers atomically, but
> sometimes that is not possible (e.g. without expensive cmpxchg8b on
> x86_32 PAE). What we do have is the guarantee that a PTE will only
> either go from not present to present, or present to not present or
> both -- it will not switch to a completely different present page
> without a TLB flush in between; something that we are blocking by
> holding interrupts off."
> 
> Later, we added support for GUP-fast that introduced the !TLB variant:
> 
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <steve.capper@xxxxxxxxxx>
> Date:   Thu Oct 9 15:29:14 2014 -0700
> 
>     mm: introduce a general RCU get_user_pages_fast()
> 
> With the pattern
> 
> 	/*
> 	 * In the line below we are assuming that the pte can be read
> 	 * atomically. If this is not the case for your architecture,
> 	 * please wrap this in a helper function!
> 	 *
> 	 * for an example see gup_get_pte in arch/x86/mm/gup.c
> 	 */
> 	pte_t pte = ACCESS_ONCE(*ptep);
> 	...
> 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> 	...
> 
> Whereby the mentioned arch/x86/mm/gup.c code did a straight
> pte_t pte = gup_get_pte(ptep) without any re-reading of PTEs. The PTE
> that was read was required to be sane, thus the lengthy comment above
> ptep_get_lockless() that talks about TLB flushes.
> 
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH
> is still full of details about TLB flushes syncing against GUP-fast.
> But as you note, we use it even in contexts where we don't disable
> interrupts and the TLB flush can't help.
> 
> We don't disable interrupts during page faults ... so most of the
> things described in ptep_get_lockless() don't really apply.
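Right. For anyone following along, the CONFIG_GUP_GET_PXX_LOW_HIGH
version in include/linux/pgtable.h is, IIRC, roughly the below. Note
that the retry loop only guarantees a self-consistent low/high pair;
it's the interrupts-off + IPI-on-TLB-flush protocol that stops the pte
from being freed or replaced with a different present pte underneath us:

	static inline pte_t ptep_get_lockless(pte_t *ptep)
	{
		pte_t pte;

		do {
			/*
			 * Writers set pte_high before pte_low when making
			 * a pte present, and clear pte_low first when
			 * tearing it down, so if pte_low is stable (and
			 * present) the high half we read belongs with it.
			 */
			pte.pte_low = ptep->pte_low;
			smp_rmb();
			pte.pte_high = ptep->pte_high;
			smp_rmb();
		} while (unlikely(pte.pte_low != ptep->pte_low));

		return pte;
	}

In the page fault path interrupts stay enabled, so that second half of
the protocol simply isn't there.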
> That's also the reason why ...
> 
>> 
>>  	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> -	vmf->orig_pte = *vmf->pte;
>> +	vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>  	vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>> 
>> -	/*
>> -	 * some architectures can have larger ptes than wordsize,
>> -	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> -	 * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> -	 * accesses. The code below just needs a consistent view
>> -	 * for the ifs and we later double check anyway with the
>> -	 * ptl lock held. So here a barrier will do.
>> -	 */
>> -	barrier();
>>  	if (pte_none(vmf->orig_pte)) {
> 
> ... that code was in place. We would possibly read garbage PTEs, but
> would recheck *under PTL* (where the PTE can no longer change) that
> what we read wasn't garbage and didn't change.

Agreed.

> 
>> 
>> --
>> 
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty,
>> uffd-wp or soft-dirty. arm64 can guarantee that all the other bits
>> are self-consistent just by calling ptep_get().
> 
> Let's focus on access+dirty for now ;)
> 
>> 
>> --
>> 
>> So, I'm making the bold claim that it was never necessary to
>> specialize ptep_get_lockless() on arm64, and I *think* we could just
>> delete it so that ptep_get_lockless() resolves to ptep_get() on
>> arm64. That achieves the original aim without needing to introduce
>> "norecency" variants.
>> 
>> Additionally I propose documenting ptep_get_lockless() to describe
>> the set of fields that are guaranteed to be self-consistent and the
>> remaining fields which are self-consistent only on a best-effort
>> basis.
>> 
>> Could it be this easy? My head is hurting...
> 
> I think what has to happen is:
> 
> (1) ptep_get_lockless() must return the same value as ptep_get() as
> long as there are no races. No removal/addition of access/dirty bits
> etc.

Today's arm64 ptep_get() guarantees this.

> 
> (2) Lockless page table walkers that later verify under the PTL can
> handle serious "garbage PTEs". This is our page fault handler.

This isn't really a property of ptep_get_lockless(); it's a statement
about a class of users. I agree with the statement.

> 
> (3) Lockless page table walkers that cannot verify under PTL cannot
> handle arbitrary garbage PTEs. This is GUP-fast. Two options:
> 
> (3a) ptep_get_lockless() can atomically read the PTE: We re-check
> later if the atomically-read PTE is still unchanged (without PTL). No
> IPI for TLB flushes required. This is the common case. HW might
> concurrently set access/dirty bits, so we can race with that. But we
> don't read garbage.

Today's arm64 ptep_get() cannot guarantee that the access/dirty bits
are consistent for contpte ptes. That's the bit that complicates the
current ptep_get_lockless() implementation.

But the point I was trying to make is that GUP-fast does not actually
care about *all* the fields being consistent (e.g. access/dirty). So we
could spec ptep_get_lockless() to say that "all fields in the returned
pte are guaranteed to be self-consistent except for access and dirty
information, which may be inconsistent if a racing modification
occurred".

This could mean that the access/dirty state *does* change for a given
page while GUP-fast is walking it, but GUP-fast *doesn't* detect that
change. I *think* that failing to detect this is benign.

Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must
conform to (1), so it either returns the same pte, a different pte, or
garbage. But that garbage could just happen to be the same as the
originally obtained pte, in which case we would have a false match. I
think this recheck needs to be changed to use ptep_get_lockless()?
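To make that aside concrete, IIRC the relevant part of gup_pte_range()
in mm/gup.c currently reads roughly as below (trimmed, not verbatim):

	pte_t pte = ptep_get_lockless(ptep);
	...
	/*
	 * After grabbing a ref on the folio, confirm that the pte
	 * didn't change underneath us before exposing the page.
	 */
	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}

If that ptep_get() in the recheck can tear for contpte ptes, the torn
value could happen to compare equal to the original pte, and we would
keep the pin based on a false match.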
> 
> (3b) ptep_get_lockless() cannot atomically read the PTE: We need
> special magic to read somewhat-sane PTEs and need IPIs during TLB
> flushes to sync against serious PTE changes (e.g., present ->
> present). This is weird x86-PAE.
> 
> 
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
> 
> My head is hurting ...
> 
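FWIW, if we do delete arm64's specialized version, we just fall back to
the generic one in include/linux/pgtable.h, which for the
!CONFIG_GUP_GET_PXX_LOW_HIGH case is (IIRC) simply:

	static inline pte_t ptep_get_lockless(pte_t *ptep)
	{
		return ptep_get(ptep);
	}

So lockless walkers on arm64 would get the contpte-aware ptep_get(),
with access/dirty the only fields that can be racy - which is exactly
the spec I was proposing we document above.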