On Thu, Oct 06, 2022 at 05:23:59PM +0200, Jann Horn wrote:
> ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH:
>
> pte_t pte;
> do {
>         pte.pte_low = ptep->pte_low;
>         smp_rmb();
>         pte.pte_high = ptep->pte_high;
>         smp_rmb();
> } while (unlikely(pte.pte_low != ptep->pte_low));
>
> It has a comment above it that argues that this is correct because:
> 1. A present PTE can't become non-present and then become a present
>    PTE pointing to another page without a TLB flush in between.
> 2. TLB flushes involve IPIs.
>
> As far as I can tell, in particular on x86, _both_ of those
> assumptions are false; perhaps on mips and sh only one of them is?
>
> Number 2 is straightforward: X86 can run under hypervisors, and when
> it runs under hypervisors, the MMU paravirtualization code (including
> the KVM version) can implement remote TLB flushes without IPIs.
>
> Number 1 is gnarlier, because breaking that assumption implies that
> there can be a situation where different threads see different memory
> at the same virtual address because their TLBs are incoherent. But as
> far as I know, it can happen when MADV_DONTNEED races with an
> anonymous page fault, because zap_pte_range() does not always flush
> stale TLB entries before dropping the page table lock. I think that's
> probably fine, since it's a "garbage in, garbage out" kind of
> situation - but if a concurrent GUP-fast can then theoretically end up
> returning a completely unrelated page, that's bad.
>
> Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't
> just change ptep_get_lockless() to use arch_cmpxchg_double() and be
> done with it...

I think the argument here has nothing to do with IPIs, but is more a
statement on memory ordering. What we want to get is a non-torn load
of low/high, under some restricted rules.
PTE writes should be ordered so that the present/not-present bit is
handled properly:

Zapping a PTE:
   write_low (not present)
   wmb()
   write_high (a)
   wmb()

Reestablishing a PTE:
   write_high (b)
   wmb()
   write_low (present)
   wmb()

This ordering is necessary to make the TLB's atomic 64 bit load work
properly, otherwise the TLB could read a present entry with a bogus
other half!

For ptep_get_lockless() we define non-torn as meaning the same as for
the TLB:

   pre-zap low / high            (present)
   reestablished low / high (b)  (present)
   any low / any high            (not present)

Other combinations are forbidden.

The read side has a corresponding list of reads:

   read_low
   rmb()
   read_high
   rmb()
   read_low

So, it seems plausible this could be OK based only on atomics (I did
not check that the present bit is properly placed in the right
low/high half). Do you see a way the atomics don't work out?

Jason