On Thu, Oct 6, 2022 at 5:44 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > On Thu, Oct 06, 2022 at 05:23:59PM +0200, Jann Horn wrote: > > ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH: > > > > pte_t pte; > > do { > > pte.pte_low = ptep->pte_low; > > smp_rmb(); > > pte.pte_high = ptep->pte_high; > > smp_rmb(); > > } while (unlikely(pte.pte_low != ptep->pte_low)); > > > > It has a comment above it that argues that this is correct because: > > 1. A present PTE can't become non-present and then become a present > > PTE pointing to another page without a TLB flush in between. > > 2. TLB flushes involve IPIs. > > > > As far as I can tell, in particular on x86, _both_ of those > > assumptions are false; perhaps on mips and sh only one of them is? > > > > Number 2 is straightforward: X86 can run under hypervisors, and when > > it runs under hypervisors, the MMU paravirtualization code (including > > the KVM version) can implement remote TLB flushes without IPIs. > > > > Number 1 is gnarlier, because breaking that assumption implies that > > there can be a situation where different threads see different memory > > at the same virtual address because their TLBs are incoherent. But as > > far as I know, it can happen when MADV_DONTNEED races with an > > anonymous page fault, because zap_pte_range() does not always flush > > stale TLB entries before dropping the page table lock. I think that's > > probably fine, since it's a "garbage in, garbage out" kind of > > situation - but if a concurrent GUP-fast can then theoretically end up > > returning a completely unrelated page, that's bad. > > > > > > Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't > > just change ptep_get_lockless() to use arch_cmpxchg_double() and be > > done with it... > > I think the argument here has nothing to do with IPIs, but is more a > statement on memory ordering. The comment above the definition of ptep_get_lockless() claims: "it will not switch to a completely different present page without a TLB flush in between; something that we are blocking by holding interrupts off." > What we want to get is a non-torn load > of low/high, under some restricted rules. > > PTE writes should be ordered so that the present/not present bit is > properly: > > Zapping a PTE: > > write_low (not present) > wmb() > write_high (a) > wmb() > > Reestablish a PTE: > > write_high (b) > wmb() > write_low (present) > wmb() > > This ordering is necessary to make the TLB's atomic 64 bit load work > properly, otherwise the TLB could read a present entry with a bogus > other half! > > For ptep_get_lockless() we define non-torn as meaning the same as for the TLB: > > pre-zap low / high (present) > restablish low / high (b) (present) > any low / any high (not present) > > Other combinations are forbidden. > > The read side has a corresponding list of reads: > > read_low > rmb() > read_high > rmb() > read_low > > So, it seems plausible this could be OK based only on atomics (I did > not check that the present bit is properly placed in the right > low/high). Do you see a way the atomics don't work out? The race would be something like this, where A is one thread doing ptep_get_lockless() and B, C and D are other threads: <PTE initially points to address 0x0001000100010000> A: read ptep->pte_low, sees low address half 0x00010000 B: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet C: page fault installs a new PTE pointing to address 0x0001000200020000 A: read ptep->pte_high, sees high address half 0x00010002 C: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet D: page fault installs a new PTE pointing to address 0x0001000300010000 A: re-read ptep->pte_low, sees low address half 0x00010000 matching the first one A: returns physical address 0x000100020x00010000, which was never actually in the PTE So it's not a problem with the memory ordering, it's just that it's not possible to atomically read a 64-bit PTE with 32-bit reads when the PTE can completely change under you - and ptep_get_lockless() was written under the assumption that this can't happen because of TLB flush IPIs.