On Thu, Oct 06, 2022 at 05:23:59PM +0200, Jann Horn wrote:
> ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH:
>
> pte_t pte;
> do {
>         pte.pte_low = ptep->pte_low;
>         smp_rmb();
>         pte.pte_high = ptep->pte_high;
>         smp_rmb();
> } while (unlikely(pte.pte_low != ptep->pte_low));
>
> It has a comment above it that argues that this is correct because:
> 1. A present PTE can't become non-present and then become a present
>    PTE pointing to another page without a TLB flush in between.
> 2. TLB flushes involve IPIs.
>
> As far as I can tell, in particular on x86, _both_ of those
> assumptions are false; perhaps on mips and sh only one of them is?
>
> Number 2 is straightforward: X86 can run under hypervisors, and when
> it runs under hypervisors, the MMU paravirtualization code (including
> the KVM version) can implement remote TLB flushes without IPIs.
>
> Number 1 is gnarlier, because breaking that assumption implies that
> there can be a situation where different threads see different memory
> at the same virtual address because their TLBs are incoherent. But as
> far as I know, it can happen when MADV_DONTNEED races with an
> anonymous page fault, because zap_pte_range() does not always flush
> stale TLB entries before dropping the page table lock. I think that's
> probably fine, since it's a "garbage in, garbage out" kind of
> situation - but if a concurrent GUP-fast can then theoretically end up
> returning a completely unrelated page, that's bad.
>
> Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't
> just change ptep_get_lockless() to use arch_cmpxchg_double() and be
> done with it...

I think the argument here has nothing to do with IPIs, but is more a
statement on memory ordering. What we want to get is a non-torn load
of low/high, under some restricted rules.
PTE writes should be ordered so that the present/not-present bit is
handled properly:

Zapping a PTE:
   write_low (not present)
   wmb()
   write_high (a)
   wmb()

Reestablishing a PTE:
   write_high (b)
   wmb()
   write_low (present)
   wmb()

This ordering is necessary to make the TLB's atomic 64 bit load work
properly, otherwise the TLB could read a present entry with a bogus
other half!

For ptep_get_lockless() we define non-torn as meaning the same as for
the TLB:

   pre-zap low / high            (present)
   reestablished low / high (b)  (present)
   any low / any high            (not present)

Other combinations are forbidden.

The read side has a corresponding list of reads:

   read_low
   rmb()
   read_high
   rmb()
   read_low

So, it seems plausible this could be OK based only on atomics (I did
not check that the present bit is properly placed in the right
low/high half). Do you see a way the atomics don't work out?

Jason