On Mon, Apr 21, 2008 at 04:35:47PM +0200, Peter Zijlstra wrote:
> On Mon, 2008-04-21 at 16:26 +0300, Avi Kivity wrote:
> > Peter Zijlstra wrote:
> > > On Mon, 2008-04-21 at 15:00 +0300, Avi Kivity wrote:
> > >
> > >> Linus Torvalds wrote:
> > >>
> > >>> Finally, I don't think that comment is correct in the first place. It's
> > >>> not that simple. The thing is, even *with* the memory barrier in place, we
> > >>> may have:
> > >>>
> > >>>   CPU#1                        CPU#2
> > >>>   =====                        =====
> > >>>
> > >>>   fast_gup:
> > >>>    - read low word
> > >>>                                native_set_pte_present:
> > >>>                                 - set low word to 0
> > >>>                                 - set high word to new value
> > >>>    - read high word
> > >>>                                 - set low word to new value
> > >>>
> > >>> and so you read a low word that is associated with a *different* high
> > >>> word! Notice?
> > >>>
> > >>> So trivial memory ordering is _not_ enough.
> > >>>
> > >>> So I think the code literally needs to be something like this:
> > >>>
> > >>>   #ifdef CONFIG_X86_PAE
> > >>>
> > >>>   static inline pte_t native_get_pte(pte_t *ptep)
> > >>>   {
> > >>>           pte_t pte;
> > >>>
> > >>>   retry:
> > >>>           pte.pte_low = ptep->pte_low;
> > >>>           smp_rmb();
> > >>>           pte.pte_high = ptep->pte_high;
> > >>>           smp_rmb();
> > >>>           if (unlikely(pte.pte_low != ptep->pte_low))
> > >>>                   goto retry;
> > >>>           return pte;
> > >>>   }
> > >>>
> > >> I think this is still broken. Suppose that after reading pte_high,
> > >> native_set_pte() is called again on another cpu, changing pte_low back
> > >> to the original value (but with a different pte_high). You now have
> > >> pte_low from the second native_set_pte() but pte_high from the first
> > >> native_set_pte().
> > >
> > > I think the idea was that for user pages we only use set_pte_present(),
> > > which does the low=0 thing first.
> >
> > Doesn't matter. The second native_set_pte() (or set_pte_present()) can
> > execute in its entirety between the two reads:
> >
> >   fast_gup:
> >    - read low word (l0)
> >                                native_set_pte_present:
> >                                 - set low word to 0
> >                                 - set high word to new value (h1)
> >                                 - set low word to new value (l1)
> >    - read high word (h1)
> >                                native_set_pte_present:
> >                                 - set low word to 0
> >                                 - set high word to new value (h2)
> >                                 - set low word to new value (l2)
> >    - re-read low word (l2)
> >
> > If l2 happens to be equal to l0, then the check succeeds and we have a
> > splintered pte h1:l0.
>
> OK, so let's use cmpxchg8b.

That's horrible ;)

Anyway, guys, you are missing the other side of the equation: whenever
_PAGE_PRESENT is cleared, every CPU where current->mm might be == mm has
to get a TLB flush. And we are holding off TLB flushes in fast_gup;
that's the whole reason why it all works.

We do indeed need Linus's loop, though, because it seems I wasn't
thinking of the teardown side when I wrote that comment (teardowns under
mmu_gather can and do set the pte to arbitrary values before the IPI
goes out -- but we can be sure those values will never have
_PAGE_PRESENT set). So Linus's loop is what I will use for PAE; a rough
sketch of how it combines with the IPI hold-off is below.

I'd love to know whether the hardware page-table walker actually does an
atomic 64-bit load or not, though.
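
To make that concrete, here is a minimal sketch of how the two pieces fit
together: Linus's retry read for PAE, plus a caller that disables
interrupts to hold off the TLB-shootdown IPI. This is illustrative only,
not the actual gup.c code; gup_get_pte() and fast_gup_one() and their
calling conventions are made up for the example.

	#ifdef CONFIG_X86_PAE
	/*
	 * Linus's loop: keep rereading until the low and high halves
	 * form a consistent pair.  Teardown may scribble transient
	 * values into the pte, but never ones with _PAGE_PRESENT set,
	 * so the caller's present check below filters those out.
	 */
	static inline pte_t gup_get_pte(pte_t *ptep)
	{
		pte_t pte;

	retry:
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
		if (unlikely(pte.pte_low != ptep->pte_low))
			goto retry;
		return pte;
	}
	#else
	/* A 32-bit pte is read in one naturally atomic load. */
	static inline pte_t gup_get_pte(pte_t *ptep)
	{
		return *ptep;
	}
	#endif

	/* Hypothetical caller, showing where the IPI hold-off comes in. */
	static int fast_gup_one(pte_t *ptep, struct page **pagep)
	{
		pte_t pte;
		int ret = 0;

		/*
		 * With interrupts off, the TLB-flush IPI cannot be
		 * delivered to this CPU.  Since clearing _PAGE_PRESENT
		 * requires flushing every CPU that might have the mm
		 * loaded, any pte we observe here with _PAGE_PRESENT set
		 * refers to a page that cannot be freed under us.
		 */
		local_irq_disable();
		pte = gup_get_pte(ptep);
		if (pte_present(pte)) {
			*pagep = pte_page(pte);
			get_page(*pagep);
			ret = 1;
		}
		local_irq_enable();

		return ret;
	}

The point is that the retry loop only has to produce *some* consistent
snapshot; the interrupt-disable window is what guarantees a present pte
in that snapshot is still safe to use.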