On Sun, Apr 27, 2014 at 12:20 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> OK, so I've been thinking and figured I either mis-understand how the
> hardware works or don't understand how Linus' patch will actually fully
> fix the issue.
>
> So what both try_to_unmap_one() and zap_pte_range() end up doing is
> clearing the PTE entry and then flushing the TLBs.

Right. And we can't do it in the other order, since if we flush the TLBs first, another cpu (or even this cpu, doing speculative memory accesses) could re-fill them between the flush and the clearing of the page table.

So that's race #1, easily fixed by ordering the TLB flush after clearing the page table entry. We've never gotten this particular race wrong, afaik. It's _so_ obvious that it's the only one we've always gotten right.

On to the ones we've historically gotten wrong:

> However, that still leaves a window where there are remote TLB entries.
> What if any of those remote entries cause a write (or have a dirty bit
> cached) while we've already removed the PTE entry.

So this is race #2: the race between clearing the entry and a TLB miss loading it and marking it accessed or dirty. That race is handled by the fact that the CPU does the accessed/dirty bit update as an atomic read-modify-write operation, and actually re-checks the PTE entry as it does so.

So in theory a CPU could just remember what address it loaded the TLB entry from, and do a blind "set the dirty bit" with just an atomic "or" operation. In fact, for a while I thought that CPUs could do that, and that the TLB flushing sequence would be:

	entry = atomic_xchg(pte, 0);
	flush_tlb();
	entry |= *pte;

so that we'd catch any races with the A/D bit getting set. It turns out no CPU actually does that, and I'm not sure we ever had that code sequence in the kernel (but some code archaeologist might go look).

What CPUs actually do is simpler both for them and for us: instead of remembering where they loaded the TLB entry from, they re-walk the page table when they mark the TLB entry dirty, and if the page table entry is no longer present (or if it's non-writable), the store will fault instead of marking the TLB entry dirty.

So race #2 doesn't need the above complex sequence, but it still *does* need the page table entry to be read with an atomic exchange with zero (or at least with something that clears the present bit, zero obviously being the simplest such value). So a simple

	entry = atomic_xchg(pte, 0);

is sufficient for this race (except we call it "ptep_get_and_clear()" ;)

Of course, *if* a CPU were to remember the address it loaded the TLB entry from, then such a CPU might as well make the TLB part of the cache-coherency domain, and then we wouldn't need to do any TLB flushing at all. I wish.

> Will the hardware fault when it does a translation and needs to update
> the dirty/access bits while the PTE entry is !present?

Yes indeed, see above (but see how broken hardware _could_ work, which would be really painful for us).

What we are fighting is race #3: the TLB entry happily exists on this or other CPUs, and is _not_ getting updated (so no re-walk), but _is_ getting used.

And we've actually had a fix for race #3 for a long time: the whole "don't free the pages until after the flush" is very much this issue. So it's not a new condition by any means (as far as I can tell, the mmu_gather infrastructure was introduced in 2.4.9.13, so 2001 - the exact commit predates even BK history).
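To make the ordering concrete, here is a stand-alone user-space model of that whole sequence. It is only a sketch, not the kernel's code, and every model_* name is invented for illustration; the point is the order of operations: clear the PTE with an atomic exchange, remember the page instead of freeing it, flush, and only then free.

/*
 * Stand-alone model of the unmap ordering discussed above -- NOT the
 * kernel's code.  All model_* names are invented for illustration.
 *
 * What it models:
 *   1. the PTE is cleared with an atomic exchange (the
 *      "ptep_get_and_clear()" step), so an A/D-bit re-walk either sees
 *      the old present entry or faults (races #1 and #2)
 *   2. the page is not freed immediately, only remembered
 *   3. the TLBs are flushed
 *   4. only after the flush are the remembered pages freed (race #3)
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MODEL_PRESENT	0x1ULL

struct model_gather {
	void	*pages[16];	/* pages waiting for the flush */
	int	nr;
};

/* Step 1: the ptep_get_and_clear() part - atomic xchg with zero. */
static uint64_t model_pte_clear(_Atomic uint64_t *pte)
{
	return atomic_exchange(pte, 0);
}

/* Stand-in for the (expensive, possibly remote) TLB shootdown. */
static void model_flush_tlb(void)
{
	printf("flush TLBs\n");
}

/* Step 2: don't free yet, just remember the page. */
static void model_gather_page(struct model_gather *tlb, void *page)
{
	tlb->pages[tlb->nr++] = page;
}

/* Steps 3 and 4: flush first, free the gathered pages only afterwards. */
static void model_gather_finish(struct model_gather *tlb)
{
	model_flush_tlb();
	while (tlb->nr > 0)
		free(tlb->pages[--tlb->nr]);
}

int main(void)
{
	struct model_gather tlb = { .nr = 0 };
	void *page = malloc(4096);

	/* "map" the page: stash the pointer in a fake PTE, mark it present */
	_Atomic uint64_t pte = (uintptr_t)page | MODEL_PRESENT;

	/* unmap: clear atomically, gather, flush, then free */
	uint64_t old = model_pte_clear(&pte);
	if (old & MODEL_PRESENT)
		model_gather_page(&tlb, (void *)(uintptr_t)(old & ~MODEL_PRESENT));
	model_gather_finish(&tlb);

	return 0;
}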
But this new issue, while related to race #3, is purely in software: when we do the "set_page_dirty()" before doing the TLB flush, we need to protect against anyone cleaning that bit until after the flush.

And we've now had three different ways to fix that race:

 - one introducing a new race (my original two-patch series that delayed the set_page_dirty() the same way we delay the page freeing),
 - one using a new lock entirely (Hugh's latest patch - mapping->i_mmap_mutex isn't a new lock, but in this context it is), and
 - one that extends the rules we already had in place for the single-PTE cases (do the set_page_dirty() and the TLB flushing atomically wrt the page table lock, which makes it atomic wrt mkclean_one).

And the reason I think I'll merge my patch rather than Hugh's (despite Hugh's being smaller) is exactly the fact that it doesn't really introduce any new locking rules - it just fixes the fact that we didn't really realize how important the existing rule was, and didn't follow the same rules as the single-pte cases did.

		Linus
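PS: that third option boils down to an ordering rule that can be modeled outside the kernel. The sketch below is just that - a stand-alone model with invented model_* names, not the actual patch - showing the dirty-bit transfer and the TLB flush happening atomically wrt the same lock that the cleaning side takes.

/*
 * Stand-alone model of "set_page_dirty() and the TLB flush are atomic
 * wrt the page table lock" -- NOT the kernel patch; all model_* names
 * are invented.  model_ptl plays the role of the page table lock that
 * both the unmap path and the page-cleaning path take.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MODEL_PRESENT	0x1ULL
#define MODEL_DIRTY	0x2ULL

static pthread_mutex_t model_ptl = PTHREAD_MUTEX_INITIALIZER;
static _Atomic uint64_t model_pte = MODEL_PRESENT | MODEL_DIRTY;
static bool model_page_dirty;		/* the struct-page dirty flag */

static void model_flush_tlb(void)
{
	/* stand-in for the shootdown; afterwards no stale entries remain */
	printf("flush TLBs\n");
}

/* Unmap side (zap_pte_range()-like): clear, transfer dirty, flush - all under the lock. */
static void model_zap_pte(void)
{
	pthread_mutex_lock(&model_ptl);

	uint64_t old = atomic_exchange(&model_pte, 0);
	if (old & MODEL_DIRTY)
		model_page_dirty = true;	/* the set_page_dirty() step */

	/*
	 * The whole point: flush before dropping the lock, so the dirty-bit
	 * transfer and the flush look atomic to anyone else taking model_ptl.
	 */
	model_flush_tlb();

	pthread_mutex_unlock(&model_ptl);
}

/* Cleaning side (page_mkclean_one()-like, heavily simplified): same lock. */
static void model_mkclean(void)
{
	pthread_mutex_lock(&model_ptl);
	/*
	 * Because the unmap side flushed before releasing the lock, by the
	 * time we get here no CPU can still dirty the page through a stale
	 * TLB entry.  (A real cleaner would also write-protect a still-
	 * present PTE and flush; this model only clears the page flag.)
	 */
	model_page_dirty = false;
	pthread_mutex_unlock(&model_ptl);
}

int main(void)
{
	model_zap_pte();
	model_mkclean();
	printf("page dirty after cleaning: %d\n", model_page_dirty);
	return 0;
}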