Re: Dirty/Access bits vs. page content

On Sun, Apr 27, 2014 at 12:20 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> OK, so I've been thinking and figured I either mis-understand how the
> hardware works or don't understand how Linus' patch will actually fully
> fix the issue.
>
> So what both try_to_unmap_one() and zap_pte_range() end up doing is
> clearing the PTE entry and then flushing the TLBs.

Right. And we can't do it in the other order, since if we flush the
TLBs first, another cpu (or even this cpu, doing speculative memory
accesses) could re-fill them between the flush and the clearing of the
page table.

So that's race #1, easily fixed by ordering the TLB flush after
clearing the page table entry. We've never done this particular race
wrong, afaik. It's _so_ obvious that it's the only one we've always
gotten right.
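IOW, the only safe shape is roughly this (a sketch with approximate
names, not the literal zap_pte_range() code - "mm", "vma" etc are
whatever the call site happens to have):

    pte_t entry;

    entry = ptep_get_and_clear(mm, addr, ptep); /* PTE now !present */
    flush_tlb_page(vma, addr);                  /* kill stale TLB copies */

and never the other way around.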

On to the ones we've historically gotten wrong:

> However, that still leaves a window where there are remote TLB entries.
> What if any of those remote entries cause a write (or have a dirty bit
> cached) while we've already removed the PTE entry.

So this is race #2: the race between clearing the entry and a TLB miss
loading it and marking it accessed or dirty.

That race is handled by the fact that the CPU does the accessed/dirty
bit update as an atomic read-modify-write operation, and actually
re-checks the PTE entry as it does so.

So in theory a CPU could just remember what address it loaded the TLB
entry from, and do a blind "set the dirty bit" with just an atomic
"or" operation. In fact, for a while I thought that CPU's could do
that, and the TLB flushing sequence would be:

    entry = atomic_xchg(pte, 0); /* clear the PTE, keep the old A/D bits */
    flush_tlb();                 /* after this, no more blind updates */
    entry |= *pte;               /* pick up any A/D bits a racing CPU
                                    OR'ed in between the xchg and flush */

so that we'd catch any races with the A/D bit getting set.

It turns out no CPU actually does that, and I'm not sure we ever had
that code sequence in the kernel (but some code archaeologist might go
look).

What CPUs actually do is simpler both for them and for us: instead of
remembering where they loaded the TLB entry from, they re-walk the
page table when they mark the TLB entry dirty, and if the page table
entry is no longer present (or is non-writable), the store will fault
instead of marking the TLB entry dirty.
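Expressed as pseudocode for what the hardware does on a store that
needs to set the dirty bit (purely conceptual - these "functions" are
just names I made up for the walker's internal steps):

    pte = walk_page_tables(vaddr);            /* re-walk, ignore the TLB */
    if (!pte_present(pte) || !pte_write(pte))
            take_page_fault();                /* entry changed under us */
    else
            locked_or(ptep, _PAGE_DIRTY);     /* atomic RMW on the PTE */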

So race #2 doesn't need the above complex sequence, but it still
*does* need the PTE to be cleared with an atomic exchange with zero
(or at least with something that clears the present bit, zero
obviously being the simplest such value). So a simple

    entry = atomic_xchg(pte, 0);

is sufficient for this race (except we call it "ptep_get_and_clear()" ;)
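So the single-PTE pattern ends up looking something like this (a
sketch of the usual shape, not any one call site):

    entry = ptep_get_and_clear(mm, addr, ptep);
    /*
     * From here on, any CPU trying to set A/D re-walks the page
     * table, sees !present, and faults - so the bits we got back
     * in 'entry' are final.
     */
    flush_tlb_page(vma, addr);
    if (pte_dirty(entry))
            set_page_dirty(pte_page(entry));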

Of course, *if* a CPU were to remember the address it loaded the TLB
entry from, then such a CPU might as well make the TLB be part of the
cache-coherency domain, and then we wouldn't need to do any TLB
flushing at all. I wish.

> Will the hardware fault when it does a translation and needs to update
> the dirty/access bits while the PTE entry is !present?

Yes indeed, see above (but see how broken hardware _could_ work, which
would be really painful for us).

What we are fighting is race #3: the TLB entry happily exists on this
or other CPUs, and is _not_ getting updated (so no re-walk), but _is_
getting used.

And we've actually had a fix for race #3 for a long time: the whole
"don't free the pages until after the flush" is very much this issue.
So it's not a new condition by any means (as far as I can tell, the
mmu_gather infrastructure was introduced in 2.4.9.13, so 2001 - the
exact commit predates even BK history).
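Schematically, the mmu_gather pattern is (with the real API details
elided - the exact arguments have changed over the years):

    struct mmu_gather tlb;

    tlb_gather_mmu(&tlb, mm, start, end);
    /* ... for each pte in the range ... */
            entry = ptep_get_and_clear(mm, addr, ptep);
            tlb_remove_page(&tlb, pte_page(entry)); /* queue, don't free */
    /* ... */
    tlb_finish_mmu(&tlb, start, end);   /* flush TLBs, THEN free pages */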

But this new issue is related to race #3 purely in software: when we
do the "set_page_dirty()" before doing the TLB flush, we need to
prevent that dirty bit from being cleaned again until after the flush.

And we've now had three different ways to fix that race: one that
introduced a new race (my original two-patch series, which delayed
the set_page_dirty() the same way we delay the page freeing), one
that uses a new lock (Hugh's latest patch - mapping->i_mmap_mutex
isn't a new lock, but in this context it is), and one that extends
the rules we already had in place for the single-PTE case (do the
set_page_dirty() and the TLB flush atomically wrt the page table
lock, which makes them atomic wrt page_mkclean_one()).
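In sketch form, that third option makes the zap path look like this
(details and the pte loop elided, names approximate):

    pte_t *ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
    /* ... for each pte ... */
            entry = ptep_get_and_clear(mm, addr, ptep);
            if (pte_dirty(entry))
                    set_page_dirty(pte_page(entry));
    /* ... */
    flush_tlb_range(vma, start, end);   /* flush BEFORE dropping ptl */
    pte_unmap_unlock(ptep, ptl);

so that page_mkclean_one(), which takes the same page table lock, can
never observe a clean page that some other CPU's stale TLB entry can
still dirty behind its back.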

And the reason I think I'll merge my patch rather than Hugh's (despite
Hugh's being smaller) is exactly that it doesn't introduce any new
locking rules - it just fixes the fact that we didn't realize how
important the existing rule was, and didn't follow it the way the
single-PTE cases did.

                   Linus