On Thu, Oct 27, 2022 at 12:08 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> So I thought the general rule was that if you modify a PTE and have not
> unmapped things -- IOW, there's actual concurrency possible on the
> thing, then the TLB invalidate needs to happen under pte_lock, since
> that is what controls concurrency at the pte level.

Yeah, I think we should really think about the TLB issue and make the rules very clear.

A lot of our thinking behind "fix TLB races" has been about "use-after-free wrt TLB on another CPU", and the design in zap_page_range() is almost entirely about that: making sure that the TLB flush happens before the underlying pages are free'd.

I say "almost entirely", because you can see the effects of the *other* kind of race in

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}

where it isn't about the lifetime of the page, but about the state coherency wrt other users.

But I think even that code is quite questionable, and we really should write down the rules much more strictly somewhere.

For example, wrt that pte_dirty() causing force_flush, I'd love to have some core set of rules that would explain

 (a) why it is only relevant for shared pages

 (b) do even shared pages need it for the "fullmm" case when we tear down the whole VM?

 (c) what are the force_flush interactions wrt holding the mmap lock for writing vs reading?

In the above, I think (b)/(c) are related - I suspect "fullmm" is mostly equivalent to "mmap held for writing", in that it guarantees that there should be no concurrent faults or other active sw ops on the table.

But "fullmm" is probably even stronger than "mmap write-lock" in that it should also mean "no other CPU can be actively using this" - either for hardware page table walking, or for GUP.

But both still have the vmscan code and rmap that can get to the pages - and the vmscan code clearly only cares about the page table lock. But the vmscan code itself also shouldn't care about some stale TLB value, so ...

> As it stands MADV_DONTNEED seems to blatantly violate that general rule.

So let's talk about exactly what and why the TLB would need to be flushed before dropping the page table lock.

For example, MADV_DONTNEED does this all with just the mmap lock held for reading, so unless we have that 'force_flush', we can

 (a) have another CPU continue to use the old stale TLB entry for quite a while

 (b) yet another CPU (that didn't have a TLB entry, or wanted to write to a read-only one) could take a page fault, and install a *new* PTE entry in the same slot, all at the same time.

Now, that's clearly *very* confusing. But being confusing may not mean "wrong" - we're still delaying the free of the old entry, so there's no use-after-free.

The biggest problem I can see is that this means that user space memory ordering might be screwed up: the CPU in (a) will see not just an old TLB entry, but simply old *data*, when the CPU in (b) may be writing to that same address with new data.

So I think we clearly do *NOT* serialize as much as we could for MADV_DONTNEED, and I think the above is a real semantic result of that. But equally arguably, if you do this kind of unserialized MADV_DONTNEED in user space, that's a case of "you asked for the breakage - you get to keep both pieces".

So is that ever an actual problem? If the worst issue is that some CPUs may see old ("pre-DONTNEED") data, while other CPUs then see new data while the MADV_DONTNEED is executing, I think maybe *THAT* part is fine.
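As a concrete illustration of that unserialized case, here is a minimal, entirely made-up user-space sketch (a toy program, not kernel code, and not from any real workload): it races madvise(MADV_DONTNEED) against a writer and a reader on a private anonymous page, where the reader plays the part of the CPU in (a) that may keep seeing the old 'A' data through a stale TLB entry even after the writer in (b) has faulted in a fresh page and sees its own 'B':

	#include <pthread.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#define PG 4096

	static char *buf;

	/* The zapper: drops the page with no serialization at all against
	 * the other threads - an unserialized MADV_DONTNEED from user space. */
	static void *zapper(void *arg)
	{
		madvise(buf, PG, MADV_DONTNEED);
		return NULL;
	}

	/* The CPU in (b): may take a fault and install a *new* PTE mid-zap. */
	static void *writer(void *arg)
	{
		buf[0] = 'B';
		return NULL;
	}

	/* The CPU in (a): with a warm TLB entry it can keep reading the old
	 * 'A' data for a while, even after the writer already sees 'B'.
	 * The volatile read just keeps the compiler from hoisting the load. */
	static void *reader(void *arg)
	{
		int saw_a = 0, saw_b = 0, saw_zero = 0;

		for (int i = 0; i < 1000000; i++) {
			char c = *(volatile char *)buf;

			if (c == 'A')
				saw_a = 1;
			else if (c == 'B')
				saw_b = 1;
			else
				saw_zero = 1;
		}
		printf("reader saw:%s%s%s\n",
		       saw_a ? " A" : "", saw_b ? " B" : "", saw_zero ? " 0" : "");
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b, z;

		buf = mmap(NULL, PG, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		memset(buf, 'A', PG);

		pthread_create(&a, NULL, reader, NULL);
		pthread_create(&z, NULL, zapper, NULL);
		pthread_create(&b, NULL, writer, NULL);
		pthread_join(a, NULL);
		pthread_join(z, NULL);
		pthread_join(b, NULL);
		return 0;
	}

What the reader actually observes obviously depends on timing and on how long the stale TLB entry survives on its CPU, so this only shows where the window is - it's not a reliable reproducer - and the possible outcomes ('A', 'B', zero-fill, in any interleaving) are exactly the "you get to keep both pieces" semantics above.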
But because it's so very confusing, maybe we have other problems in this area, which is why I think it would be really good to have an actual set of real hard documented rules about this all, and exactly when we need to flush TLBs synchronously with the page table lock, and when we do not.

Anybody willing to try to write up the rules (and have each rule document *why* it's a rule - not just "by fiat", but an actual "these are the rules and this is *why* they are the rules")?

Because right now I think all of our rules are almost entirely just encoded in the code, with a couple of comments, and a few people who just remember why we do what we do.

                Linus