On Thu, Oct 27, 2022 at 12:08 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> So I thought the general rule was that if you modify a PTE and have not
> unmapped things -- IOW, there's actual concurrency possible on the
> thing, then the TLB invalidate needs to happen under pte_lock, since
> that is what controls concurrency at the pte level.

Yeah, I think we should really think about the TLB issue and make the rules very clear.

A lot of our thinking behind "fix TLB races" has been about "use-after-free wrt TLB on another CPU", and the design in zap_page_range() is almost entirely about that: making sure that the TLB flush happens before the underlying pages are free'd.

I say "almost entirely", because you can see the effects of the *other* kind of race in

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}

where it isn't about the lifetime of the page, but about the state coherency wrt other users.

But I think even that code is quite questionable, and we really should write down the rules much more strictly somewhere.

For example, wrt that pte_dirty() causing force_flush, I'd love to have some core set of rules that would explain

 (a) why it is only relevant for shared pages

 (b) do even shared pages need it for the "fullmm" case when we tear down the whole VM?

 (c) what are the force_flush interactions wrt holding the mmap lock for writing vs reading?

In the above, I think (b)/(c) are related - I suspect "fullmm" is mostly equivalent to "mmap held for writing", in that it guarantees that there should be no concurrent faults or other active sw ops on the table.

But "fullmm" is probably even stronger than "mmap write-lock" in that it should also mean "no other CPU can be actively using this" - either for hardware page table walking, or for GUP.

But both still have the vmscan code and rmap that can get to the pages - and the vmscan code clearly only cares about the page table lock. But the vmscan code itself also shouldn't care about some stale TLB value, so ...

> As it stands MADV_DONTNEED seems to blatantly violate that general rule.

So let's talk about exactly what and why the TLB would need to be flushed before dropping the page table lock.

For example, MADV_DONTNEED does this all with just the mmap lock held for reading, so unless we have that 'force_flush', we can

 (a) have another CPU continue to use the old stale TLB entry for quite a while

 (b) yet another CPU (that didn't have a TLB entry, or wanted to write to a read-only one) could take a page fault, and install a *new* PTE entry in the same slot, all at the same time.

Now, that's clearly *very* confusing. But being confusing may not mean "wrong" - we're still delaying the free of the old entry, so there's no use-after-free.

The biggest problem I can see is that this means that user space memory ordering might be screwed up: the CPU in (a) will see not just an old TLB entry, but simply old *data*, when the CPU in (b) may be writing to that same address with new data.

So I think we clearly do *NOT* serialize as much as we could for MADV_DONTNEED, and I think the above is a real semantic result of that. But equally arguably, if you do this kind of unserialized MADV_DONTNEED in user space, that's a case of "you asked for the breakage - you get to keep both pieces".

So is that ever an actual problem? If the worst issue is that some CPUs may see old ("pre-DONTNEED") data, while other CPUs then see new data while the MADV_DONTNEED is executing, I think maybe *THAT* part is fine.
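As a concrete illustration of that unserialized case, here is a minimal, entirely made-up user-space sketch (a toy program, not kernel code, and not from any real workload): it races madvise(MADV_DONTNEED) against a writer and a reader on a private anonymous page, where the reader plays the part of the CPU in (a) that may keep seeing the old 'A' data through a stale TLB entry even after the writer in (b) has faulted in a fresh page and sees its own 'B':

	#include <pthread.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#define PG 4096

	static char *buf;

	/* The zapper: drops the page with no serialization at all against
	 * the other threads - an unserialized MADV_DONTNEED from user space. */
	static void *zapper(void *arg)
	{
		madvise(buf, PG, MADV_DONTNEED);
		return NULL;
	}

	/* The CPU in (b): may take a fault and install a *new* PTE mid-zap. */
	static void *writer(void *arg)
	{
		buf[0] = 'B';
		return NULL;
	}

	/* The CPU in (a): with a warm TLB entry it can keep reading the old
	 * 'A' data for a while, even after the writer already sees 'B'.
	 * The volatile read just keeps the compiler from hoisting the load. */
	static void *reader(void *arg)
	{
		int saw_a = 0, saw_b = 0, saw_zero = 0;

		for (int i = 0; i < 1000000; i++) {
			char c = *(volatile char *)buf;

			if (c == 'A')
				saw_a = 1;
			else if (c == 'B')
				saw_b = 1;
			else
				saw_zero = 1;
		}
		printf("reader saw:%s%s%s\n",
		       saw_a ? " A" : "", saw_b ? " B" : "", saw_zero ? " 0" : "");
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b, z;

		buf = mmap(NULL, PG, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		memset(buf, 'A', PG);

		pthread_create(&a, NULL, reader, NULL);
		pthread_create(&z, NULL, zapper, NULL);
		pthread_create(&b, NULL, writer, NULL);
		pthread_join(a, NULL);
		pthread_join(z, NULL);
		pthread_join(b, NULL);
		return 0;
	}

What the reader actually observes obviously depends on timing and on how long the stale TLB entry survives on its CPU, so this only shows where the window is - it's not a reliable reproducer - and the possible outcomes ('A', 'B', zero-fill, in any interleaving) are exactly the "you get to keep both pieces" semantics above.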
But because it's so very confusing, maybe we have other problems in this area, which is why I think it would be really good to have an actual set of real hard documented rules about this all, and exactly when we need to flush TLBs synchronously with the page table lock, and when we do not.

Anybody willing to try to write up the rules (and have each rule document *why* it's a rule - not just "by fiat", but an actual "these are the rules and this is *why* they are the rules")?

Because right now I think all of our rules are almost entirely just encoded in the code, with a couple of comments, and a few people who just remember why we do what we do.

                Linus