On Mon, Jul 27, 2020 at 3:43 PM Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:
>
> With the commit ("mm: drop mmap_sem before calling balance_dirty_pages()
> in write fault") the retried fault may happen much more frequently than
> before since it would drop mmap lock as long as dirty throttling happens.

Sure. And that probably explains why it shows up as a regression.

That said, the fact that it showed up as a regression that clearly
probably means that the whole spurious TLB flush has always been a
low-grade problem, and the extra retries just made it much more
noticeable because now there was a change to it.

The fact that we have that (very questionable) optimization to only do
it for writes kind of reinforces that notion - it has happened before,
it's just never been fixed properly, and it's just never been
noticeable on most machines because this is all a no-op on x86.

I think Catalin's patch - with some way to fix the problem with KVM -
is the way to go.

That said, testing FAULT_FLAG_TRIED and suppressing the spurious TLB
flush for that case is certainly always safe. At worst, we'll take
another fault, and then do the TLB flush at _that_ point when not
retrying.

So it's the FAULT_FLAG_WRITE test that I think is bogus, or at least
should be protected by some architecture decision (with a comment
about why it's ok for that architecture, ie the ARM kind of "old PTE's
will never be in the TLB, and if it's not a write fault we know it
doesn't depend on the dirty bit either").

Of course, it may be that on every architecture that requires SW
accessed bits, the "old PTE's will never be in the TLB" is true.

Except I think I know at least one architecture where that isn't true.
On alpha, the way the accessed bit works is exactly the same way the
dirty bit works - except it's done for reads, instead of writes. So on
at least one architecture, access faults and dirty faults are 100%
equivalent, just using read/write bits respectively.

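To make the distinction concrete, here is a rough user-space sketch of
the decision being argued for. It is not kernel code: the flag values,
"struct arch_model" and want_spurious_flush() are all made up for
illustration, and only the names FAULT_FLAG_WRITE and FAULT_FLAG_TRIED
come from the discussion. It assumes the architecture can state whether
an old PTE can ever sit in the TLB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; the kernel's own values differ. */
#define FAULT_FLAG_WRITE 0x01u
#define FAULT_FLAG_TRIED 0x02u

/*
 * Hypothetical per-architecture property: true when a PTE without the
 * accessed bit can never be cached in the TLB (the ARM-style argument).
 * An alpha-like architecture would set this to false.
 */
struct arch_model {
	bool old_pte_never_in_tlb;
};

/*
 * Decide whether a fault that found nothing to change in the PTE
 * (a spurious fault) should flush the local TLB entry.
 */
bool want_spurious_flush(const struct arch_model *arch, unsigned int flags)
{
	assert(arch != NULL);

	/*
	 * Retried fault: skipping the flush is always safe. At worst we
	 * fault once more, and that non-retry fault does the flush.
	 */
	if (flags & FAULT_FLAG_TRIED)
		return false;

	/*
	 * Read fault: skipping is only justified when the architecture
	 * guarantees that a stale old PTE cannot be in the TLB.
	 */
	if (!(flags & FAULT_FLAG_WRITE) && arch->old_pte_never_in_tlb)
		return false;

	return true;
}
```

With old_pte_never_in_tlb set to false (the alpha-like case), a
non-retried read fault still gets its flush, which is exactly what a
bare FAULT_FLAG_WRITE test would skip.
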
Of course, alpha doesn't really matter any more. But it's an example
of an architecture where "old" does not necessarily mean "cannot be in
the TLB", and where testing for FAULT_FLAG_WRITE looks buggy.

Again: I think in practice, it's really *really* hard to hit the
problem with accessed bits, unlike dirty bits. Normally, PTE's are all
instantiated young if they are in the TLB. You have to kind of work at
it to get an old PTE _and_ then hit the "now access it exactly at the
same time from two different CPU's, and watch one CPU keep taking page
faults forever because it never flushes its TLB entry".

Of course, it is so long since I worked with alpha that maybe there's
some other reason this can't happen. Like "PAL-code always flushes the
TLB entry of the faulting address".

Which all hardware should do, dammit. It's all kinds of stupid to
cache a faulting TLB entry. The fault is thousands of times more
expensive than a reload would be, even if it were intentional and done
repeatedly (which sounds like an insane thing to optimize for anyway).

So one way to fix this problem would be to just specify that "every
pagefault handler _must_ flush the local-CPU TLB entry that the fault
happened for if the architecture doesn't already do that in hardware
or microcode".

And then we'd just remove the spurious TLB flush code entirely.

             Linus