On Thu, Jan 7, 2021 at 3:48 PM Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote: > > > The alternate fix remains "make sure we always flush the TLB before > > releasing the page table lock, and make COW do the copy under the page > > table lock". But I really would prefer to just have this code follow > The copy under PT lock isn't enough. > > Flush TLB before releasing is enough of course. Right. That's why I said "and". You need both, afaik. But if we just do the mmap write lock, you need neither - then you just need to flush before you release the write lock. > Note also the patch in 2/2 patch that I sent: Yes, yes, and that's what I'm objecting to. All these stupid games with "flush_pending(" counts are complete garbage. They all come from the fact that this code doesn't hold the right lock. I don't understand you: in one breath you seem to say "yes, taking the write lock is the right thing to do", and then in the next one you point to this patch that adds all this garbage *because* it's not holding the write lock. All of those "tlb_flush_pending" things are wrong. They should not exist. The code in clear_refs_write() should hold the mmap_sem for writing, and do the TLB flush before it releases the mmap sem, and then it *cannot* race with the page faults. See what I'm saying? I refuse to apply your patch 2/2, because it all seems entirely wrong. What's doubly ludicrous about that is that the coide already _took_ the mmap_sem for writing, and spent extra cycles to downgrade it - INCORRECTLY - to a read-lock. And as far as I can tell, it doesn't even do anything expensive inside that (now downgraded) region, so the downgrading was (a) buggy (b) slower than just keeping the lock the way it had and (b) is because it had already done the expensive part (which was to get the lock in the first place). Just as an example, the whole "Rollback wrprotect_tlb_flush_pending" is all because it got the lock - again wrongly - as a read-lock initially, then it says "oh, I need to get a write lock", releases it, re-takes it as a write lock, does a tiny amount of work, and then - again incorrectly - downgrades it to a read-lock. To make it even worse (if that is possible) it actually had YET ANOTHER case - that CLEAR_REFS_MM_HIWATER_RSS - where it took the mmap sem for writing, did its thing, and then released it. So there's like *four* different locking mistakes in that single function. And it's not even an important function to begin with. It shgould just have done a single mmap_write_lock_killable(mm); ... mmap_write_unlock(mm); around the whole thing, instead of _any_ of that crazy stuff. That code is WRONG. And your PATCH 2/2 makes that insane code EVEN WORSE. Why the heck is that magic fs/proc/ interface allowed to get VM internals so wrong, and make things so much worse? Can you not see why I'm arguing with you? Please. Why is the correct patch not the attached one (apart from the obvious fact that I haven't tested it and maybe just screwed up completely - but you get the idea)? Linus
Attachment:
patch
Description: Binary data