On Sun, Oct 30, 2022 at 11:19 AM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> And we'd _like_ to do the TLB flush before the remove_rmap(), but we
> *really* don't want to do that for every page.

Hmm. I have yet another crazy idea.

We could keep the current placement of the TLB flush, to just before we
drop the page table lock. And we could do all the things we do in
'page_remove_rmap()' right now *except* for the mapcount stuff, and only
move the mapcount code to the page freeing stage.

Because all the rmap() walk synchronization really needs is that
'page->_mapcount' is still elevated, and if it is, it will serialize with
the page table lock.

And it turns out that 'page_remove_rmap()' already treats the case we
care about differently, and all it does is

	lock_page_memcg(page);

	if (!PageAnon(page)) {
		page_remove_file_rmap(page, compound);
		goto out;
	}
	...
out:
	unlock_page_memcg(page);

	munlock_vma_page(page, vma, compound);

for that case. And that 'page_remove_file_rmap()' is literally the code
that modifies the _mapcount.

Annoyingly, this is all complicated by that 'compound' argument, but
that's always false in the zap_page_range() case.

So what we *could* do is make a new version of page_remove_rmap() which
is specialized for this case: no 'compound' argument (always false), and
it doesn't call 'page_remove_file_rmap()', because we'll do that for the
!PageAnon(page) case later, after the TLB flush.

That would keep the existing TLB flush logic, keep the existing 'mark
page dirty' behavior, and would just make sure that 'folio_mkclean()'
ends up being serialized with the TLB flush, simply because it will take
the page table lock, since we delay the '_mapcount' update until
afterwards.

Annoyingly, the organization of 'page_remove_rmap()' is a bit ugly, and
we have several other callers that want the existing logic, so while the
above sounds conceptually simple, I think the patch would be a bit messy.

            Linus
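
A minimal sketch of the specialized helper described above, not the actual
patch: the name 'page_zap_pte_rmap' is hypothetical, and only the
page_remove_rmap()/munlock_vma_page() signatures from that kernel series
are assumed.

	/*
	 * Hypothetical zap-time variant of page_remove_rmap() for the
	 * non-compound case.  For file-backed pages it deliberately does
	 * NOT call page_remove_file_rmap(), i.e. page->_mapcount stays
	 * elevated here; the mapcount drop happens later, at the page
	 * freeing stage, after the TLB flush.
	 */
	static void page_zap_pte_rmap(struct page *page, struct vm_area_struct *vma)
	{
		if (PageAnon(page)) {
			/* Anon pages: keep the existing path unchanged */
			page_remove_rmap(page, vma, false);
			return;
		}

		/*
		 * !PageAnon: skip the _mapcount update entirely; only the
		 * munlock side of page_remove_rmap() is done now.
		 */
		munlock_vma_page(page, vma, false);
	}

The only point of the sketch is to show where the mapcount drop falls out
of the zap path: file pages keep an elevated _mapcount until after the TLB
flush, so rmap walkers like folio_mkclean() still end up serializing on
the page table lock.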