On Fri, Oct 28, 2022 at 4:57 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote: > > The problem is in the following code of zap_pte_range(): > > if (!PageAnon(page)) { > if (pte_dirty(ptent)) { > force_flush = 1; > set_page_dirty(page); > } > … > } > page_remove_rmap(page, vma, false); > > Once we remove the rmap, rmap_walk() would not acquire the page-table lock > anymore. As a result, nothing prevents the kernel from performing writeback > and cleaning the page-struct dirty-bit before the TLB is actually flushed. Hah. The original reason for force_flush there was similar, with a race wrt page_mkclean() because this code doesn't take the page lock that would normally serialize things, because the page lock is so painful and ends up having nasty nasty interactions with slow IO operations. So all the page-dirty handling really *wants* to use the page lock, and for the IO side (writeback) that ends up being acceptable and works well, but from that "serialize VM state" it's horrendous. So instead the code intentionally serialized on the rmap data structures which page_mkclean() also walks, and as you point out, that's broken. It's not broken at the point where we do set_page_dirty(), but it *comes* broken when we drop the rmap, and the problem is exactly that "we still have the dirty bit hidden in the TLB state" issue that you pinpoint. I think the proper fix (or at least _a_ proper fix) would be to actually carry the dirty bit along to the __tlb_remove_page() point, and actually treat it exactly the same way as the page pointer itself - set the page dirty after the TLB flush, the same way we can free the page after the TLB flush. We could easiy hide said dirty bit in the low bits of the "batch->pages[]" array or something like that. We'd just have to add the 'dirty' argument to __tlb_remove_page_size() and friends. Hmm? Your idea of "do the page_remove_rmap() late instead" would also work, but the reason I think just squirrelling away the dirty bit is the "proper" fix is that it would get rid of the whole need for 'force_flush' in this area entirely. So we'd not only fix that race you noticed, we'd actually do so and reduce the number of TLB flushes too. I don't know. Maybe I'm missing something fundamental, and my idea is just stupid. Linus