On Wed, Apr 23, 2014 at 12:33:15PM -0700, Linus Torvalds wrote: > On Wed, Apr 23, 2014 at 11:41 AM, Jan Kara <jack@xxxxxxx> wrote: > > > > Now I'm not sure how to fix Linus' patches. For all I care we could just > > rip out pte dirty bit handling for file mappings. However last time I > > suggested this you corrected me that tmpfs & ramfs need this. I assume this > > is still the case - however, given we unconditionally mark the page dirty > > for write faults, where exactly do we need this? > > Honza, you're missing the important part: it does not matter one whit > that we unconditionally mark the page dirty, when we do it *early*, > and it can be then be marked clean before it's actually clean! > > The problem is that page cleaning can clean the page when there are > still writers dirtying the page. Page table tear-down removes the > entry from the page tables, but it's still there in the TLB on other > CPU's. > So other CPU's are possibly writing to the page, when > clear_page_dirty_for_io() has marked it clean (because it didn't see > the page table entries that got torn down, and it hasn't seen the > dirty bit in the page yet). So page_mkclean() does an rmap walk to mark the page RO, it does mm wide TLB invalidations while doing so. zap_pte_range() only removes the rmap entry after it does the ptep_get_and_clear_full(), which doesn't do any TLB invalidates. So as Linus says the page_mkclean() can actually miss a page, because it's already removed from rmap() but due to zap_pte_range() can still have active TLB entries. So in order to fix this we'd have to delay the page_remove_rmap() as well, but that's tricky because rmap walkers cannot deal with in-progress teardown. Will need to think on this. -- To unsubscribe from this list: send the line "unsubscribe linux-arch" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html