On Thu, 24 Apr 2014, Peter Zijlstra wrote:
> On Wed, Apr 23, 2014 at 12:33:15PM -0700, Linus Torvalds wrote:
> > On Wed, Apr 23, 2014 at 11:41 AM, Jan Kara <jack@xxxxxxx> wrote:
> > >
> > >   Now I'm not sure how to fix Linus' patches. For all I care we could just
> > > rip out pte dirty bit handling for file mappings. However last time I
> > > suggested this you corrected me that tmpfs & ramfs need this. I assume this
> > > is still the case - however, given we unconditionally mark the page dirty
> > > for write faults, where exactly do we need this?
> >
> > Honza, you're missing the important part: it does not matter one whit
> > that we unconditionally mark the page dirty, when we do it *early*,
> > and it can then be marked clean before it's actually clean!
> >
> > The problem is that page cleaning can clean the page when there are
> > still writers dirtying the page. Page table tear-down removes the
> > entry from the page tables, but it's still there in the TLB on other
> > CPU's.
> >
> > So other CPU's are possibly writing to the page, when
> > clear_page_dirty_for_io() has marked it clean (because it didn't see
> > the page table entries that got torn down, and it hasn't seen the
> > dirty bit in the page yet).
>
> So page_mkclean() does an rmap walk to mark the page RO, it does mm wide
> TLB invalidations while doing so.
>
> zap_pte_range() only removes the rmap entry after it does the
> ptep_get_and_clear_full(), which doesn't do any TLB invalidates.
>
> So as Linus says the page_mkclean() can actually miss a page, because
> it's already removed from rmap() but due to zap_pte_range() can still
> have active TLB entries.
>
> So in order to fix this we'd have to delay the page_remove_rmap() as
> well, but that's tricky because rmap walkers cannot deal with
> in-progress teardown. Will need to think on this.

I've grown sceptical that Linus's approach can be made to work safely
with page_mkclean(), as it stands at present anyway.
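To make the window concrete: this is roughly the ordering in today's
zap_pte_range(), hand-waved and heavily simplified (not the actual
kernel code; batching and error paths omitted):

```c
/*
 * CPU 0: zap_pte_range()               CPU 1: another thread of same mm
 * ---------------------------------    ---------------------------------
 * ptent = ptep_get_and_clear_full();   /* stale writable TLB entry      */
 *   /* no TLB flush here */            /* still maps the page: stores   */
 * if (pte_dirty(ptent))                /* keep landing in the page      */
 *         set_page_dirty(page);
 * page_remove_rmap(page);
 *                                      /* page_mkclean()'s rmap walk    */
 *                                      /* no longer finds this pte, so  */
 *                                      /* clear_page_dirty_for_io() can */
 *                                      /* mark the page clean while     */
 *                                      /* CPU 1 is still writing it     */
 * pte_unmap_unlock(pte, ptl);
 * /* TLB only invalidated later, at tlb_finish_mmu() */
 */
```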
I think that (in the exceptional case when a dirty pte of a shared file
mapping has been encountered, and this mm is active on other cpus)
zap_pte_range() needs to flush TLB on the other cpus of this mm, just
before its pte_unmap_unlock(): then it respects the usual page_mkclean()
protocol. Or has that already been rejected earlier in the thread, as
too costly for some common case?

Hugh
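Something like the following is what I have in mind, sketched against
zap_pte_range(); purely illustrative and untested, the flush_needed
flag and placement are my invention, and one would also want the
"mm active on other cpus" check (mm_cpumask() or similar) before
paying for the flush:

```c
	int flush_needed = 0;
	...
	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	if (pte_dirty(ptent) && (vma->vm_flags & VM_SHARED))
		flush_needed = 1;	/* page_mkclean() may be racing */
	...
	if (flush_needed)
		flush_tlb_mm(mm);	/* other cpus stop writing before
					 * the pte lock is dropped */
	pte_unmap_unlock(pte - 1, ptl);
```

That keeps the expensive cross-cpu flush off the common path of
anonymous and clean file ptes, which is why I hope it has not already
been ruled out as too costly.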