On Wed, 23 Apr 2014, Jan Kara wrote: > On Tue 22-04-14 20:08:59, Hugh Dickins wrote: > > On Tue, 22 Apr 2014, Linus Torvalds wrote: > > > On Tue, Apr 22, 2014 at 12:54 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > That said, Dave Hansen did report a BUG_ON() in > > > mpage_prepare_extent_to_map(). His line number was odd, but I assume > > > it's this one: > > > > > > BUG_ON(PageWriteback(page)); > > > > > > which may be indicative of some oddity here wrt the dirty bit. > > > > Whereas later mail from Dave showed it to be the > > BUG_ON(!PagePrivate(page)); > > in page_buffers() from fs/ext4/inode.c mpage_prepare_extent_to_map(). > > But still presumably some kind of fallout from your patches. > > > > Once upon a time there was a page_has_buffers() check in there, > > but Honza decided that's nowadays unnecessary in f8bec37037ac > > "ext4: dirty page has always buffers attached". Cc'ed, > > he may very well have some good ideas. > > > > Reading that commit reminded me of how we actually don't expect that > > set_page_dirty() in zap_pte_range() to do anything at all on the usual > > mapping_cap_account_dirty()/page_mkwrite() filesystems, do we? Or do we? > Yes, for shared file mappings we (as in filesystems implementing > page_mkwrite() handler) expect a page is writeably mmapped iff the page is > dirty. So in particular we don't expect set_page_dirty() in zap_pte_range() > to do anything because if the pte has dirty bit set, we are tearing down a > writeable mapping of the page and thus the page should be already dirty. > > Now the devil is in synchronization of different places where transitions > from/to writeably-mapped state happen. In the fault path (do_wp_page()) > where transition to writeably-mapped happens we hold page lock while > calling set_page_dirty(). In the writeout path (clear_page_dirty_for_io()) > where we transition from writeably-mapped we hold the page lock as well > while calling page_mkclean() and possibly set_page_dirty(). So these two > places are nicely serialized. However zap_pte_range() doesn't hold page > lock so it can race with the previous two places. Before Linus' patches we > called set_page_dirty() under pte lock in zap_pte_range() and also before > decrementing page->mapcount. So if zap_pte_range() raced with > clear_page_dirty_for_io() we were guaranteed that by the time > clear_page_dirty_for_io() returns, pte dirty bit is cleared and > set_page_dirty() was called (either from clear_page_dirty_for_io() or from > zap_pte_range()). > > However with Linus' patches set_page_dirty() from zap_pte_range() gets > called after decremeting page->mapcount so page_mkclean() won't event try > to walk rmap. And even if page_mkclean() did walk the rmap, zap_pte_range() > calls set_page_dirty() after dropping pte lock so it can get called long > after page_mkclean() (and clear_page_dirty_for_io()) has returned. Right, thanks a lot for fleshing that out. > > Now I'm not sure how to fix Linus' patches. For all I care we could just > rip out pte dirty bit handling for file mappings. However last time I > suggested this you corrected me that tmpfs & ramfs need this. I assume this > is still the case - however, given we unconditionally mark the page dirty > for write faults, where exactly do we need this? Good, Linus has already replied to you on this this: you appear to be suggesting that there would be no issue, and Linus's patches would not be needed at all, if only tmpfs and ramfs played by the others' rules. But (sadly) I don't think that's so: just because zap_pte_range()'s current "if (pte_dirty) set_page_dirty" does nothing on most filesystems, does not imply that nothing needs to be done on most filesystems, now that we're alert to the delayed TLB flushing issue. Just to answer your (interesting but irrelevant!) question about tmpfs and ramfs: their issue is with read faults which bring in a zeroed page, with page and pte not marked dirty. If userspace modifies that page, the pte_dirty needs to be propagated through to PageDirty, to prevent page reclaim from simply freeing the apparently clean page. (To be honest, I haven't checked the ramfs case recently: perhaps its pages are marked unevictable, and never even reach the code which might free them.) Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-arch" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html