On Thu, 24 Apr 2014, Peter Zijlstra wrote:
> On Wed, Apr 23, 2014 at 12:33:15PM -0700, Linus Torvalds wrote:
> > On Wed, Apr 23, 2014 at 11:41 AM, Jan Kara <jack@xxxxxxx> wrote:
> > >
> > >   Now I'm not sure how to fix Linus' patches. For all I care we could just
> > > rip out pte dirty bit handling for file mappings. However last time I
> > > suggested this you corrected me that tmpfs & ramfs need this. I assume this
> > > is still the case - however, given we unconditionally mark the page dirty
> > > for write faults, where exactly do we need this?
> >
> > Honza, you're missing the important part: it does not matter one whit
> > that we unconditionally mark the page dirty, when we do it *early*,
> > and it can then be marked clean before it's actually clean!
> >
> > The problem is that page cleaning can clean the page when there are
> > still writers dirtying the page. Page table tear-down removes the
> > entry from the page tables, but it's still there in the TLB on other
> > CPU's.
> >
> > So other CPU's are possibly writing to the page, when
> > clear_page_dirty_for_io() has marked it clean (because it didn't see
> > the page table entries that got torn down, and it hasn't seen the
> > dirty bit in the page yet).
>
> So page_mkclean() does an rmap walk to mark the page RO, it does mm wide
> TLB invalidations while doing so.
>
> zap_pte_range() only removes the rmap entry after it does the
> ptep_get_and_clear_full(), which doesn't do any TLB invalidates.
>
> So as Linus says the page_mkclean() can actually miss a page, because
> it's already removed from rmap() but due to zap_pte_range() can still
> have active TLB entries.
>
> So in order to fix this we'd have to delay the page_remove_rmap() as
> well, but that's tricky because rmap walkers cannot deal with
> in-progress teardown. Will need to think on this.

I've grown sceptical that Linus's approach can be made to work safely
with page_mkclean(), as it stands at present anyway.
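To make the window concrete: this is roughly the ordering in today's
zap_pte_range(), hand-waved and heavily simplified (not the actual
kernel code; batching and error paths omitted):

```c
/*
 * CPU 0: zap_pte_range()               CPU 1: another thread of same mm
 * ---------------------------------    ---------------------------------
 * ptent = ptep_get_and_clear_full();   /* stale writable TLB entry      */
 *   /* no TLB flush here */            /* still maps the page: stores   */
 * if (pte_dirty(ptent))                /* keep landing in the page      */
 *         set_page_dirty(page);
 * page_remove_rmap(page);
 *                                      /* page_mkclean()'s rmap walk    */
 *                                      /* no longer finds this pte, so  */
 *                                      /* clear_page_dirty_for_io() can */
 *                                      /* mark the page clean while     */
 *                                      /* CPU 1 is still writing it     */
 * pte_unmap_unlock(pte, ptl);
 * /* TLB only invalidated later, at tlb_finish_mmu() */
 */
```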
I think that (in the exceptional case when a dirty pte of a shared file
mapping has been encountered, and this mm is active on other cpus)
zap_pte_range() needs to flush TLB on the other cpus of this mm, just
before its pte_unmap_unlock(): then it respects the usual page_mkclean()
protocol. Or has that already been rejected earlier in the thread, as
too costly for some common case?

Hugh
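Something like the following is what I have in mind, sketched against
zap_pte_range(); purely illustrative and untested, the flush_needed
flag and placement are my invention, and one would also want the
"mm active on other cpus" check (mm_cpumask() or similar) before
paying for the flush:

```c
	int flush_needed = 0;
	...
	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	if (pte_dirty(ptent) && (vma->vm_flags & VM_SHARED))
		flush_needed = 1;	/* page_mkclean() may be racing */
	...
	if (flush_needed)
		flush_tlb_mm(mm);	/* other cpus stop writing before
					 * the pte lock is dropped */
	pte_unmap_unlock(pte - 1, ptl);
```

That keeps the expensive cross-cpu flush off the common path of
anonymous and clean file ptes, which is why I hope it has not already
been ruled out as too costly.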