Re: Dirty/Access bits vs. page content

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Mon, 28 Apr 2014 11:25:40 +0200

On Sun, Apr 27, 2014 at 01:09:54PM -0700, Hugh Dickins wrote:
> On Sun, 27 Apr 2014, Hugh Dickins wrote:
> > 
> > But woke with a panic attack that we have overlooked the question
> > of how page reclaim's page_mapped() checks are serialized.
> > Perhaps this concern will evaporate with the morning dew,
> > perhaps it will not...
> 
> It was a real concern, but we happen to be rescued by the innocuous-
> looking is_page_cache_freeable() check at the beginning of pageout():
> which will deserve its own comment, but that can follow later.
> 
> My concern was with page reclaim's shrink_page_list() racing against
> munmap's or exit's (or madvise's) zap_pte_range() unmapping the page.
> 
> Once zap_pte_range() has cleared the pte from a vma, neither
> try_to_unmap() nor page_mkclean() will see that vma as containing
> the page, so neither will do its own TLB flush of the cpus involved,
> before proceeding to writepage.
> 
> Linus's patch (serializing with ptlock) or my patch (serializing
> with i_mmap_mutex) both almost fix that, but it seemed not entirely:
> because try_to_unmap() is only called when page_mapped(), and
> page_mkclean() quits early without taking locks when !page_mapped().

Argh!! very good spotting that.

> So in the interval when zap_pte_range() has brought page_mapcount()
> down to 0, but not yet flushed TLB on all mapping cpus, it looked as
> if we still had a problem - neither try_to_unmap() nor page_mkclean()
> would take the lock either of us rely upon for serialization.
> 
> But pageout()'s preliminary is_page_cache_freeable() check makes
> it safe in the end: although page_mapcount() has gone down to 0,
> page_count() remains raised until the free_pages_and_swap_cache()
> after the TLB flush.
> 
> So I now believe we're safe after all with either patch, and happy
> for Linus to go ahead with his.

OK, so I'm just not seeing that atm. Will have another peek later,
hopefully when more fully awake.
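
For reference, a minimal userspace sketch of the refcount argument
quoted above. It is a toy model, not kernel code: struct model_page and
every model_*() helper are hypothetical names, and the only thing taken
from mm/vmscan.c is the shape of the is_page_cache_freeable() test
(page_count() minus page_has_private() equal to 2). It assumes a
page-cache page with one mapping that reclaim has already isolated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the counters matter here. */
struct model_page {
	int count;        /* models page_count(): all references held */
	int mapcount;     /* models page_mapcount(): ptes mapping the page */
	bool has_private; /* models page_has_private(): buffers attached */
};

/* Same shape as the kernel check: only the isolating caller and the
 * page cache may hold references (plus one for buffers, if any). */
static bool model_is_page_cache_freeable(const struct model_page *p)
{
	return p->count - (p->has_private ? 1 : 0) == 2;
}

/* Assumed starting state: page cache ref + one mapping's ref +
 * the ref taken when reclaim isolated the page. */
static struct model_page model_mapped_isolated_page(void)
{
	struct model_page p = { .count = 3, .mapcount = 1, .has_private = false };
	return p;
}

/* zap_pte_range() step: the pte is cleared and mapcount drops, but the
 * mapping's reference is kept until after the TLB flush. */
static void model_zap_pte(struct model_page *p)
{
	p->mapcount--;
}

/* After the TLB flush: free_pages_and_swap_cache() drops that reference. */
static void model_flush_then_free(struct model_page *p)
{
	p->count--;
}

/* Walk through the window in question and check each state. */
static void model_check_window(void)
{
	struct model_page p = model_mapped_isolated_page();

	model_zap_pte(&p);
	/* Unmapped, TLB not yet flushed: pageout() must still back off. */
	assert(p.mapcount == 0 && !model_is_page_cache_freeable(&p));

	model_flush_then_free(&p);
	/* Only now does the page look freeable to reclaim. */
	assert(model_is_page_cache_freeable(&p));
}
```

In this model the page never looks freeable while mapcount is 0 and the
flush is still pending, which is the window being discussed.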

> Peter, returning at last to your question of whether we could exempt
> shmem from the added overhead of either patch.  Until just now I
> thought not, because of the possibility that the shmem_writepage()
> could occur while one of the mm's cpus remote from zap_pte_range()
> cpu was still modifying the page.  But now that I see the role
> played by is_page_cache_freeable(), and of course the zapping end
> has never dropped its reference on the page before the TLB flush,
> however late that occurred, hmmm, maybe yes, shmem can be exempted.
> 
> But I'd prefer to dwell on that a bit longer: we can add that as
> an optimization later if it holds up to scrutiny.

For sure.. No need to rush that. And if a (performance) regression shows
up in the meantime, we immediately have a good test case too :-)