On Wed, Jul 19, 2017 at 11:14:17AM -0700, Nadav Amit wrote:
> >
> > Technically, DONTNEED is not required to zero-fill the data but in the
> > case of Linux, it actually does matter because the stale entry is
> > pointing to a page that will be freed shortly. If a caller returns and
> > uses a stale TLB entry to "reinitialise" the region then the writes may
> > be lost.
>
> And although I didn't check, it may have some implications on userfaultfd,
> which is often used with MADV_DONTNEED.
>

Potentially, although I also consider it unlikely that a user of
userfaultfd would be racing two madvises while copying out data. Then
again, I do not know the userspace implementation of anything that uses
userfaultfd. (A rough userspace sketch of the racing-madvise pattern in
question is appended below.)

> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 9976852f1e1c..78bbe09e549e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -497,6 +497,18 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> >  					unsigned long start, unsigned long end)
> >  {
> >  	zap_page_range(vma, start, end - start);
> > +
> > +	/*
> > +	 * A parallel madvise operation could have unmapped PTEs and deferred
> > +	 * a flush before this madvise returns. Guarantee the TLB is flushed
> > +	 * so that an immediate read after madvise will return zeros for
> > +	 * private anonymous mappings. File-backed shared mappings do not
> > +	 * matter as they will either use a stale TLB entry or refault the
> > +	 * data in the event of a race.
> > +	 */
> > +	if (vma_is_anonymous(vma))
> > +		flush_tlb_range(vma, start, end);
> > +
> >  	return 0;
> >  }
>
> It would work in this case but would very often result in a redundant
> TLB flush.

It's one additional flush per anonymous VMA that is unmapped. That is
unfortunate but not excessive, except maybe in the worst case of
unmapping single-page VMAs. The larger the VMA, the smaller the relative
cost.

> I also think that we still don't understand the extent of the problem,
> based on the issues that keep coming out. In this case it may be better
> to be defensive and not try to avoid flushes too aggressively (e.g., on
> non-anonymous VMAs).
>

Well, for file-backed or shared mappings the data will either be clean,
which means it's write-protected because of how dirty page tracking
works, so it can be discarded, retrieved from disk and read safely
through a stale TLB entry as long as that entry is flushed before the
page is freed; or it will be dirty, in which case the TLB will be
flushed before any IO starts. That's why I only checked for the
anonymous case.

> Here is what I have in mind (not tested). Based on whether mmap_sem is
> acquired for write, exclusiveness is determined. If exclusiveness is not
> maintained, a TLB flush is required. If I could use the owner field of
> rwsem (when available), this could simplify the check of whether the
> code runs exclusively.
>
> Having said that, I still think that the whole batching scheme needs to
> be unified and rethought.
>

This is a bit of overkill in that it also covers file-backed and shared
mappings, but if you quantify it and see that it's not a problem, then I
have no objection to the patch either (a rough sketch of the direction
you describe is appended below). You may receive feedback that altering
the API universally is undesirable and be asked to modify it so that
only the madvise() caller that is directly affected is updated, given
that two madvise() calls can race and leave one returning to userspace
with stale TLB entries pointing to anonymous memory that is about to be
freed.

--
Mel Gorman
SUSE Labs
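
For anyone reasoning about the window being discussed, the following is an
untested userspace sketch (not from the original mails) of the racing-madvise
pattern: two threads issue MADV_DONTNEED on the same private anonymous
mapping while one of them immediately reinitialises it. The names, the 4MB
size and the 0xaa fill are arbitrary; this is not a reliable reproducer of
the lost-write window, only an illustration of the access pattern the flush
in the diff above guards against. Build with -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION_SIZE (4UL << 20)		/* 4MB, arbitrary for the example */

static char *region;

static void *discard_and_reuse(void *arg)
{
	/* Discard the region, then immediately "reinitialise" it. */
	madvise(region, REGION_SIZE, MADV_DONTNEED);
	memset(region, 0xaa, REGION_SIZE);
	return NULL;
}

static void *discard_only(void *arg)
{
	/* A second, racing MADV_DONTNEED on the same range. */
	madvise(region, REGION_SIZE, MADV_DONTNEED);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pthread_create(&t1, NULL, discard_and_reuse, NULL);
	pthread_create(&t2, NULL, discard_only, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/*
	 * With the TLB flushed before madvise() returns, none of the 0xaa
	 * writes made by t1 should be lost to a page freed by t2's zap.
	 */
	printf("first byte after reuse: 0x%02x\n", (unsigned char)region[0]);
	return 0;
}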
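
And, purely to make the alternative concrete, an untested sketch of the
exclusive-flag direction described in the quoted text. This is not the
actual patch Nadav refers to; the extra "exclusive" parameter and its
plumbing from the madvise() path (set when mmap_sem was taken for write)
are hypothetical and only illustrate the idea against the same function
the diff above modifies.

/*
 * Hypothetical sketch only: an "exclusive" flag would be passed down by
 * the madvise() caller based on whether mmap_sem was taken for write.
 * If the operation did not run exclusively, another thread may have
 * unmapped PTEs in the same range and deferred its TLB flush, so flush
 * here for all VMA types rather than only anonymous ones.
 */
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end,
					bool exclusive)
{
	zap_page_range(vma, start, end - start);

	/* Only safe to skip the flush if no other unmapper could have run. */
	if (!exclusive)
		flush_tlb_range(vma, start, end);

	return 0;
}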