Re: TLB batching breaks MADV_DONTNEED

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jul 18, 2017 at 10:05:23PM -0700, Nadav Amit wrote:
> Something seems to be really wrong with all these TLB flush batching
> mechanisms that are all around kernel. Here is another example, which was
> not addressed by the recently submitted patches.
> 
> Consider what happens when two MADV_DONTNEED run concurrently. According to
> the man page "After a successful MADV_DONTNEED operation ??? subsequent
> accesses of pages in the range will succeed, but will result in ???
> zero-fill-on-demand pages for anonymous private mappings.???
> 
> However, the test below, which does MADV_DONTNEED in two threads, reads ???8???
> and not ???0??? when reading the memory following MADV_DONTNEED. It happens
> since one of the threads clears the PTE, but defers the TLB flush for some
> time (until it finishes changing 16k PTEs). The main thread sees the PTE
> already non-present and does not flush the TLB.
> 
> I think there is a need for a batching scheme that considers whether
> mmap_sem is taken for write/read/nothing and the change to the PTE.
> Unfortunately, I do not have the time to do it right now.
> 
> Am I missing something? Thoughts?
> 

You're right that in this case, there will be a short window when the old
anonymous data is still available. Non-anonymous doesn't matter in this case
as the if the data is unmapped but available from a stale TLB entry, all it
means is that there is a delay in refetching the data from backing storage.

Technically, DONTNEED is not required to zero-fill the data but in the
case of Linux, it actually does matter because the stale entry is
pointing to page that will be freed shortly. If a caller returns and
uses a stale TLB entry to "reinitialise" the region then the writes may
be lost.

This is independent of the reclaim batching of flushes and specific to
how madvise uses zap_page_range.

The most straight-forward but overkill solution would be to take mmap_sem
for write for madvise. That would have wide-ranging consequences and likely
to be rejected.

A more reasonable solution would be to always flush the TLB range being
madvised when the VMA is a private anonymous mapping to guarantee that
a zero-fill-on-demand region exists. Other mappings do not need special
protection as a parallel access will either use a stale TLB (no permission
change so no problem) or refault the data. Special casing based on
mmap_sem does not make much sense but is also unnecessary.

Something like this completely untested patch that would point in the
general direction if a case can be found where this should be fixed. It
could be optimised to only flush the local TLB but it's probably not worth
the complexity.

diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..78bbe09e549e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -497,6 +497,18 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
 	zap_page_range(vma, start, end - start);
+
+	/*
+	 * A parallel madvise operation could have unmapped PTEs and deferred
+	 * a flush before this madvise returns. Guarantee the TLB is flushed
+	 * so that an immediate read after madvise will return zero's for
+	 * private anonymous mappings. File-backed shared mappings do not
+	 * matter as they will either use a stale TLB entry or refault the
+	 * data in the event of a race.
+	 */
+	if (vma_is_anonymous(vma))
+		flush_tlb_range(vma, start, end);
+	
 	return 0;
 }
 


-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]
  Powered by Linux