On Wed, Jul 19, 2017 at 11:14:17AM -0700, Nadav Amit wrote:
> >
> > Technically, DONTNEED is not required to zero-fill the data but in the
> > case of Linux, it actually does matter because the stale entry is
> > pointing to a page that will be freed shortly. If a caller returns and
> > uses a stale TLB entry to "reinitialise" the region then the writes may
> > be lost.
>
> And although I didn't check, it may have some implications on userfaultfd,
> which is often used with MADV_DONTNEED.
>

Potentially, although I also consider it unlikely that a user of
userfaultfd would be racing two madvises while copying out data. Then
again, I do not know the userspace implementation of anything that uses
userfaultfd. (A rough userspace sketch of the racing-madvise pattern in
question is appended below.)

> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 9976852f1e1c..78bbe09e549e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -497,6 +497,18 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> >  					unsigned long start, unsigned long end)
> >  {
> >  	zap_page_range(vma, start, end - start);
> > +
> > +	/*
> > +	 * A parallel madvise operation could have unmapped PTEs and deferred
> > +	 * a flush before this madvise returns. Guarantee the TLB is flushed
> > +	 * so that an immediate read after madvise will return zeros for
> > +	 * private anonymous mappings. File-backed shared mappings do not
> > +	 * matter as they will either use a stale TLB entry or refault the
> > +	 * data in the event of a race.
> > +	 */
> > +	if (vma_is_anonymous(vma))
> > +		flush_tlb_range(vma, start, end);
> > +
> >  	return 0;
> >  }
>
> It would work in this case but would very often result in a redundant
> TLB flush.

It's one additional flush per anonymous VMA that is unmapped. That is
unfortunate but not excessive, except maybe in the worst case of
unmapping single-page VMAs. The larger the VMA, the smaller the relative
cost.

> I also think that we still don't understand the extent of the problem,
> based on the issues that keep coming out. In this case it may be better
> to be defensive and not try to avoid flushes too aggressively (e.g., on
> non-anonymous VMAs).
>

Well, for file-backed or shared mappings the data will either be clean,
which means it's write-protected because of how dirty page tracking
works, so it can be discarded, retrieved from disk and read safely
through a stale TLB entry as long as that entry is flushed before the
page is freed; or it will be dirty, in which case the TLB will be
flushed before any IO starts. That's why I only checked for the
anonymous case.

> Here is what I have in mind (not tested). Based on whether mmap_sem is
> acquired for write, exclusiveness is determined. If exclusiveness is not
> maintained, a TLB flush is required. If I could use the owner field of
> rwsem (when available), this could simplify the check of whether the
> code runs exclusively.
>
> Having said that, I still think that the whole batching scheme needs to
> be unified and rethought.
>

This is a bit of overkill in that it also covers file-backed and shared
mappings, but if you quantify it and see that it's not a problem, then I
have no objection to the patch either (a rough sketch of the direction
you describe is appended below). You may receive feedback that altering
the API universally is undesirable and be asked to modify it so that
only the madvise() caller that is directly affected is updated, given
that two madvise() calls can race and leave one returning to userspace
with stale TLB entries pointing to anonymous memory that is about to be
freed.

--
Mel Gorman
SUSE Labs
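
For anyone reasoning about the window being discussed, the following is an
untested userspace sketch (not from the original mails) of the racing-madvise
pattern: two threads issue MADV_DONTNEED on the same private anonymous
mapping while one of them immediately reinitialises it. The names, the 4MB
size and the 0xaa fill are arbitrary; this is not a reliable reproducer of
the lost-write window, only an illustration of the access pattern the flush
in the diff above guards against. Build with -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION_SIZE (4UL << 20)		/* 4MB, arbitrary for the example */

static char *region;

static void *discard_and_reuse(void *arg)
{
	/* Discard the region, then immediately "reinitialise" it. */
	madvise(region, REGION_SIZE, MADV_DONTNEED);
	memset(region, 0xaa, REGION_SIZE);
	return NULL;
}

static void *discard_only(void *arg)
{
	/* A second, racing MADV_DONTNEED on the same range. */
	madvise(region, REGION_SIZE, MADV_DONTNEED);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pthread_create(&t1, NULL, discard_and_reuse, NULL);
	pthread_create(&t2, NULL, discard_only, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/*
	 * With the TLB flushed before madvise() returns, none of the 0xaa
	 * writes made by t1 should be lost to a page freed by t2's zap.
	 */
	printf("first byte after reuse: 0x%02x\n", (unsigned char)region[0]);
	return 0;
}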
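
And, purely to make the alternative concrete, an untested sketch of the
exclusive-flag direction described in the quoted text. This is not the
actual patch Nadav refers to; the extra "exclusive" parameter and its
plumbing from the madvise() path (set when mmap_sem was taken for write)
are hypothetical and only illustrate the idea against the same function
the diff above modifies.

/*
 * Hypothetical sketch only: an "exclusive" flag would be passed down by
 * the madvise() caller based on whether mmap_sem was taken for write.
 * If the operation did not run exclusively, another thread may have
 * unmapped PTEs in the same range and deferred its TLB flush, so flush
 * here for all VMA types rather than only anonymous ones.
 */
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end,
					bool exclusive)
{
	zap_page_range(vma, start, end - start);

	/* Only safe to skip the flush if no other unmapper could have run. */
	if (!exclusive)
		flush_tlb_range(vma, start, end);

	return 0;
}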