On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> In case prot_numa, we are under down_read(mmap_sem). It's critical
> to not clear pmd intermittently to avoid race with MADV_DONTNEED
> which is also under down_read(mmap_sem):
>
>	CPU0:				CPU1:
>				change_huge_pmd(prot_numa=1)
>				 pmdp_huge_get_and_clear_notify()
> madvise_dontneed()
>  zap_pmd_range()
>   pmd_trans_huge(*pmd) == 0 (without ptl)
>   // skip the pmd
>				 set_pmd_at();
>				 // pmd is re-established
>
> The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> which may break userspace.
>
> Found by code analysis, never saw triggered.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> ---
>  mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e7ce73b2b208..bb2b3646bd78 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	if (prot_numa && pmd_protnone(*pmd))
>  		goto unlock;
>
> -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> +	/*
> +	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
> +	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> +	 * which is also under down_read(mmap_sem):
> +	 *
> +	 *	CPU0:				CPU1:
> +	 *				change_huge_pmd(prot_numa=1)
> +	 *				 pmdp_huge_get_and_clear_notify()
> +	 * madvise_dontneed()
> +	 *  zap_pmd_range()
> +	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
> +	 *   // skip the pmd
> +	 *				 set_pmd_at();
> +	 *				 // pmd is re-established
> +	 *
> +	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> +	 * which may break userspace.
> +	 *
> +	 * pmdp_invalidate() is required to make sure we don't miss
> +	 * dirty/young flags set by hardware.
> +	 */
> +	entry = *pmd;
> +	pmdp_invalidate(vma, addr, pmd);
> +
> +	/*
> +	 * Recover dirty/young flags. It relies on pmdp_invalidate to not
> +	 * corrupt them.
> +	 */

pmdp_invalidate() does:

	pmd_t entry = *pmdp;
	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));

so it's not atomic, and if the CPU sets the dirty or accessed bit in the
middle of this, won't they be lost? But I don't see how the other invalidate
caller, __split_huge_pmd_locked(), deals with this either. Andrea, any idea?

Vlastimil

> +	if (pmd_dirty(*pmd))
> +		entry = pmd_mkdirty(entry);
> +	if (pmd_young(*pmd))
> +		entry = pmd_mkyoung(entry);
> +
>  	entry = pmd_modify(entry, newprot);
>  	if (preserve_write)
>  		entry = pmd_mk_savedwrite(entry);