On 04/12/2017 03:33 PM, Vlastimil Babka wrote: > On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote: >> In case prot_numa, we are under down_read(mmap_sem). It's critical >> to not clear pmd intermittently to avoid race with MADV_DONTNEED >> which is also under down_read(mmap_sem): >> >> CPU0: CPU1: >> change_huge_pmd(prot_numa=1) >> pmdp_huge_get_and_clear_notify() >> madvise_dontneed() >> zap_pmd_range() >> pmd_trans_huge(*pmd) == 0 (without ptl) >> // skip the pmd >> set_pmd_at(); >> // pmd is re-established >> >> The race makes MADV_DONTNEED miss the huge pmd and don't clear it >> which may break userspace. >> >> Found by code analysis, never saw triggered. >> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> >> --- >> mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++- >> 1 file changed, 33 insertions(+), 1 deletion(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index e7ce73b2b208..bb2b3646bd78 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, >> if (prot_numa && pmd_protnone(*pmd)) >> goto unlock; >> >> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd); >> + /* >> + * In case prot_numa, we are under down_read(mmap_sem). It's critical >> + * to not clear pmd intermittently to avoid race with MADV_DONTNEED >> + * which is also under down_read(mmap_sem): >> + * >> + * CPU0: CPU1: >> + * change_huge_pmd(prot_numa=1) >> + * pmdp_huge_get_and_clear_notify() >> + * madvise_dontneed() >> + * zap_pmd_range() >> + * pmd_trans_huge(*pmd) == 0 (without ptl) >> + * // skip the pmd >> + * set_pmd_at(); >> + * // pmd is re-established >> + * >> + * The race makes MADV_DONTNEED miss the huge pmd and don't clear it >> + * which may break userspace. >> + * >> + * pmdp_invalidate() is required to make sure we don't miss >> + * dirty/young flags set by hardware. >> + */ >> + entry = *pmd; >> + pmdp_invalidate(vma, addr, pmd); >> + >> + /* >> + * Recover dirty/young flags. It relies on pmdp_invalidate to not >> + * corrupt them. >> + */ > > pmdp_invalidate() does: > > pmd_t entry = *pmdp; > set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry)); > > so it's not atomic and if CPU sets dirty or accessed in the middle of > this, they will be lost? > > But I don't see how the other invalidate caller > __split_huge_pmd_locked() deals with this either. Andrea, any idea? Looks like we didn't resolve this and meanwhile the patch is in mainline as ced108037c2aa. CC Andy who deals with TLB a lot these days. > Vlastimil > >> + if (pmd_dirty(*pmd)) >> + entry = pmd_mkdirty(entry); >> + if (pmd_young(*pmd)) >> + entry = pmd_mkyoung(entry); >> + >> entry = pmd_modify(entry, newprot); >> if (preserve_write) >> entry = pmd_mk_savedwrite(entry); >> > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@xxxxxxxxx. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>