On 9 Feb 2017, at 3:16, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:47AM -0500, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
>>
>> If one of the callers of page migration starts to handle thp,
>> memory management code starts to see pmd migration entries, so we
>> need to prepare for that before enabling it. This patch changes
>> various code points which check the status of given pmds, in order
>> to prevent races between thp migration and the pmd-related works.
>>
>> ChangeLog v1 -> v2:
>> - introduce pmd_related() (I know the naming is not good, but I
>>   can't think of a better name. Any suggestion is welcome.)
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
>>
>> ChangeLog v2 -> v3:
>> - add is_swap_pmd()
>> - a pmd entry should be is_swap_pmd(), pmd_trans_huge(),
>>   pmd_devmap(), or pmd_none()
>
> (nitpick) ... or a normal pmd pointing to pte pages?

Sure, I will add it.

>> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
>> - flush_cache_range() while set_pmd_migration_entry()
>> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable()
>>   return true on pmd_migration_entry, so that migration entries are
>>   not treated as pmd page table entries.
>>
>> Signed-off-by: Zi Yan <zi.yan@xxxxxxxxxxxxxx>
>> ---
>>  arch/x86/mm/gup.c             |  4 +--
>>  fs/proc/task_mmu.c            | 22 ++++++++-----
>>  include/asm-generic/pgtable.h | 71 ----------------------------------------
>>  include/linux/huge_mm.h       | 21 ++++++++++--
>>  include/linux/swapops.h       | 74 +++++++++++++++++++++++++++++++++++++++++
>>  mm/gup.c                      | 20 ++++++++++--
>>  mm/huge_memory.c              | 76 ++++++++++++++++++++++++++++++++++-------
>>  mm/madvise.c                  |  2 ++
>>  mm/memcontrol.c               |  2 ++
>>  mm/memory.c                   |  9 +++--
>>  mm/memory_hotplug.c           | 13 +++++++-
>>  mm/mempolicy.c                |  1 +
>>  mm/mprotect.c                 |  6 ++--
>>  mm/mremap.c                   |  2 +-
>>  mm/pagewalk.c                 |  2 ++
>>  15 files changed, 221 insertions(+), 104 deletions(-)
>>
>> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
>> index 0d4fb3ebbbac..78a153d90064 100644
>> --- a/arch/x86/mm/gup.c
>> +++ b/arch/x86/mm/gup.c
>> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>>  		pmd_t pmd = *pmdp;
>>
>>  		next = pmd_addr_end(addr, end);
>> -		if (pmd_none(pmd))
>> +		if (!pmd_present(pmd))
>>  			return 0;
>> -		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
>> +		if (unlikely(pmd_large(pmd))) {
>>  			/*
>>  			 * NUMA hinting faults need to be handled in the GUP
>>  			 * slowpath for accounting purposes and so that they
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 6c07c7813b26..1e64d6898c68 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>
>>  	ptl = pmd_trans_huge_lock(pmd, vma);
>>  	if (ptl) {
>> -		smaps_pmd_entry(pmd, addr, walk);
>> +		if (pmd_present(*pmd))
>> +			smaps_pmd_entry(pmd, addr, walk);
>>  		spin_unlock(ptl);
>>  		return 0;
>>  	}
>> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>>  			goto out;
>>  		}
>>
>> +		if (!pmd_present(*pmd))
>> +			goto out;
>> +
>>  		page = pmd_page(*pmd);
>>
>>  		/* Clear accessed and referenced bits. */
>> @@ -1208,19 +1212,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>  	if (ptl) {
>>  		u64 flags = 0, frame = 0;
>>  		pmd_t pmd = *pmdp;
>> +		struct page *page;
>>
>>  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
>>  			flags |= PM_SOFT_DIRTY;
>>
>> -		/*
>> -		 * Currently pmd for thp is always present because thp
>> -		 * can not be swapped-out, migrated, or HWPOISONed
>> -		 * (split in such cases instead.)
>> -		 * This if-check is just to prepare for future implementation.
>> -		 */
>> -		if (pmd_present(pmd)) {
>> -			struct page *page = pmd_page(pmd);
>> +		if (is_pmd_migration_entry(pmd)) {
>> +			swp_entry_t entry = pmd_to_swp_entry(pmd);
>>
>> +			frame = swp_type(entry) |
>> +				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>> +			page = migration_entry_to_page(entry);
>> +		} else if (pmd_present(pmd)) {
>> +			page = pmd_page(pmd);
>>  			if (page_mapcount(page) == 1)
>>  				flags |= PM_MMAP_EXCLUSIVE;
>>
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> index b71a431ed649..6cf9e9b5a7be 100644
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -726,77 +726,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>>  #ifndef arch_needs_pgtable_deposit
>>  #define arch_needs_pgtable_deposit() (false)
>>  #endif
>> -/*
>> - * This function is meant to be used by sites walking pagetables with
>> - * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
>> - * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
>> - * into a null pmd and the transhuge page fault can convert a null pmd
>> - * into an hugepmd or into a regular pmd (if the hugepage allocation
>> - * fails). While holding the mmap_sem in read mode the pmd becomes
>> - * stable and stops changing under us only if it's not null and not a
>> - * transhuge pmd. When those races occurs and this function makes a
>> - * difference vs the standard pmd_none_or_clear_bad, the result is
>> - * undefined so behaving like if the pmd was none is safe (because it
>> - * can return none anyway). The compiler level barrier() is critically
>> - * important to compute the two checks atomically on the same pmdval.
>> - *
>> - * For 32bit kernels with a 64bit large pmd_t this automatically takes
>> - * care of reading the pmd atomically to avoid SMP race conditions
>> - * against pmd_populate() when the mmap_sem is hold for reading by the
>> - * caller (a special atomic read not done by "gcc" as in the generic
>> - * version above, is also needed when THP is disabled because the page
>> - * fault can populate the pmd from under us).
>> - */
>> -static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>> -{
>> -	pmd_t pmdval = pmd_read_atomic(pmd);
>> -	/*
>> -	 * The barrier will stabilize the pmdval in a register or on
>> -	 * the stack so that it will stop changing under the code.
>> -	 *
>> -	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
>> -	 * pmd_read_atomic is allowed to return a not atomic pmdval
>> -	 * (for example pointing to an hugepage that has never been
>> -	 * mapped in the pmd). The below checks will only care about
>> -	 * the low part of the pmd with 32bit PAE x86 anyway, with the
>> -	 * exception of pmd_none(). So the important thing is that if
>> -	 * the low part of the pmd is found null, the high part will
>> -	 * be also null or the pmd_none() check below would be
>> -	 * confused.
>> -	 */
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	barrier();
>> -#endif
>> -	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
>> -		return 1;
>> -	if (unlikely(pmd_bad(pmdval))) {
>> -		pmd_clear_bad(pmd);
>> -		return 1;
>> -	}
>> -	return 0;
>> -}
>> -
>> -/*
>> - * This is a noop if Transparent Hugepage Support is not built into
>> - * the kernel. Otherwise it is equivalent to
>> - * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
>> - * places that already verified the pmd is not none and they want to
>> - * walk ptes while holding the mmap sem in read mode (write mode don't
>> - * need this). If THP is not enabled, the pmd can't go away under the
>> - * code even if MADV_DONTNEED runs, but if THP is enabled we need to
>> - * run a pmd_trans_unstable before walking the ptes after
>> - * split_huge_page_pmd returns (because it may have run when the pmd
>> - * become null, but then a page fault can map in a THP and not a
>> - * regular page).
>> - */
>> -static inline int pmd_trans_unstable(pmd_t *pmd)
>> -{
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	return pmd_none_or_trans_huge_or_clear_bad(pmd);
>> -#else
>> -	return 0;
>> -#endif
>> -}
>>
>>  #ifndef CONFIG_NUMA_BALANCING
>>  /*
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 83a8d42f9d55..c2e5a4eab84a 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -131,7 +131,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  #define split_huge_pmd(__vma, __pmd, __address)			\
>>  	do {							\
>>  		pmd_t *____pmd = (__pmd);			\
>> -		if (pmd_trans_huge(*____pmd)			\
>> +		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
>>  					|| pmd_devmap(*____pmd))	\
>>  			__split_huge_pmd(__vma, __pmd, __address,	\
>>  						false, NULL);		\
>> @@ -162,12 +162,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma);
>>  extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
>>  		struct vm_area_struct *vma);
>> +
>> +static inline int is_swap_pmd(pmd_t pmd)
>> +{
>> +	return !pmd_none(pmd) && !pmd_present(pmd);
>> +}
>> +
>>  /* mmap_sem must be held on entry */
>>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma)
>>  {
>>  	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
>> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>> +	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>>  		return __pmd_trans_huge_lock(pmd, vma);
>>  	else
>>  		return NULL;
>> @@ -192,6 +198,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  		pmd_t *pmd, int flags);
>>  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  		pud_t *pud, int flags);
>> +static inline int hpage_order(struct page *page)
>> +{
>> +	if (unlikely(PageTransHuge(page)))
>> +		return HPAGE_PMD_ORDER;
>> +	return 0;
>> +}
>>
>>  extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>>
>> @@ -232,6 +244,7 @@ static inline bool thp_migration_supported(void)
>>  #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
>>
>>  #define hpage_nr_pages(x) 1
>> +#define hpage_order(x) 0
>>
>>  #define transparent_hugepage_enabled(__vma) 0
>>
>> @@ -274,6 +287,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>  					 long adjust_next)
>>  {
>>  }
>> +static inline int is_swap_pmd(pmd_t pmd)
>> +{
>> +	return 0;
>> +}
>>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma)
>>  {
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 6625bea13869..50e4aa7e7ff9 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -229,6 +229,80 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>  }
>>  #endif
>>
>> +/*
>> + * This function is meant to be used by sites walking pagetables with
>> + * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
>> + * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
>> + * into a null pmd and the transhuge page fault can convert a null pmd
>> + * into an hugepmd or into a regular pmd (if the hugepage allocation
>> + * fails). While holding the mmap_sem in read mode the pmd becomes
>> + * stable and stops changing under us only if it's not null and not a
>> + * transhuge pmd. When those races occurs and this function makes a
>> + * difference vs the standard pmd_none_or_clear_bad, the result is
>> + * undefined so behaving like if the pmd was none is safe (because it
>> + * can return none anyway). The compiler level barrier() is critically
>> + * important to compute the two checks atomically on the same pmdval.
>> + *
>> + * For 32bit kernels with a 64bit large pmd_t this automatically takes
>> + * care of reading the pmd atomically to avoid SMP race conditions
>> + * against pmd_populate() when the mmap_sem is hold for reading by the
>> + * caller (a special atomic read not done by "gcc" as in the generic
>> + * version above, is also needed when THP is disabled because the page
>> + * fault can populate the pmd from under us).
>> + */
>> +static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>> +{
>> +	pmd_t pmdval = pmd_read_atomic(pmd);
>> +	/*
>> +	 * The barrier will stabilize the pmdval in a register or on
>> +	 * the stack so that it will stop changing under the code.
>> +	 *
>> +	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
>> +	 * pmd_read_atomic is allowed to return a not atomic pmdval
>> +	 * (for example pointing to an hugepage that has never been
>> +	 * mapped in the pmd). The below checks will only care about
>> +	 * the low part of the pmd with 32bit PAE x86 anyway, with the
>> +	 * exception of pmd_none(). So the important thing is that if
>> +	 * the low part of the pmd is found null, the high part will
>> +	 * be also null or the pmd_none() check below would be
>> +	 * confused.
>> +	 */
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	barrier();
>> +#endif
>> +	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
>> +			|| is_pmd_migration_entry(pmdval))
>> +		return 1;
>> +	if (unlikely(pmd_bad(pmdval))) {
>> +		pmd_clear_bad(pmd);
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * This is a noop if Transparent Hugepage Support is not built into
>> + * the kernel. Otherwise it is equivalent to
>> + * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
>> + * places that already verified the pmd is not none and they want to
>> + * walk ptes while holding the mmap sem in read mode (write mode don't
>> + * need this). If THP is not enabled, the pmd can't go away under the
>> + * code even if MADV_DONTNEED runs, but if THP is enabled we need to
>> + * run a pmd_trans_unstable before walking the ptes after
>> + * split_huge_page_pmd returns (because it may have run when the pmd
>> + * become null, but then a page fault can map in a THP and not a
>> + * regular page).
>> + */
>> +static inline int pmd_trans_unstable(pmd_t *pmd)
>> +{
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	return pmd_none_or_trans_huge_or_clear_bad(pmd);
>> +#else
>> +	return 0;
>> +#endif
>> +}
>> +
>> +
>
> These functions are a page table or thp matter, so putting them in
> swapops.h looks weird to me. Maybe you can avoid this code movement by
> using !pmd_present instead of is_pmd_migration_entry?
> And we have to consider renaming pmd_none_or_trans_huge_or_clear_bad().
> I like a simple name like __pmd_trans_unstable(), but if you have an
> idea, that's great.

Yes, I will move it back. I am not sure it is OK to use only
!pmd_present; we may miss some pmd_bad cases. Kirill and Andrea, can you
give some insight on this?

>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 19b460acb5e1..9cb4c83151a8 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>
> Changes on mm/memory_hotplug.c should be with patch 14/14?
> # If that's right, the definition of hpage_order() should also go to 14/14.

Got it. I will move it.

--
Best Regards
Yan Zi