The patch titled
     Subject: thp: fix leak due split_huge_page() vs. exit race
has been added to the -mm tree.  Its filename is
     thp-reintroduce-split_huge_page-fix-4.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/thp-reintroduce-split_huge_page-fix-4.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/thp-reintroduce-split_huge_page-fix-4.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
Subject: thp: fix leak due split_huge_page() vs. exit race

Consider the following race:

        CPU0                            CPU1
shrink_page_list()
  add_to_swap()
    split_huge_page_to_list()
      __split_huge_pmd_locked()
        pmdp_huge_clear_flush_notify()
        // pmd_none() == true
                                        exit_mmap()
                                          unmap_vmas()
                                            zap_pmd_range()
                                              // no action on pmd since pmd_none() == true
        pmd_populate()

As a result, the THP will not be freed.  The leak is detected by check_mm():

	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512

The patch restores the logic the original split_huge_page() had before the
refcounting rework: never have an intermediate pmd_none() == true.

There are a few other places where we do have pmd_none() == true for some
time, but they are safe:

 - __split_huge_zero_page_pmd() is not reachable during exit, since the
   huge zero page is not on the LRU.

 - do_huge_pmd_wp_page() and do_huge_pmd_wp_page_fallback() are also not
   reachable during exit: exit_mmap() and page fault handling for the mm
   are mutually exclusive.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/huge_memory.c |   25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff -puN mm/huge_memory.c~thp-reintroduce-split_huge_page-fix-4 mm/huge_memory.c
--- a/mm/huge_memory.c~thp-reintroduce-split_huge_page-fix-4
+++ a/mm/huge_memory.c
@@ -2802,9 +2802,6 @@ static void __split_huge_pmd_locked(stru
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
 
-	/* leave pmd empty until pte is filled */
-	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
-
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
 
@@ -2854,6 +2851,28 @@ static void __split_huge_pmd_locked(stru
 	}
 
 	smp_wmb(); /* make pte visible before pmd */
+	/*
+	 * Up to this point the pmd is present and huge and userland has the
+	 * whole access to the hugepage during the split (which happens in
+	 * place). If we overwrite the pmd with the not-huge version pointing
+	 * to the pte here (which of course we could if all CPUs were bug
+	 * free), userland could trigger a small page size TLB miss on the
+	 * small sized TLB while the hugepage TLB entry is still established in
+	 * the huge TLB. Some CPU doesn't like that.
+	 * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
+	 * 383 on page 93. Intel should be safe but is also warns that it's
+	 * only safe if the permission and cache attributes of the two entries
+	 * loaded in the two TLB is identical (which should be the case here).
+	 * But it is generally safer to never allow small and huge TLB entries
+	 * for the same virtual address to be loaded simultaneously. So instead
+	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
+	 * current pmd notpresent (atomically because here the pmd_trans_huge
+	 * and pmd_trans_splitting must remain set at all times on the pmd
+	 * until the split is complete for this pmd), then we flush the SMP TLB
+	 * and finally we write the non-huge version of the pmd entry with
+	 * pmd_populate.
+	 */
+	pmdp_invalidate(vma, haddr, pmd);
 	pmd_populate(mm, pmd, pgtable);
 
 	if (freeze) {
_

Patches currently in -mm which might be from kirill.shutemov@xxxxxxxxxxxxxxx are

mm-make-optimistic-check-for-swapin-readahead-fix.patch
mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix.patch
mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-2.patch
mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-3.patch
page-flags-trivial-cleanup-for-pagetrans-helpers.patch
page-flags-move-code-around.patch
page-flags-introduce-page-flags-policies-wrt-compound-pages.patch
page-flags-introduce-page-flags-policies-wrt-compound-pages-fix.patch
page-flags-introduce-page-flags-policies-wrt-compound-pages-fix-fix.patch
page-flags-introduce-page-flags-policies-wrt-compound-pages-fix-3.patch
page-flags-define-pg_locked-behavior-on-compound-pages.patch
page-flags-define-behavior-of-fs-io-related-flags-on-compound-pages.patch
page-flags-define-behavior-of-lru-related-flags-on-compound-pages.patch
page-flags-define-behavior-slb-related-flags-on-compound-pages.patch
page-flags-define-behavior-of-xen-related-flags-on-compound-pages.patch
page-flags-define-pg_reserved-behavior-on-compound-pages.patch
page-flags-define-pg_reserved-behavior-on-compound-pages-fix.patch
page-flags-define-pg_swapbacked-behavior-on-compound-pages.patch
page-flags-define-pg_swapcache-behavior-on-compound-pages.patch
page-flags-define-pg_mlocked-behavior-on-compound-pages.patch
page-flags-define-pg_uncached-behavior-on-compound-pages.patch
page-flags-define-pg_uptodate-behavior-on-compound-pages.patch
page-flags-look-at-head-page-if-the-flag-is-encoded-in-page-mapping.patch
mm-sanitize-page-mapping-for-tail-pages.patch
mm-proc-adjust-pss-calculation.patch
rmap-add-argument-to-charge-compound-page.patch
memcg-adjust-to-support-new-thp-refcounting.patch
mm-thp-adjust-conditions-when-we-can-reuse-the-page-on-wp-fault.patch
mm-adjust-foll_split-for-new-refcounting.patch
mm-handle-pte-mapped-tail-pages-in-gerneric-fast-gup-implementaiton.patch
thp-mlock-do-not-allow-huge-pages-in-mlocked-area.patch
khugepaged-ignore-pmd-tables-with-thp-mapped-with-ptes.patch
thp-rename-split_huge_page_pmd-to-split_huge_pmd.patch
mm-vmstats-new-thp-splitting-event.patch
mm-temporally-mark-thp-broken.patch
thp-drop-all-split_huge_page-related-code.patch
mm-drop-tail-page-refcounting.patch
futex-thp-remove-special-case-for-thp-in-get_futex_key.patch
ksm-prepare-to-new-thp-semantics.patch
mm-thp-remove-compound_lock.patch
arm64-thp-remove-infrastructure-for-handling-splitting-pmds.patch
arm-thp-remove-infrastructure-for-handling-splitting-pmds.patch
mips-thp-remove-infrastructure-for-handling-splitting-pmds.patch
powerpc-thp-remove-infrastructure-for-handling-splitting-pmds.patch
s390-thp-remove-infrastructure-for-handling-splitting-pmds.patch
sparc-thp-remove-infrastructure-for-handling-splitting-pmds.patch
tile-thp-remove-infrastructure-for-handling-splitting-pmds.patch
x86-thp-remove-infrastructure-for-handling-splitting-pmds.patch
mm-thp-remove-infrastructure-for-handling-splitting-pmds.patch
mm-rework-mapcount-accounting-to-enable-4k-mapping-of-thps.patch
mm-rework-mapcount-accounting-to-enable-4k-mapping-of-thps-fix-2.patch
mm-rework-mapcount-accounting-to-enable-4k-mapping-of-thps-fix-3.patch
mm-differentiate-page_mapped-from-page_mapcount-for-compound-pages.patch
mm-numa-skip-pte-mapped-thp-on-numa-fault.patch
thp-implement-split_huge_pmd.patch
thp-add-option-to-setup-migration-entries-during-pmd-split.patch
thp-mm-split_huge_page-caller-need-to-lock-page.patch
thp-reintroduce-split_huge_page.patch
thp-reintroduce-split_huge_page-fix-3.patch
thp-reintroduce-split_huge_page-fix-4.patch
migrate_pages-try-to-split-pages-on-qeueuing.patch
thp-introduce-deferred_split_huge_page.patch
mm-re-enable-thp.patch
thp-update-documentation.patch
thp-allow-mlocked-thp-again.patch
mm-prepare-page_referenced-and-page_idle-to-new-thp-refcounting.patch
mm-prepare-page_referenced-and-page_idle-to-new-thp-refcounting-fix-fix.patch
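
For readers who want the invariant in isolation, here is a minimal userspace
sketch of the property this patch relies on: an exit path that takes no action
on an empty entry will miss the mapping if the split leaves the entry empty
between its two stores, and will not miss it if the entry is only invalidated.
This is a toy model, not kernel code. The names toy_pmd, toy_exit_acts_on()
and toy_split_leaks() are hypothetical, the racing exit path is reduced to a
single check placed inside the window, and locking and TLB flushing are not
modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy states for one entry; these are NOT the kernel's pmd_t encodings. */
enum toy_pmd {
	TOY_PMD_NONE,        /* empty: the exit path takes no action */
	TOY_PMD_HUGE,        /* maps the huge page */
	TOY_PMD_INVALIDATED, /* not present, but still not "none" */
	TOY_PMD_TABLE,       /* points at a table of small-page entries */
};

/* Hypothetical exit path: acts on the entry unless it looks empty. */
static bool toy_exit_acts_on(enum toy_pmd pmd)
{
	return pmd != TOY_PMD_NONE;
}

/*
 * Split one huge entry while the toy exit path inspects it in the window
 * between the two stores. Returns true if the mapping is leaked, i.e. the
 * exit path skipped the entry and a table was installed afterwards.
 */
static bool toy_split_leaks(enum toy_pmd *pmd, bool clear_first)
{
	bool exit_acted;

	assert(*pmd == TOY_PMD_HUGE); /* only a huge entry can be split */

	/* First store: the old code cleared the entry, the patch invalidates it. */
	*pmd = clear_first ? TOY_PMD_NONE : TOY_PMD_INVALIDATED;

	/* The racing exit path looks at the entry here. */
	exit_acted = toy_exit_acts_on(*pmd);

	/* Second store: install the small-page table. */
	*pmd = TOY_PMD_TABLE;

	return !exit_acted;
}
```

In this model, toy_split_leaks(&pmd, true) on a TOY_PMD_HUGE entry reports a
leak, while toy_split_leaks(&pmd, false) does not. The real exit path does
more than a single check, so treat this only as an illustration of why an
intermediate pmd_none() == true is the thing to avoid.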