The patch titled
     Subject: mm: set folio swapbacked iff folios are dirty in try_to_unmap_one
has been added to the -mm mm-unstable branch.  Its filename is
     mm-set-folio-swapbacked-iff-folios-are-dirty-in-try_to_unmap_one.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-set-folio-swapbacked-iff-folios-are-dirty-in-try_to_unmap_one.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Barry Song <v-songbaohua@xxxxxxxx>
Subject: mm: set folio swapbacked iff folios are dirty in try_to_unmap_one
Date: Wed, 15 Jan 2025 16:38:05 +1300

Patch series "mm: batched unmap lazyfree large folios during reclamation",
v3.

Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during
shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in
madvise.c.  However, those folios are still added to the deferred_split
list in try_to_unmap_one() because we are unmapping PTEs and removing
rmap entries one by one.

Firstly, this has rendered the following counter somewhat confusing,

  /sys/kernel/mm/transparent_hugepage/hugepages-size/stats/split_deferred

The split_deferred counter was originally designed to track operations
such as partial unmap or madvise of large folios.  However, in practice,
most split_deferred cases arise from memory reclamation of aligned
lazyfree mTHPs, as observed by Tangquan.  This discrepancy has made the
split_deferred counter highly misleading.

Secondly, this approach is slow because it requires iterating through
each PTE and removing the rmap one by one for a large folio.  In fact,
all PTEs of a pte-mapped large folio should be unmapped at once, and the
entire folio should be removed from the rmap as a whole.

Thirdly, it also increases the risk of a race condition where lazyfree
folios are incorrectly set back to swapbacked, as a speculative folio_get
may occur in the shrinker's callback.  deferred_split_scan() might call
folio_try_get(folio) since we have added the folio to the deferred_split
list while removing the rmap for the 1st subpage, and while we are
scanning the 2nd to nr_pages PTEs of this folio in try_to_unmap_one(),
the entire mTHP could be transitioned back to swap-backed because the
reference count is incremented, which can make "ref_count == 1 +
map_count" within try_to_unmap_one() false.

	/*
	 * The only page refs must be one from isolation
	 * plus the rmap(s) (dropped by discard:).
	 */
	if (ref_count == 1 + map_count &&
	    (!folio_test_dirty(folio) ||
	     ...
	     (vma->vm_flags & VM_DROPPABLE))) {
		dec_mm_counter(mm, MM_ANONPAGES);
		goto discard;
	}

This patchset resolves the issue by marking only genuinely dirty folios
as swap-backed, as suggested by David, and by transitioning to batched
unmapping of entire folios in try_to_unmap_one().  Consequently, the
deferred_split count drops to zero, and memory reclamation performance
improves significantly: reclaiming 64KiB lazyfree large folios is now
2.5x faster (the specific data is embedded in the changelog of patch
3/4).
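As a side note on the reference-count arithmetic above, here is a minimal
user-space sketch (struct and helper names are hypothetical stand-ins, not
the kernel's struct folio or its API): while subpages of an isolated
lazyfree folio are still mapped, the expected count is exactly one
isolation reference plus one reference per remaining mapping, so a single
transient folio_try_get() from the shrinker is enough to break the
equality.

	/*
	 * User-space sketch only: models the refcount arithmetic described
	 * above.  "folio_model" and its fields are hypothetical stand-ins,
	 * not kernel types.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct folio_model {
		int ref_count;	/* isolation ref + one ref per mapping + transient refs */
		int map_count;	/* remaining PTE mappings (rmaps) */
	};

	/* The invariant the reclaim path checks before discarding a lazyfree folio. */
	static bool only_expected_refs(const struct folio_model *f)
	{
		return f->ref_count == 1 + f->map_count;
	}

	int main(void)
	{
		/* Isolated 64KiB mTHP with 16 subpages still mapped: 1 + 16 references. */
		struct folio_model f = { .ref_count = 17, .map_count = 16 };

		printf("clean, no extra ref: %s\n",
		       only_expected_refs(&f) ? "discardable" : "kept");

		/* A speculative reference, e.g. from deferred_split_scan(), is taken. */
		f.ref_count++;
		printf("with shrinker ref:   %s\n",
		       only_expected_refs(&f) ? "discardable" : "kept");
		return 0;
	}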
By the way, while the patchset is primarily aimed at PTE-mapped large
folios, Baolin and Lance also found that try_to_unmap_one() handles
lazyfree redirtied PMD-mapped large folios inefficiently: it splits the
PMD into PTEs and iterates over them.  This patchset removes the
unnecessary splitting, enabling us to skip redirtied PMD-mapped large
folios 3.5x faster during memory reclamation.  (The specific data can be
found in the changelog of patch 4/4.)


This patch (of 4):

The refcount may be temporarily or long-term increased, but this does not
change the fundamental nature of the folio already being lazy-freed.
Therefore, we only reset 'swapbacked' when we are certain the folio is
dirty and not droppable.
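As an illustration of the reordered decision (a standalone sketch with
hypothetical names, not the kernel code; the authoritative version is the
diff below): only a genuinely dirty, non-droppable folio is re-marked
swap-backed, while an unexpected extra reference merely defers reclaim
without touching the swapbacked flag.

	/* Standalone sketch of the new decision order; hypothetical names only. */
	#include <stdbool.h>
	#include <stdio.h>

	enum lazyfree_action {
		LF_DISCARD,			/* clean, only expected refs: free it */
		LF_SET_SWAPBACKED_AND_ABORT,	/* genuinely redirtied */
		LF_ABORT_AND_RETRY_LATER,	/* e.g. speculative shrinker/GUP ref */
	};

	static enum lazyfree_action lazyfree_decision(bool dirty, bool vm_droppable,
						      int ref_count, int map_count)
	{
		if (dirty && !vm_droppable)
			return LF_SET_SWAPBACKED_AND_ABORT;
		if (ref_count != 1 + map_count)
			return LF_ABORT_AND_RETRY_LATER;
		return LF_DISCARD;
	}

	int main(void)
	{
		/* Clean folio holding one speculative reference: retried, not re-marked. */
		printf("%d\n", lazyfree_decision(false, false, 1 + 1 + 1, 1));
		return 0;
	}

The behavioural change mirrored here is that the extra-reference branch no
longer sets swapbacked, so a speculative reference can only delay reclaim,
not permanently undo MADV_FREE.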
Link: https://lkml.kernel.org/r/20250115033808.40641-1-21cnbao@xxxxxxxxx
Link: https://lkml.kernel.org/r/20250115033808.40641-2-21cnbao@xxxxxxxxx
Fixes: 6c8e2a256915 ("mm: fix race between MADV_FREE reclaim and blkdev direct IO read")
Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
Acked-by: David Hildenbrand <david@xxxxxxxxxx>
Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
Reviewed-by: Lance Yang <ioworker0@xxxxxxxxx>
Cc: Mauricio Faria de Oliveira <mfo@xxxxxxxxxxxxx>
Cc: Chris Li <chrisl@xxxxxxxxxx> (Google)
Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
Cc: Kairui Song <kasong@xxxxxxxxxxx>
Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
Cc: Tangquan Zheng <zhengtangquan@xxxxxxxx>
Cc: Albert Ou <aou@xxxxxxxxxxxxxxxxx>
Cc: Anshuman Khandual <anshuman.khandual@xxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Gavin Shan <gshan@xxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Kefeng Wang <wangkefeng.wang@xxxxxxxxxx>
Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Mark Rutland <mark.rutland@xxxxxxx>
Cc: Palmer Dabbelt <palmer@xxxxxxxxxxx>
Cc: Paul Walmsley <paul.walmsley@xxxxxxxxxx>
Cc: Shaoqin Huang <shahuang@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/rmap.c |   49 ++++++++++++++++++++++---------------------------
 1 file changed, 22 insertions(+), 27 deletions(-)

--- a/mm/rmap.c~mm-set-folio-swapbacked-iff-folios-are-dirty-in-try_to_unmap_one
+++ a/mm/rmap.c
@@ -1868,34 +1868,29 @@ static bool try_to_unmap_one(struct foli
 				 */
 				smp_rmb();
 
-				/*
-				 * The only page refs must be one from isolation
-				 * plus the rmap(s) (dropped by discard:).
-				 */
-				if (ref_count == 1 + map_count &&
-				    (!folio_test_dirty(folio) ||
-				     /*
-				      * Unlike MADV_FREE mappings, VM_DROPPABLE
-				      * ones can be dropped even if they've
-				      * been dirtied.
-				      */
-				     (vma->vm_flags & VM_DROPPABLE))) {
-					dec_mm_counter(mm, MM_ANONPAGES);
-					goto discard;
-				}
-
-				/*
-				 * If the folio was redirtied, it cannot be
-				 * discarded. Remap the page to page table.
-				 */
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				/*
-				 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
-				 * never get swap backed on failure to drop.
-				 */
-				if (!(vma->vm_flags & VM_DROPPABLE))
+				if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+					/*
+					 * redirtied either using the page table or a previously
+					 * obtained GUP reference.
+					 */
+					set_pte_at(mm, address, pvmw.pte, pteval);
 					folio_set_swapbacked(folio);
-				goto walk_abort;
+					goto walk_abort;
+				} else if (ref_count != 1 + map_count) {
+					/*
+					 * Additional reference. Could be a GUP reference or any
+					 * speculative reference. GUP users must mark the folio
+					 * dirty if there was a modification. This folio cannot be
+					 * reclaimed right now either way, so act just like nothing
+					 * happened.
+					 * We'll come back here later and detect if the folio was
+					 * dirtied when the additional reference is gone.
+					 */
+					set_pte_at(mm, address, pvmw.pte, pteval);
+					goto walk_abort;
+				}
+				dec_mm_counter(mm, MM_ANONPAGES);
+				goto discard;
 			}
 
 			if (swap_duplicate(entry) < 0) {
_

Patches currently in -mm which might be from v-songbaohua@xxxxxxxx are

mm-set-folio-swapbacked-iff-folios-are-dirty-in-try_to_unmap_one.patch
mm-support-tlbbatch-flush-for-a-range-of-ptes.patch
mm-support-batched-unmap-for-lazyfree-large-folios-during-reclamation.patch
mm-avoid-splitting-pmd-for-lazyfree-pmd-mapped-thp-in-try_to_unmap.patch