Hi Lance,

Along with Ryan, David, Baolin, and anyone else who might be interested,

We've noticed an unexpectedly high number of deferred splits. The root
cause appears to be the changes introduced in commit dce7d10be4bbd3
("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
that commit, split_folio() is no longer called in mm/madvise.c. However,
we still perform deferred_split_folio() for all MADV_FREE mTHPs, even
for those that are fully mapped and aligned with the mTHP size and thus
never need to be split.

This happens because try_to_unmap_one() takes the "goto discard" path
below once per PTE. When we unmap the first PTE, folio_remove_rmap_pte()
sees a folio that is still partially mapped and therefore adds it to the
deferred split queue:

discard:
	if (unlikely(folio_test_hugetlb(folio)))
		hugetlb_remove_rmap(folio);
	else
		folio_remove_rmap_pte(folio, subpage, vma);

This can also race with the shrinker, deferred_split_scan(). The
shrinker may call folio_try_get(folio) while we are still scanning the
second PTE of this folio in try_to_unmap_one(); the extra reference
makes the ref_count == 1 + map_count check below fail, so the entire
mTHP is transitioned back to swap-backed instead of being discarded:

	/*
	 * The only page refs must be one from isolation
	 * plus the rmap(s) (dropped by discard:).
	 */
	if (ref_count == 1 + map_count &&
	    (!folio_test_dirty(folio) ||
	     ...
	     (vma->vm_flags & VM_DROPPABLE))) {
		dec_mm_counter(mm, MM_ANONPAGES);
		goto discard;
	}

It also significantly increases contention on
ds_queue->split_queue_lock during memory reclamation, and could
potentially introduce other races with the shrinker as well.

I'm curious whether anyone has suggestions for resolving this issue. My
idea is to use folio_remove_rmap_ptes() to drop all PTEs of an mTHP at
once, rather than folio_remove_rmap_pte(), which processes them one by
one. This approach would require some further changes, such as checking
the dirty state of all the PTEs and performing a single TLB flush for
the entire mTHP in try_to_unmap_one(). A rough, untested sketch of the
idea is appended after my signature.

Please let me know if you have any objections or alternative
suggestions.

Thanks
Barry
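---

For concreteness, here is a rough and completely untested sketch of the
batched path I have in mind. The helper name
try_to_unmap_lazyfree_mthp() is made up, and I'm assuming we are sitting
on the first PTE of a fully and contiguously mapped lazyfree (anon,
!swapbacked) mTHP inside try_to_unmap_one(), with the PTL held by
page_vma_mapped_walk(); mmu notifiers, deferred TLB flushing, and
partial mappings are all ignored here:

/*
 * Untested sketch only. Assumes folio is a PTE-mapped lazyfree mTHP,
 * fully mapped at a naturally aligned address, and that ptep points
 * at the PTE of the folio's first subpage.
 */
static bool try_to_unmap_lazyfree_mthp(struct folio *folio,
				       struct vm_area_struct *vma,
				       unsigned long addr, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned int nr = folio_nr_pages(folio);
	pte_t pteval;

	/*
	 * Clear all nr PTEs in one shot. get_and_clear_full_ptes()
	 * accumulates the dirty/accessed bits of the whole batch into
	 * the returned PTE value.
	 */
	pteval = get_and_clear_full_ptes(mm, addr, ptep, nr, 0);

	/*
	 * If any subpage was redirtied after MADV_FREE, the folio must
	 * not be discarded; put the PTEs back (set_ptes() advances the
	 * PFN for each consecutive entry) and let it take the swap
	 * path instead.
	 */
	if (pte_dirty(pteval)) {
		set_ptes(mm, addr, ptep, pteval, nr);
		return false;
	}

	/* One TLB flush for the whole mTHP instead of nr flushes. */
	flush_tlb_range(vma, addr, addr + nr * PAGE_SIZE);

	add_mm_counter(mm, MM_ANONPAGES, -nr);

	/*
	 * Drop all the rmaps at once so the folio never appears
	 * partially mapped and thus never hits deferred_split_folio()
	 * via folio_remove_rmap_pte().
	 */
	folio_remove_rmap_ptes(folio, folio_page(folio, 0), nr, vma);

	return true;
}

With something along these lines, the ref_count == 1 + map_count check
would also be evaluated once for the whole folio rather than once per
PTE, which should close the window the shrinker currently races with.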