Hi Barry,

On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> Hi Lance,
>
> Along with Ryan, David, Baolin, and anyone else who might be interested,
>
> We’ve noticed an unexpectedly high number of deferred splits. The root
> cause appears to be the changes introduced in commit dce7d10be4bbd3
> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
> that commit, split_folio is no longer called in mm/madvise.c.
>
> However, we are still performing deferred_split_folio for all
> MADV_FREE mTHPs, even for those that are fully aligned with mTHP.
> This happens because we execute a goto discard in
> try_to_unmap_one(), which eventually leads to
> folio_remove_rmap_pte() adding all folios to deferred_split when we
> scan the 1st pte in try_to_unmap_one().
>
> discard:
>         if (unlikely(folio_test_hugetlb(folio)))
>                 hugetlb_remove_rmap(folio);
>         else
>                 folio_remove_rmap_pte(folio, subpage, vma);
>
> This could lead to a race condition with the shrinker -
> deferred_split_scan(). The shrinker might call folio_try_get(folio),
> and while we are scanning the second PTE of this folio in
> try_to_unmap_one(), the entire mTHP could be transitioned back to
> swap-backed because the reference count is incremented.
>
>                 /*
>                  * The only page refs must be one from isolation
>                  * plus the rmap(s) (dropped by discard:).
>                  */
>                 if (ref_count == 1 + map_count &&
>                     (!folio_test_dirty(folio) ||
>                      ...
>                      (vma->vm_flags & VM_DROPPABLE))) {
>                         dec_mm_counter(mm, MM_ANONPAGES);
>                         goto discard;
>                 }
>
> It also significantly increases contention on
> ds_queue->split_queue_lock during memory reclamation and could
> potentially introduce other race conditions with the shrinker as well.

Good catch!

> I’m curious if anyone has suggestions for resolving this issue. My
> idea is to use folio_remove_rmap_ptes to drop all PTEs at once,
> rather than folio_remove_rmap_pte, which processes PTEs one by one
> for an mTHP.
> This approach would require some changes, such as checking the
> dirty state of PTEs and performing a TLB flush for the entire mTHP
> as a whole in try_to_unmap_one().

Yeah, IMHO, it would also be beneficial to reclaim entire mTHPs as a
whole in real-world scenarios where MADV_FREE mTHPs are typically no
longer written ;)

> Please let me know if you have any objections or alternative suggestions.

Let's hear suggestions from other folks as well ~

Thanks,
Lance

> Thanks
> Barry