All MADV_FREE mTHPs are fully subjected to deferred_split_folio()

Hi Lance,

Along with Ryan, David, Baolin, and anyone else who might be interested,

We’ve noticed an unexpectedly high number of deferred splits. The root
cause appears to be the changes introduced in commit dce7d10be4bbd3
("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
that commit, split_folio is no longer called in mm/madvise.c.

However, we still call deferred_split_folio() for all MADV_FREE mTHPs,
even those that are fully mapped and properly aligned, where no split is
ever needed. This happens because try_to_unmap_one() takes the goto
discard path for each PTE, so as soon as the first PTE is unmapped,
folio_remove_rmap_pte() sees a partially mapped folio and adds it to
the deferred-split queue.

discard:
                if (unlikely(folio_test_hugetlb(folio)))
                        hugetlb_remove_rmap(folio);
                else
                        folio_remove_rmap_pte(folio, subpage, vma);

This also opens a race with the shrinker, deferred_split_scan(): the
shrinker may call folio_try_get(folio) and raise the reference count
while we are still scanning the second PTE of the folio in
try_to_unmap_one(); the elevated refcount then makes the
ref_count == 1 + map_count check below fail, and the entire mTHP is
transitioned back to swap-backed.

                                /*
                                 * The only page refs must be one from isolation
                                 * plus the rmap(s) (dropped by discard:).
                                 */
                                if (ref_count == 1 + map_count &&
                                    (!folio_test_dirty(folio) ||
                                     ...
                                     (vma->vm_flags & VM_DROPPABLE))) {
                                        dec_mm_counter(mm, MM_ANONPAGES);
                                        goto discard;
                                }

It also significantly increases contention on ds_queue->split_queue_lock
during memory reclamation and could introduce other races with the
shrinker as well.

I’m curious whether anyone has suggestions for resolving this. My idea
is to call folio_remove_rmap_ptes() to drop all PTEs of an mTHP at
once, rather than folio_remove_rmap_pte(), which processes them one by
one. This would require some changes in try_to_unmap_one(), such as
checking the dirty state of all PTEs and performing a single TLB flush
for the entire mTHP as a whole.
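A rough sketch of what I have in mind (pseudocode, not compilable:
folio_remove_rmap_ptes() and flush_tlb_range() exist today, but
fully_mapped_in_this_vma(), clear_ptes_and_collect_dirty() and the
discard_all label are hypothetical names for the pieces this proposal
would add):

```c
	/* In try_to_unmap_one(), for an anon lazyfree mTHP that is
	 * fully mapped in this VMA:
	 */
	if (folio_test_anon(folio) && !folio_test_swapbacked(folio) &&
	    fully_mapped_in_this_vma(folio, vma)) {
		nr = folio_nr_pages(folio);
		/* Clear all nr PTEs under the PTL, OR-ing their
		 * dirty bits together.
		 */
		any_dirty = clear_ptes_and_collect_dirty(pvmw.pte, nr);
		if (!any_dirty || (vma->vm_flags & VM_DROPPABLE)) {
			/* One TLB flush for the whole mTHP, then drop
			 * all rmaps at once, so the folio never looks
			 * partially mapped and is never queued for
			 * deferred split.
			 */
			flush_tlb_range(vma, address,
					address + nr * PAGE_SIZE);
			folio_remove_rmap_ptes(folio, subpage, nr, vma);
			goto discard_all;
		}
	}
```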

Please let me know if you have any objections or alternative suggestions.

Thanks
Barry




