On 30.12.24 03:14, Lance Yang wrote:
Hi Barry,
On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
Hi Lance,
Along with Ryan, David, Baolin, and anyone else who might be interested,
We’ve noticed an unexpectedly high number of deferred splits. The root
cause appears to be the changes introduced in commit dce7d10be4bbd3
("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
that commit, split_folio is no longer called in mm/madvise.c.
Hi,
I assume you don't see "deferred splits" at all. You see that a folio
was added to the deferred split queue to immediately be removed again as
it gets freed. Correct?
However, we are still performing deferred_split_folio for all
MADV_FREE mTHPs, even for those that are fully aligned with mTHP.
This happens because we execute a goto discard in
try_to_unmap_one(), which eventually leads to
folio_remove_rmap_pte() adding all folios to deferred_split when we
scan the 1st pte in try_to_unmap_one().
discard:
if (unlikely(folio_test_hugetlb(folio)))
hugetlb_remove_rmap(folio);
else
folio_remove_rmap_pte(folio, subpage, vma);
Yes, that's kind-of know: we neither do PTE batching during unmap for
reclaim nor during unmap for migration. We should add that support.
But note, just like I raised earlier in the context of similar to
"improved partial-mapped logic in rmap code when batching", we are
primarily only pleasing counters here.
See below on concurrent shrinker.
This could lead to a race condition with shrinker - deferred_split_scan().
The shrinker might call folio_try_get(folio), and while we are scanning
the second PTE of this folio in try_to_unmap_one(), the entire mTHP
could be transitioned back to swap-backed because the reference count
is incremented.
/*
* The only page refs must be one from isolation
* plus the rmap(s) (dropped by discard:).
*/
if (ref_count == 1 + map_count &&
(!folio_test_dirty(folio) ||
...
(vma->vm_flags & VM_DROPPABLE))) {
dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
}
Reclaim code holds an additional folio reference and has the folio
locked. So I don't think this race can really happen in the way you
think it could? Please feel free to correct me if I am wrong.
It also significantly increases contention on ds_queue->split_queue_lock during
memory reclamation and could potentially introduce other race conditions with
shrinker as well.
Good catch!
Call me "skeptical" that this is a big issue, at least regarding the
shrinker, but also regarding actual lock contention. :)
The issue might be less severe than you think: mostly pleasing counters.
But yes, there is room for improvement.
I’m curious if anyone has suggestions for resolving this issue. My
idea is to use
folio_remove_rmap_ptes to drop all PTEs at once, rather than
folio_remove_rmap_pte,
which processes PTEs one by one for an mTHP. This approach would require some
changes, such as checking the dirty state of PTEs and performing a TLB
flush for the
entire mTHP as a whole in try_to_unmap_one().
Yeah, IHMO, it would also be beneficial to reclaim entire mTHPs as a whole
in real-world scenarios where MADV_FREE mTHPs are typically no longer
written ;)
We should be implementing folio batching. But it won't be able to cover
all cases.
In the future, I envision that during reclaim/access bit scanning, we
determine whether a folio is partially mapped and add it to the deferred
split queue. That's one requirement for getting rid of
folio->_nr_page_mapped and reliably detecting all partial mappings, but
it also avoids having to messing with this information whenever we
(temporarily) unmap only parts of a folio, like we have here.
Thanks!
--
Cheers,
David / dhildenb