On 30.12.24 12:54, Barry Song wrote:
On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 30.12.24 03:14, Lance Yang wrote:
Hi Barry,
On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
Hi Lance,
Along with Ryan, David, Baolin, and anyone else who might be interested,
We’ve noticed an unexpectedly high number of deferred splits. The root
cause appears to be the changes introduced in commit dce7d10be4bbd3
("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
that commit, split_folio is no longer called in mm/madvise.c.
Hi,
I assume you don't see "deferred splits" at all. You see that a folio
was added to the deferred split queue to immediately be removed again as
it gets freed. Correct?
However, we are still performing deferred_split_folio for all
MADV_FREE mTHPs, even for those that are fully aligned with mTHP.
This happens because we execute a goto discard in
try_to_unmap_one(), which eventually leads to
folio_remove_rmap_pte() adding all folios to deferred_split when we
scan the 1st pte in try_to_unmap_one().
discard:
if (unlikely(folio_test_hugetlb(folio)))
hugetlb_remove_rmap(folio);
else
folio_remove_rmap_pte(folio, subpage, vma);
Yes, that's kind-of know: we neither do PTE batching during unmap for
reclaim nor during unmap for migration. We should add that support.
But note, just like I raised earlier in the context of similar to
"improved partial-mapped logic in rmap code when batching", we are
primarily only pleasing counters here.
See below on concurrent shrinker.
This could lead to a race condition with shrinker - deferred_split_scan().
The shrinker might call folio_try_get(folio), and while we are scanning
the second PTE of this folio in try_to_unmap_one(), the entire mTHP
could be transitioned back to swap-backed because the reference count
is incremented.
/*
* The only page refs must be one from isolation
* plus the rmap(s) (dropped by discard:).
*/
if (ref_count == 1 + map_count &&
(!folio_test_dirty(folio) ||
...
(vma->vm_flags & VM_DROPPABLE))) {
dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
}
Reclaim code holds an additional folio reference and has the folio
locked. So I don't think this race can really happen in the way you
think it could? Please feel free to correct me if I am wrong.
try_to_unmap_one will only execute "goto discard" and remove the rmap if
ref_count == 1 + map_count. An additional ref_count + 1 from the shrinker
can invalidate this condition, leading to the restoration of the PTE and setting
the folio as swap-backed.
/*
* The only page refs must be one from isolation
* plus the rmap(s) (dropped by discard:).
*/
if (ref_count == 1 + map_count &&
(!folio_test_dirty(folio) ||
/*
* Unlike MADV_FREE mappings, VM_DROPPABLE
* ones can be dropped even if they've
* been dirtied.
*/
(vma->vm_flags & VM_DROPPABLE))) {
dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
}
/*
* If the folio was redirtied, it cannot be
* discarded. Remap the page to page table.
*/
set_pte_at(mm, address, pvmw.pte, pteval);
/*
* Unlike MADV_FREE mappings, VM_DROPPABLE ones
* never get swap backed on failure to drop.
*/
if (!(vma->vm_flags & VM_DROPPABLE))
folio_set_swapbacked(folio);
goto walk_abort;
Ah, that's what you mean. Yes, but the shrinker behaves mostly like just
any other speculative reference.
So we're not actually handling speculative references here correctly, so
this issue is not completely shrinker-specific.
Maybe, we should be doing something like this?
/*
* Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if
* they've been dirtied.
*/
if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
/*
* redirtied either using the page table or a previously
* obtained GUP reference.
*/
set_pte_at(mm, address, pvmw.pte, pteval);
folio_set_swapbacked(folio);
goto walk_abort;
} else if (ref_count != 1 + map_count) {
/*
* Additional reference. Could be a GUP reference or any
* speculative reference. GUP users must mark the folio dirty if
* there was a modification. This folio cannot be reclaimed
* right now either way, so act just like nothing happened.
* We'll come back here later and detect if the folio was
* dirtied when the additional reference is gone.
*/
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
goto discard;
Probably cleaning up goto labels.
--
Cheers,
David / dhildenb