On Mon, Dec 30, 2024 at 8:52 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 30.12.24 12:54, Barry Song wrote:
> > On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 30.12.24 03:14, Lance Yang wrote:
> >>> Hi Barry,
> >>>
> >>> On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>>>
> >>>> Hi Lance,
> >>>>
> >>>> Along with Ryan, David, Baolin, and anyone else who might be
> >>>> interested,
> >>>>
> >>>> We’ve noticed an unexpectedly high number of deferred splits. The
> >>>> root cause appears to be the changes introduced in commit
> >>>> dce7d10be4bbd3 ("mm/madvise: optimize lazyfreeing with mTHP in
> >>>> madvise_free"). Since that commit, split_folio is no longer called
> >>>> in mm/madvise.c.
> >>
> >> Hi,
> >>
> >> I assume you don't see "deferred splits" at all. You see that a folio
> >> was added to the deferred split queue only to be removed again
> >> immediately as it gets freed. Correct?
> >>
> >>>>
> >>>> However, we are still performing deferred_split_folio for all
> >>>> MADV_FREE mTHPs, even for those that are fully aligned with mTHP.
> >>>> This happens because we execute a goto discard in
> >>>> try_to_unmap_one(), which eventually leads to
> >>>> folio_remove_rmap_pte() adding all folios to deferred_split when we
> >>>> scan the 1st pte in try_to_unmap_one().
> >>>>
> >>>> discard:
> >>>>         if (unlikely(folio_test_hugetlb(folio)))
> >>>>                 hugetlb_remove_rmap(folio);
> >>>>         else
> >>>>                 folio_remove_rmap_pte(folio, subpage, vma);
> >>
> >> Yes, that's kind of known: we neither do PTE batching during unmap
> >> for reclaim nor during unmap for migration. We should add that
> >> support.
> >>
> >> But note, just like I raised earlier in the context of "improved
> >> partial-mapped logic in rmap code when batching", we are primarily
> >> only pleasing counters here.
> >>
> >> See below on the concurrent shrinker.
> >>
> >>>>
> >>>> This could lead to a race condition with the shrinker -
> >>>> deferred_split_scan(). The shrinker might call folio_try_get(folio),
> >>>> and while we are scanning the second PTE of this folio in
> >>>> try_to_unmap_one(), the entire mTHP could be transitioned back to
> >>>> swap-backed because the reference count is incremented.
> >>>>
> >>>>         /*
> >>>>          * The only page refs must be one from isolation
> >>>>          * plus the rmap(s) (dropped by discard:).
> >>>>          */
> >>>>         if (ref_count == 1 + map_count &&
> >>>>             (!folio_test_dirty(folio) ||
> >>>>              ...
> >>>>              (vma->vm_flags & VM_DROPPABLE))) {
> >>>>                 dec_mm_counter(mm, MM_ANONPAGES);
> >>>>                 goto discard;
> >>>>         }
> >>
> >>
> >> Reclaim code holds an additional folio reference and has the folio
> >> locked. So I don't think this race can really happen in the way you
> >> think it could? Please feel free to correct me if I am wrong.
> >
> > try_to_unmap_one will only execute "goto discard" and remove the rmap
> > if ref_count == 1 + map_count. An additional reference taken by the
> > shrinker (ref_count + 1) can invalidate this condition, leading to the
> > PTE being restored and the folio being marked swap-backed.
> >
> >         /*
> >          * The only page refs must be one from isolation
> >          * plus the rmap(s) (dropped by discard:).
> >          */
> >         if (ref_count == 1 + map_count &&
> >             (!folio_test_dirty(folio) ||
> >              /*
> >               * Unlike MADV_FREE mappings, VM_DROPPABLE
> >               * ones can be dropped even if they've
> >               * been dirtied.
> >               */
> >              (vma->vm_flags & VM_DROPPABLE))) {
> >                 dec_mm_counter(mm, MM_ANONPAGES);
> >                 goto discard;
> >         }
> >
> >         /*
> >          * If the folio was redirtied, it cannot be
> >          * discarded. Remap the page to page table.
> >          */
> >         set_pte_at(mm, address, pvmw.pte, pteval);
> >         /*
> >          * Unlike MADV_FREE mappings, VM_DROPPABLE ones
> >          * never get swap backed on failure to drop.
> >          */
> >         if (!(vma->vm_flags & VM_DROPPABLE))
> >                 folio_set_swapbacked(folio);
> >         goto walk_abort;
>
> Ah, that's what you mean. Yes, but the shrinker behaves mostly like
> just any other speculative reference.
>
> So we're not actually handling speculative references here correctly,
> and this issue is not completely shrinker-specific.
>
> Maybe we should be doing something like this?
>
>         /*
>          * Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped
>          * even if they've been dirtied.
>          */
>         if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
>                 /*
>                  * Redirtied either using the page table or a previously
>                  * obtained GUP reference.
>                  */
>                 set_pte_at(mm, address, pvmw.pte, pteval);
>                 folio_set_swapbacked(folio);
>                 goto walk_abort;
>         } else if (ref_count != 1 + map_count) {
>                 /*
>                  * Additional reference. Could be a GUP reference or any
>                  * speculative reference. GUP users must mark the folio
>                  * dirty if there was a modification. This folio cannot
>                  * be reclaimed right now either way, so act just like
>                  * nothing happened. We'll come back here later and
>                  * detect if the folio was dirtied when the additional
>                  * reference is gone.
>                  */
>                 set_pte_at(mm, address, pvmw.pte, pteval);
>                 goto walk_abort;
>         }
>         goto discard;
>
> Probably needs some cleanup of the goto labels.

Nice! It looks simple and easy to understand ;)

It might be reasonable to temporarily ignore the folio and check later
whether it was marked dirty after the reference is released, as the
additional reference could be a GUP or another speculative reference.

Personally, I also prefer this approach.

Thanks,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>
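The ordering David proposes above can be modeled in a few lines of
user-space C. This is only an illustrative sketch: struct folio_model,
decide(), and the action names are hypothetical stand-ins invented for
the example, not kernel APIs; the real logic lives in try_to_unmap_one().

	/*
	 * Minimal user-space model of the decision logic discussed in the
	 * thread. Folio state is reduced to four fields, and a "shrinker
	 * reference" is simply an extra ref count. Hypothetical names only.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct folio_model {
		int ref_count;   /* isolation ref + any speculative refs */
		int map_count;   /* remaining rmap references */
		bool dirty;      /* redirtied since MADV_FREE? */
		bool droppable;  /* models vma->vm_flags & VM_DROPPABLE */
	};

	enum action { DISCARD, ABORT_SWAPBACKED, ABORT_RETRY_LATER };

	/* Dirty check first, then the reference check, as proposed above. */
	static enum action decide(const struct folio_model *f)
	{
		if (f->dirty && !f->droppable)
			return ABORT_SWAPBACKED;   /* truly redirtied: keep it */
		if (f->ref_count != 1 + f->map_count)
			return ABORT_RETRY_LATER;  /* extra ref: re-check later */
		return DISCARD;                    /* clean, unreferenced: drop */
	}

	int main(void)
	{
		/* Clean lazyfree folio, but the shrinker took a reference. */
		struct folio_model f = { .ref_count = 2, .map_count = 0,
					 .dirty = false, .droppable = false };

		printf("action = %d (expect ABORT_RETRY_LATER = %d)\n",
		       decide(&f), ABORT_RETRY_LATER);
		return 0;
	}

The point the thread converges on is visible in decide(): a clean folio
with an unexpected extra reference is neither discarded nor forced back
to swap-backed; it is left untouched so a later reclaim pass can
re-evaluate it once the speculative reference is gone.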