On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 30.12.24 03:14, Lance Yang wrote:
> > Hi Barry,
> >
> > On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>
> >> Hi Lance,
> >>
> >> Along with Ryan, David, Baolin, and anyone else who might be interested,
> >>
> >> We’ve noticed an unexpectedly high number of deferred splits. The root
> >> cause appears to be the changes introduced in commit dce7d10be4bbd3
> >> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
> >> that commit, split_folio is no longer called in mm/madvise.c.
>
> Hi,
>
> I assume you don't see "deferred splits" at all. You see that a folio
> was added to the deferred split queue only to be removed again
> immediately as it gets freed. Correct?
>
> >>
> >> However, we are still performing deferred_split_folio for all
> >> MADV_FREE mTHPs, even for those that are fully mapped and aligned.
> >> This happens because we execute a "goto discard" in
> >> try_to_unmap_one(), which eventually leads to folio_remove_rmap_pte()
> >> adding the folio to the deferred split queue as soon as we scan the
> >> 1st PTE in try_to_unmap_one().
> >>
> >> discard:
> >>         if (unlikely(folio_test_hugetlb(folio)))
> >>                 hugetlb_remove_rmap(folio);
> >>         else
> >>                 folio_remove_rmap_pte(folio, subpage, vma);
>
> Yes, that's kind-of known: we do PTE batching neither during unmap for
> reclaim nor during unmap for migration. We should add that support.
>
> But note, just like I raised earlier in a similar context ("improved
> partial-mapped logic in rmap code when batching"), we are primarily
> only pleasing counters here.
>
> See below on the concurrent shrinker.
>
> >>
> >> This could lead to a race condition with the shrinker,
> >> deferred_split_scan(). The shrinker might call folio_try_get(folio),
> >> and while we are scanning the second PTE of this folio in
> >> try_to_unmap_one(), the entire mTHP could be transitioned back to
> >> swap-backed because the reference count is incremented.
> >>
> >>         /*
> >>          * The only page refs must be one from isolation
> >>          * plus the rmap(s) (dropped by discard:).
> >>          */
> >>         if (ref_count == 1 + map_count &&
> >>             (!folio_test_dirty(folio) ||
> >>              ...
> >>              (vma->vm_flags & VM_DROPPABLE))) {
> >>                 dec_mm_counter(mm, MM_ANONPAGES);
> >>                 goto discard;
> >>         }
>
>
> Reclaim code holds an additional folio reference and has the folio
> locked. So I don't think this race can really happen in the way you
> think it could? Please feel free to correct me if I am wrong.

try_to_unmap_one() will only execute "goto discard" and remove the rmap
if ref_count == 1 + map_count. An additional reference taken by the
shrinker invalidates this condition, so instead of being discarded the
PTE is restored and the folio is set swap-backed again.

		/*
		 * The only page refs must be one from isolation
		 * plus the rmap(s) (dropped by discard:).
		 */
		if (ref_count == 1 + map_count &&
		    (!folio_test_dirty(folio) ||
		     /*
		      * Unlike MADV_FREE mappings, VM_DROPPABLE
		      * ones can be dropped even if they've
		      * been dirtied.
		      */
		     (vma->vm_flags & VM_DROPPABLE))) {
			dec_mm_counter(mm, MM_ANONPAGES);
			goto discard;
		}

		/*
		 * If the folio was redirtied, it cannot be
		 * discarded. Remap the page to page table.
		 */
		set_pte_at(mm, address, pvmw.pte, pteval);

		/*
		 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
		 * never get swap backed on failure to drop.
		 */
		if (!(vma->vm_flags & VM_DROPPABLE))
			folio_set_swapbacked(folio);
		goto walk_abort;
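
To spell out my understanding with concrete numbers (an illustrative
order-2 mTHP, 4 PTEs, fully mapped, isolated and locked by reclaim):

	before unmapping:  ref_count == 1 (isolation) + 4 (mappings) == 5, map_count == 4
	PTE #0:            5 == 1 + 4  -> goto discard, rmap and ref dropped
	                   (now ref_count == 4, map_count == 3)
	shrinker:          folio_try_get()  -> ref_count == 5
	PTE #1:            5 != 1 + 3  -> set_pte_at() + folio_set_swapbacked(),
	                   goto walk_abort

So the shrinker does not need the folio lock to break the check; the
transient reference it takes on the (now queued) folio is enough.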

> >>
> >> It also significantly increases contention on ds_queue->split_queue_lock
> >> during memory reclamation and could potentially introduce other race
> >> conditions with the shrinker as well.
> >
> > Good catch!
> >
>
> Call me "skeptical" that this is a big issue, at least regarding the
> shrinker, but also regarding actual lock contention. :)
>
> The issue might be less severe than you think: mostly pleasing counters.
> But yes, there is room for improvement.
>
> >>
> >> I’m curious if anyone has suggestions for resolving this issue. My idea
> >> is to use folio_remove_rmap_ptes to drop all PTEs at once, rather than
> >> folio_remove_rmap_pte, which processes PTEs one by one for an mTHP.
> >> This approach would require some changes, such as checking the dirty
> >> state of the PTEs and performing a TLB flush for the entire mTHP as a
> >> whole in try_to_unmap_one().
> >
> > Yeah, IMHO, it would also be beneficial to reclaim entire mTHPs as a whole
> > in real-world scenarios where MADV_FREE mTHPs are typically no longer
> > written ;)
> >
>
> We should be implementing folio batching. But it won't be able to cover
> all cases.
>
> In the future, I envision that during reclaim/access-bit scanning, we
> determine whether a folio is partially mapped and add it to the deferred
> split queue there. That's one requirement for getting rid of
> folio->_nr_pages_mapped and reliably detecting all partial mappings, but
> it also avoids having to mess with this information whenever we
> (temporarily) unmap only parts of a folio, like we have here.
>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>
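
Just to make the direction a bit more concrete for myself, below is a
rough, untested sketch (against no particular tree; the alignment and
contiguity checks, the interaction with TLB-flush batching and the
restore path are all omitted, and some details are surely wrong) of how
the lazyfree path in try_to_unmap_one() could handle a fully-mapped
mTHP in one go, assuming helpers like get_and_clear_full_ptes() and
folio_remove_rmap_ptes() from recent trees:

		/*
		 * Sketch only: assumes the mTHP is fully and contiguously
		 * mapped by 'nr' PTEs of this VMA, with 'address' and
		 * 'pvmw.pte' pointing at the first of them, and that
		 * ref_count/map_count are (re)read as the per-PTE code
		 * does today.
		 */
		int nr = folio_nr_pages(folio);

		/* clear the whole batch; dirty/young bits are accumulated */
		pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr, 0);
		if (pte_dirty(pteval))
			folio_mark_dirty(folio);

		if (ref_count == 1 + map_count &&
		    (!folio_test_dirty(folio) ||
		     (vma->vm_flags & VM_DROPPABLE))) {
			flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
			add_mm_counter(mm, MM_ANONPAGES, -nr);
			/* drop all rmaps and their folio references at once */
			folio_remove_rmap_ptes(folio, subpage, nr, vma);
			folio_put_refs(folio, nr);
			goto walk_done;
		}

		/* otherwise restore the whole batch (set_ptes()) and abort */

That way the folio is never transiently "partially mapped", so it never
touches the deferred split queue (and the shrinker) in the first place.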