On Thu, Jun 6, 2024 at 12:16 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 05.06.24 17:43, Lance Yang wrote:
> > On Wed, Jun 5, 2024 at 11:03 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 05.06.24 16:57, Lance Yang wrote:
> >>> On Wed, Jun 5, 2024 at 10:39 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 05.06.24 16:28, David Hildenbrand wrote:
> >>>>> On 05.06.24 16:20, Lance Yang wrote:
> >>>>>> Hi David,
> >>>>>>
> >>>>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On 21.05.24 06:02, Lance Yang wrote:
> >>>>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>>>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>>>>>>> split the folio.
> >>>>>>>>
> >>>>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> >>>>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>>>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>>>>>>> being picked up during page reclaim.
> >>>>>>>>
> >>>>>>>> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> >>>>>>>> Suggested-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> >>>>>>>> Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>
> >>>>>>>> ---
> >>>>>>>
> >>>>>>> [...] again, sorry for the late review.
> >>>>>>
> >>>>>> No worries at all, thanks for taking time to review!
> >>>>>>
> >>>>>>>
> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>>>> index ddffa30c79fb..08a93347f283 100644
> >>>>>>>> --- a/mm/rmap.c
> >>>>>>>> +++ b/mm/rmap.c
> >>>>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>>>          if (flags & TTU_SYNC)
> >>>>>>>>                  pvmw.flags = PVMW_SYNC;
> >>>>>>>>
> >>>>>>>> -        if (flags & TTU_SPLIT_HUGE_PMD)
> >>>>>>>> -                split_huge_pmd_address(vma, address, false, folio);
> >>>>>>>> -
> >>>>>>>>          /*
> >>>>>>>>           * For THP, we have to assume the worse case ie pmd for invalidation.
> >>>>>>>>           * For hugetlb, it could be much worse if we need to do pud
> >>>>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>>>          mmu_notifier_invalidate_range_start(&range);
> >>>>>>>>
> >>>>>>>>          while (page_vma_mapped_walk(&pvmw)) {
> >>>>>>>> -                /* Unexpected PMD-mapped THP? */
> >>>>>>>> -                VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>>>>>>> -
> >>>>>>>>                  /*
> >>>>>>>>                   * If the folio is in an mlock()d vma, we must not swap it out.
> >>>>>>>>                   */
> >>>>>>>>                  if (!(flags & TTU_IGNORE_MLOCK) &&
> >>>>>>>>                      (vma->vm_flags & VM_LOCKED)) {
> >>>>>>>>                          /* Restore the mlock which got missed */
> >>>>>>>> -                        if (!folio_test_large(folio))
> >>>>>>>> +                        if (!folio_test_large(folio) ||
> >>>>>>>> +                            (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>>>>                                  mlock_vma_folio(folio, vma);
> >>>>>>>
> >>>>>>> Can you elaborate why you think this would be required? If we would have
> >>>>>>> performed the split_huge_pmd_address() beforehand, we would still be
> >>>>>>> left with a large folio, no?
> >>>>>>
> >>>>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
> >>>>>>
> >>>>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> >>>>>> folio, but there are a few scenarios where we don't mlock a large folio, such
> >>>>>> as when it crosses a VM_LOCKed VMA boundary.
> >>>>>>
> >>>>>> -                        if (!folio_test_large(folio))
> >>>>>> +                        if (!folio_test_large(folio) ||
> >>>>>> +                            (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>>
> >>>>>> And this check is just future-proofing and likely unnecessary. If encountering a
> >>>>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
> >>>>>> THP to prevent it from being picked up during page reclaim, since it is fully
> >>>>>> mapped and doesn't cross the VMA boundary, IIUC.
> >>>>>>
> >>>>>> What do you think?
> >>>>>> I would appreciate any suggestions regarding this check ;)
> >>>>>
> >>>>> Reading this patch only, I wonder if this change makes sense in the
> >>>>> context here.
> >>>>>
> >>>>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
> >>>>> reaching this call and skipped it due to "!folio_test_large(folio)".
> >>>>>
> >>>>> After this patch, we either
> >>>>>
> >>>>> a) PTE-remap the THP after this check, but retry and end-up here again,
> >>>>> whereby we would skip it due to "!folio_test_large(folio)".
> >>>>>
> >>>>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> >>>>> co-exist with mlock and what would be the problem here with mlock?
> >>>>>
> >>>>>
> >>>
> >>> Thanks a lot for clarifying!
> >>>
> >>>>> So if the check is required in this patch, we really have to understand
> >>>>> why. If not, we should better drop it from this patch.
> >>>>>
> >>>>> At least my opinion, still struggling to understand why it would be
> >>>>> required (I have 0 knowledge about mlock interaction with large folios :) ).
> >>>>>
> >>>>
> >>>> Looking at that series, in folio_references_one(), we do
> >>>>
> >>>> if (!folio_test_large(folio) || !pvmw.pte) {
> >>>>         /* Restore the mlock which got missed */
> >>>>         mlock_vma_folio(folio, vma);
> >>>>         page_vma_mapped_walk_done(&pvmw);
> >>>>         pra->vm_flags |= VM_LOCKED;
> >>>>         return false; /* To break the loop */
> >>>> }
> >>>>
> >>>> I wonder if we want that here as well now: in case of lazyfree we
> >>>> would not back off, right?
> >>>>
> >>>> But I'm not sure if lazyfree in mlocked areas are even possible.
> >>>>
> >>>> Adding the "!pvmw.pte" would be much clearer to me than the flag check.
> >>>
> >>> Hmm... How about we drop it from this patch for now, and add it back if needed
> >>> in the future?
> >>
> >> If we can rule out that MADV_FREE + mlock() keeps working as expected in
> >> the PMD-mapped case, we're good.
> >>
> >> Can we rule that out? (especially for MADV_FREE followed by mlock())
> >
> > Perhaps we don't worry about that.
> >
> > IIUC, without that check, MADV_FREE + mlock() still works as expected in
> > the PMD-mapped case, since if encountering a large folio in a VM_LOCKED
> > VMA range, we will stop the page walk immediately.
>
>
> Can you point me at the code (especially considering patch #3?)

Yep, please see my other mail ;)

Thanks,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>