On Oct 31, 2022, at 3:34 PM, Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:

> madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the page
> tables associated with the address range. For hugetlb vmas,
> zap_page_range will call __unmap_hugepage_range_final. However,
> __unmap_hugepage_range_final assumes the passed vma is about to be removed
> and deletes the vma_lock to prevent pmd sharing as the vma is on the way
> out. In the case of madvise(MADV_DONTNEED) the vma remains, but the
> missing vma_lock prevents pmd sharing and could potentially lead to issues
> with truncation/fault races.
>

[snip]

> index 978c17df053e..517c8cc8ccb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3464,4 +3464,7 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   */
>  #define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
>
> +/* Set in unmap_vmas() to indicate an unmap call. Only used by hugetlb */
> +#define ZAP_FLAG_UNMAP ((__force zap_flags_t) BIT(1))

PeterZ wants to add ZAP_FLAG_FORCE_FLUSH that would be set on
zap_pte_range(). Not sure you would want to combine them both together, but
at least be aware of potential conflict.

https://lore.kernel.org/all/Y1f7YvKuwOl1XEwU@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

[snip]

> +#ifdef CONFIG_ADVISE_SYSCALLS
> +/*
> + * Similar setup as in zap_page_range(). madvise(MADV_DONTNEED) can not call
> + * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
> + * the associated vma_lock.
> + */
> +void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
> +				unsigned long end)
> +{
> +	struct mmu_notifier_range range;
> +	struct mmu_gather tlb;
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				start, end);
> +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> +	tlb_gather_mmu(&tlb, vma->vm_mm);
> +	update_hiwater_rss(vma->vm_mm);
> +	mmu_notifier_invalidate_range_start(&range);
> +
> +	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0);
> +
> +	mmu_notifier_invalidate_range_end(&range);
> +	tlb_finish_mmu(&tlb);
> }
> +#endif

I hate ifdefs. And the second definition of clear_hugetlb_page_range() is
confusing since it does not have an ifdef at all. How about moving the
ifdefs into the function, as is done in io_madvise_prep()? I think that
would be less confusing.

[ snip ]

>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1671,7 +1671,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>  {
>  	struct mmu_notifier_range range;
>  	struct zap_details details = {
> -		.zap_flags = ZAP_FLAG_DROP_MARKER,
> +		.zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP,
>  		/* Careful - we need to zap private pages too! */
>  		.even_cows = true,
>  	};
> @@ -1704,15 +1704,21 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
>
>  	lru_add_drain();
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> -				start, start + size);
>  	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	update_hiwater_rss(vma->vm_mm);
> -	mmu_notifier_invalidate_range_start(&range);
>  	do {
> -		unmap_single_vma(&tlb, vma, start, range.end, NULL);
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
> +				vma->vm_mm,
> +				max(start, vma->vm_start),
> +				min(start + size, vma->vm_end));
> +		if (is_vm_hugetlb_page(vma))
> +			adjust_range_if_pmd_sharing_possible(vma,
> +						&range.start,
> +						&range.end);
> +		mmu_notifier_invalidate_range_start(&range);
> +		unmap_single_vma(&tlb, vma, start, start + size, NULL);

Is there a reason that you wouldn’t use range.start and range.end here? At
least for consistency.
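
To make the question concrete, here is a rough userspace sketch (not kernel code, untested against the patch) of the per-vma clamping that the max()/min() arguments in the hunk perform. struct toy_vma, struct toy_range, clamp_to_vma() and range_is_empty() are made-up names for illustration only; they just mirror the vm_start/vm_end and range.start/range.end fields seen above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two vm_area_struct fields used here. */
struct toy_vma {
	unsigned long vm_start;	/* first address covered by the vma */
	unsigned long vm_end;	/* first address past the end of the vma */
};

/* Hypothetical stand-in for the start/end pair of mmu_notifier_range. */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Clamp the request [start, start + size) to a single vma, mirroring
 * max(start, vma->vm_start) and min(start + size, vma->vm_end) from
 * the hunk.
 */
struct toy_range clamp_to_vma(const struct toy_vma *vma,
			      unsigned long start, unsigned long size)
{
	unsigned long end = start + size;
	struct toy_range range;

	/* Assumption of this sketch: the vma itself is non-empty. */
	assert(vma->vm_start < vma->vm_end);

	range.start = start > vma->vm_start ? start : vma->vm_start;
	range.end = end < vma->vm_end ? end : vma->vm_end;
	return range;
}

/* True when the request does not overlap the vma at all. */
bool range_is_empty(const struct toy_range *range)
{
	return range->start >= range->end;
}
```

With a vma covering [0x1000, 0x5000) and a request of start 0x0, size 0x3000, the clamped range is [0x1000, 0x3000), whereas start and start + size would still describe [0x0, 0x3000). The sketch leaves out adjust_range_if_pmd_sharing_possible(), which the hunk calls afterwards for hugetlb vmas with pointers to range.start and range.end, so the values seen by the notifier may differ again from the plain clamp.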