On 2/9/22 19:21, Peter Xu wrote:
> (Sorry for the late comment)

Thanks for taking a look.

>
> On Tue, Feb 01, 2022 at 05:40:32PM -0800, Mike Kravetz wrote:
>> MADV_DONTNEED is currently disabled for hugetlb mappings.  This
>> certainly makes sense in shared file mappings as the pagecache maintains
>> a reference to the page and it will never be freed.  However, it could
>> be useful to unmap and free pages in private mappings.
>>
>> The only thing preventing MADV_DONTNEED from working on hugetlb mappings
>> is a check in can_madv_lru_vma().  To allow support for hugetlb mappings
>> create and use a new routine madvise_dontneed_free_valid_vma() that will
>> allow hugetlb mappings.  Also, before calling zap_page_range in the
>> DONTNEED case align start and size to huge page size for hugetlb vmas.
>> madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
>> requires huge page size alignment.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> ---
>>  mm/madvise.c | 24 ++++++++++++++++++++++--
>>  1 file changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 5604064df464..7ae891e030a4 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>>  					unsigned long start, unsigned long end)
>>  {
>> +	/*
>> +	 * start and size (end - start) must be huge page size aligned
>> +	 * for hugetlb vmas.
>> +	 */
>> +	if (is_vm_hugetlb_page(vma)) {
>> +		struct hstate *h = hstate_vma(vma);
>> +
>> +		start = ALIGN_DOWN(start, huge_page_size(h));
>> +		end = ALIGN(end, huge_page_size(h));
>> +	}
>> +
>
> Maybe check the alignment before userfaultfd_remove()?  Otherwise there'll be a
> fake message generated to the tracer.

Yes, we should pass the aligned addresses to userfaultfd_remove.  We will
also need to potentially align again after the call.
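To make the effect of that alignment concrete, here is a small userspace
sketch (not kernel code).  align_down(), align_up() and
widen_to_huge_page() are hypothetical stand-ins for what ALIGN_DOWN() and
ALIGN() do in the hunk above, and any huge page size passed in is an
assumption; in the kernel the value comes from huge_page_size(h).

```c
#include <assert.h>

/* Round x down to a multiple of a.  Only valid for power-of-two a. */
static unsigned long align_down(unsigned long x, unsigned long a)
{
	assert(a && !(a & (a - 1)));
	return x & ~(a - 1);
}

/* Round x up to a multiple of a.  Only valid for power-of-two a. */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	assert(a && !(a & (a - 1)));
	return (x + a - 1) & ~(a - 1);
}

/*
 * Widen [*start, *end) so that both ends sit on a huge page boundary,
 * mirroring the two assignments in the quoted hunk.
 */
static void widen_to_huge_page(unsigned long *start, unsigned long *end,
			       unsigned long hpage_size)
{
	*start = align_down(*start, hpage_size);
	*end = align_up(*end, hpage_size);
}
```

Assuming a 2 MiB huge page (0x200000), a request for [0x201000, 0x202000)
widens to [0x200000, 0x400000), so a single 4 KiB request covers a whole
huge page.  A range that is already aligned is left unchanged.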
>
>>  	zap_page_range(vma, start, end - start);
>>  	return 0;
>>  }
>>
>> +static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
>> +						int behavior)
>> +{
>> +	if (is_vm_hugetlb_page(vma))
>> +		return behavior == MADV_DONTNEED;
>> +	else
>> +		return can_madv_lru_vma(vma);
>> +}
>
> can_madv_lru_vma() will check hugetlb again which looks a bit weird.  Would it
> look better to write it as:
>
> madvise_dontneed_free_valid_vma()
> {
>     return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP));
> }
>
> can_madv_lru_vma()
> {
>     return madvise_dontneed_free_valid_vma() && !is_vm_hugetlb_page(vma);
> }
>
> ?

Yes, that would look better.

>
> Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> this is the only api that can force strip the hugetlb mapped pgtable without
> losing pagecache data.
>

Correct.  However, I do not know if uffd-minor users would ever want to
do this.  Perhaps?

-- 
Mike Kravetz
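
For comparison, the two forms of the check discussed above can be modeled
in userspace as below.  This is only a sketch: the M_* flag values and
the m_behavior enum are made up for the model and are not the kernel's
definitions, and can_madv_lru_vma() is taken in the form suggested in the
review (it rejects locked, PFN-mapped and hugetlb vmas).

```c
#include <assert.h>
#include <stdbool.h>

/* Model-only flag bits; the values are made up, not the kernel's. */
#define M_VM_LOCKED	0x1u
#define M_VM_PFNMAP	0x2u
#define M_VM_HUGETLB	0x4u

enum m_behavior { M_DONTNEED, M_FREE };

static bool m_is_hugetlb(unsigned int flags)
{
	return (flags & M_VM_HUGETLB) != 0;
}

/* Suggested form: only the checks common to all callers. */
static bool suggested_valid_vma(unsigned int flags)
{
	return !(flags & (M_VM_LOCKED | M_VM_PFNMAP));
}

/* Suggested form: the common checks plus "not hugetlb". */
static bool suggested_can_madv_lru_vma(unsigned int flags)
{
	return suggested_valid_vma(flags) && !m_is_hugetlb(flags);
}

/* Form in the posted patch: hugetlb is decided by behavior alone. */
static bool posted_valid_vma(unsigned int flags, enum m_behavior behavior)
{
	if (m_is_hugetlb(flags))
		return behavior == M_DONTNEED;
	return suggested_can_madv_lru_vma(flags);
}

/* Cases where the two forms agree, then two where the model differs. */
static void check_model(void)
{
	assert(posted_valid_vma(0, M_DONTNEED) && suggested_valid_vma(0));
	assert(posted_valid_vma(M_VM_HUGETLB, M_DONTNEED) &&
	       suggested_valid_vma(M_VM_HUGETLB));

	/* hugetlb with the non-DONTNEED behavior */
	assert(!posted_valid_vma(M_VM_HUGETLB, M_FREE) &&
	       suggested_valid_vma(M_VM_HUGETLB));

	/* locked hugetlb with DONTNEED */
	assert(posted_valid_vma(M_VM_HUGETLB | M_VM_LOCKED, M_DONTNEED) &&
	       !suggested_valid_vma(M_VM_HUGETLB | M_VM_LOCKED));
}
```

In this model the two forms agree for ordinary vmas and for an unlocked
hugetlb vma with DONTNEED.  They differ for a hugetlb vma with the other
behavior and for a locked hugetlb vma.  Whether either difference matters
depends on the callers, which are not shown in this message.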