On 10/25/22 16:37, Rik van Riel wrote:
> Hi Mike,
>
> After getting promising results initially, we discovered there
> is yet another bug left with hugetlbfs MADV_DONTNEED.
>
> This one involves a page fault on a hugetlbfs address, while
> another thread in the same process is in the middle of MADV_DONTNEED
> on that same memory address.
>
> The code in __unmap_hugepage_range() will clear the page table
> entry, and then at some point later the lazy TLB code will
> actually free the huge page back into the hugetlbfs free page
> pool.
>
> Meanwhile, hugetlb_no_page will call alloc_huge_page, and that
> will fail because the code calling __unmap_hugepage_range() has
> not actually returned the page to the free list yet.
>
> The result is that the process gets killed with SIGBUS.

Thanks for the details, Rik. That makes perfect sense.

> I have thought of a few different solutions to this problem, but
> none of them look good:
> - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude
> page faults. This could make MADV_DONTNEED on VMAs with 4kB
> pages unacceptably slow.
> - Some sort of atomic counter, kept by __unmap_hugepage_range(),
> indicating that huge pages may be getting placed in the tlb gather
> and freed later by tlb_finish_mmu(). This would involve changes
> to the MMU gather code, outside of hugetlbfs.
> - Some sort of generation counter that tracks tlb_gather_mmu
> cycles in progress, with the alloc_huge_page failure path
> waiting for all mmu gather operations that started before
> it to finish, before retrying the allocation. This requires
> changes to the generic code, outside of hugetlbfs.
>
> What are the reasonable alternatives here?
>
> Should we see if anybody can come up with a simple solution
> to the problem, or would it be better to just disable
> MADV_DONTNEED on hugetlbfs for now?

I am thinking that we could use the vma_lock to prevent faults on the
vma until the MADV_DONTNEED processing is totally complete. IIUC, it is
the tlb_finish_mmu call that actually drops the ref count on the pages
and returns them to the free list.

Currently, we hold the vma_lock in write mode during unmap, and acquire
it in read mode during faults. However, in the MADV_DONTNEED path we
drop it just before calling tlb_finish_mmu. So, holding it a bit longer
may address this issue rather simply.

Below is another version of the "hugetlb: don't delete vma_lock in
hugetlb MADV_DONTNEED processing" patch. It is very much like the
version in Andrew's tree, but the code has been shuffled a bit to hold
the vma_lock longer in the MADV_DONTNEED path.

Can you take a look and see if that addresses the issue for you?

>From 9a1f047ce42f76b3befd673da5ee2bd49bea6c75 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Date: Tue, 25 Oct 2022 20:02:58 -0700
Subject: [PATCH v4] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
 processing

madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
page tables associated with the address range. For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final. However,
__unmap_hugepage_range_final assumes the passed vma is about to be
removed and deletes the vma_lock to prevent pmd sharing as the vma is
on the way out. In the case of madvise(MADV_DONTNEED) the vma remains,
but the missing vma_lock prevents pmd sharing and could potentially
lead to issues with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.
Prior to the introduction of the hugetlb vma_lock,
__unmap_hugepage_range_final cleared the VM_MAYSHARE flag to prevent
pmd sharing. Subsequent faults on this vma were confused: VM_MAYSHARE
indicates a sharable vma, but it was no longer set, so page_mapping was
not set in new pages added to the page table. This resulted in pages
that appeared anonymous in a VM_SHARED vma and triggered the BUG.

Create a new routine clear_hugetlb_page_range() that can be called from
madvise(MADV_DONTNEED) for hugetlb vmas. It has the same setup as
zap_page_range, but does not delete the vma_lock.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@xxxxxxxxxxxxxx/
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Reported-by: Wei Chen <harperchen1110@xxxxxxxxx>
Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
---
 include/linux/hugetlb.h |  7 +++++++
 mm/hugetlb.c            | 30 ++++++++++++++++++++++++++++++
 mm/madvise.c            |  5 ++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..0246e77be3a3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,6 +158,8 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
+void clear_hugetlb_page_range(struct vm_area_struct *vma,
+			      unsigned long start, unsigned long end);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
@@ -426,6 +428,11 @@ static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	BUG();
 }
 
+static void __maybe_unused clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+}
+
 static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned int flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 931789a8f734..0a52c3938ce9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5221,9 +5221,39 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	 * will be issues when accessed by someone else.
 	 */
 	__hugetlb_vma_unlock_write_free(vma);
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
+}
+
+#ifdef CONFIG_ADVISE_SYSCALLS
+/*
+ * Similar setup as in zap_page_range(). madvise(MADV_DONTNEED) can not call
+ * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
+ * the associated vma_lock.
+ */
+void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end)
+{
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	update_hiwater_rss(vma->vm_mm);
+
+	hugetlb_vma_lock_write(vma);
+	i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	__unmap_hugepage_range(&tlb, vma, start, end, NULL, 0);
 
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
+
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
+
+	hugetlb_vma_unlock_write(vma);
 }
+#endif
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..90577a669635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	if (!is_vm_hugetlb_page(vma))
+		zap_page_range(vma, start, end - start);
+	else
+		clear_hugetlb_page_range(vma, start, end);
 	return 0;
 }
 
-- 
2.37.3
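
For anyone who wants to poke at the race itself, below is a rough
userspace reproducer sketch (my own illustration, not something taken
from this thread). It assumes a 2MB default huge page size, a hugetlb
pool sized to exactly what the mapping needs (e.g. nr_hugepages=1), and
a kernel where MADV_DONTNEED is enabled for hugetlb mappings; whether
the SIGBUS actually triggers depends on timing, so treat it as a sketch
rather than a reliable test case.

#define _GNU_SOURCE
/*
 * Hypothetical reproducer sketch for the MADV_DONTNEED vs. page fault
 * race described above. One thread repeatedly touches a huge page while
 * another repeatedly calls madvise(MADV_DONTNEED) on it. If a fault
 * lands in the window where the page has been unmapped but not yet
 * returned to the hugetlb free pool, alloc_huge_page() can fail and the
 * process is killed with SIGBUS.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumes 2MB default huge pages */

static volatile char *map;

static void *toucher(void *arg)
{
	(void)arg;
	for (;;)
		map[0] = 1;	/* fault the huge page back in */
	return NULL;
}

static void *dropper(void *arg)
{
	(void)arg;
	for (;;)		/* zap the page again and again */
		madvise((void *)map, HPAGE_SIZE, MADV_DONTNEED);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	map = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	pthread_create(&t1, NULL, toucher, NULL);
	pthread_create(&t2, NULL, dropper, NULL);
	pause();	/* on a bad run the process dies with SIGBUS */
	return 0;
}

Something like gcc -pthread repro.c (the file name is arbitrary) with
/proc/sys/vm/nr_hugepages set to 1 should exercise the window described
above; with the vma_lock held across tlb_finish_mmu the faulting thread
should no longer see a transiently empty free list.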