On 10/24/22 14:55, Mike Kravetz wrote:
> On 10/22/22 19:50, Mike Kravetz wrote:
> > madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
> > page tables associated with the address range. For hugetlb vmas,
> > zap_page_range will call __unmap_hugepage_range_final. However,
> > __unmap_hugepage_range_final assumes the passed vma is about to be
> > removed and deletes the vma_lock to prevent pmd sharing as the vma is
> > on the way out. In the case of madvise(MADV_DONTNEED) the vma remains,
> > but the missing vma_lock prevents pmd sharing and could potentially
> > lead to issues with truncation/fault races.
> > 
> > This issue was originally reported here [1] as a BUG triggered in
> > page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
> > vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
> > prevent pmd sharing. Subsequent faults on this vma were confused as
> > VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
> > was not set in new pages added to the page table. This resulted in
> > pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.
> > 
> > Create a new routine clear_hugetlb_page_range() that can be called from
> > madvise(MADV_DONTNEED) for hugetlb vmas. It has the same setup as
> > zap_page_range, but does not delete the vma_lock.
> 
> After seeing a syzbot use after free report [2] that is also addressed by
> this patch, I started thinking ...
> 
> When __unmap_hugepage_range_final was created, the only time unmap_single_vma
> was called for hugetlb vmas was during process exit time via exit_mmap. I got
> in trouble when I added a call via madvise(MADV_DONTNEED) which calls
> zap_page_range. This patch takes care of that calling path by having
> madvise(MADV_DONTNEED) call a new routine clear_hugetlb_page_range instead of
> zap_page_range for hugetlb vmas. The use after free bug had me auditing code
> paths to make sure __unmap_hugepage_range_final was REALLY only called at
> process exit time. If not, and we could fault on a vma after calling
> __unmap_hugepage_range_final we would be in trouble.
> 
> My thought was, what if we had __unmap_hugepage_range_final check mm->mm_users
> to determine if it was being called in the process exit path? If !mm_users,
> then we can delete the vma_lock to prevent pmd sharing as we know the process
> is exiting. If not, we do not delete the lock. That seems to be more robust
> and would prevent issues if someone accidentally introduces a new code path
> where __unmap_hugepage_range_final (unmap_single_vma for a hugetlb vma)
> could be called outside process exit context.
> 
> Thoughts?
> 
> [2] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@xxxxxxxxxx/

Sorry if this seems like I am talking to myself. Here is a proposed v3
as described above.

>From 1466fd43e180ede3f6479d1dca4e7f350f86f80b Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Date: Mon, 24 Oct 2022 15:40:05 -0700
Subject: [PATCH v3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
 processing

When hugetlb madvise(MADV_DONTNEED) support was added, the existing code
to call zap_page_range() to clear the page tables associated with the
address range was not modified. However, for hugetlb vmas zap_page_range
will call __unmap_hugepage_range_final. This routine assumes the passed
hugetlb vma is about to be removed and deletes the vma_lock to prevent
pmd sharing as the vma is on the way out. In the case of
madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock prevents
pmd sharing and could potentially lead to issues with truncation/fault
races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing. Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
was not set in new pages added to the page table. This resulted in
pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.

__unmap_hugepage_range_final was originally designed only to be called
in the context of process exit (exit_mmap). It is now called in the
context of madvise(MADV_DONTNEED). Restructure the routine and check
for !mm_users which indicates it is being called in the context of
process exit. If being called in process exit context, delete the
vma_lock. Otherwise, just unmap and leave the lock. Since the routine
is called in more than just process exit context, rename to eliminate
'final' as __unmap_hugetlb_page_range.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@xxxxxxxxxxxxxx/
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Reported-by: Wei Chen <harperchen1110@xxxxxxxxx>
Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
---
v3 - Check for !mm_users in __unmap_hugepage_range_final instead of
     creating a separate function.

 include/linux/hugetlb.h |  4 ++--
 mm/hugetlb.c            | 30 ++++++++++++++++++++----------
 mm/memory.c             |  2 +-
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..bc19a1f6ca10 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,7 +158,7 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
 			  struct page *ref_page, zap_flags_t zap_flags);
@@ -418,7 +418,7 @@ static inline unsigned long hugetlb_change_protection(
 	return 0;
 }
 
-static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+static inline void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, struct page *ref_page,
 			zap_flags_t zap_flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 931789a8f734..3fe1152c3c20 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5202,27 +5202,37 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		tlb_flush_mmu_tlbonly(tlb);
 }
 
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
 {
+	struct mm_struct *mm = vma->vm_mm;
+
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 
 	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
 	/*
-	 * Unlock and free the vma lock before releasing i_mmap_rwsem. When
-	 * the vma_lock is freed, this makes the vma ineligible for pmd
-	 * sharing. And, i_mmap_rwsem is required to set up pmd sharing.
-	 * This is important as page tables for this unmapped range will
-	 * be asynchrously deleted. If the page tables are shared, there
-	 * will be issues when accessed by someone else.
+	 * Free the vma_lock here if process exiting
 	 */
-	__hugetlb_vma_unlock_write_free(vma);
-
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	if (!atomic_read(&mm->mm_users)) {
+		/*
+		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
+		 * When the vma_lock is freed, this makes the vma ineligible
+		 * for pmd sharing. And, i_mmap_rwsem is required to set up
+		 * pmd sharing. This is important as page tables for this
+		 * unmapped range will be asynchrously deleted. If the page
+		 * tables are shared, there will be issues when accessed by
+		 * someone else.
+		 */
+		__hugetlb_vma_unlock_write_free(vma);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	} else {
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		hugetlb_vma_unlock_write(vma);
+	}
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
diff --git a/mm/memory.c b/mm/memory.c
index 8e72f703ed99..1de8ea504047 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1687,7 +1687,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			if (vma->vm_file) {
 				zap_flags_t zap_flags = details ?
 							details->zap_flags : 0;
-				__unmap_hugepage_range_final(tlb, vma, start, end,
+				__unmap_hugetlb_page_range(tlb, vma, start, end,
 							     NULL, zap_flags);
 			}
 		} else
-- 
2.37.3
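For reference, the path being changed here is driven from userspace by
madvise(MADV_DONTNEED) on a hugetlb mapping followed by new faults on the
same vma. A minimal sketch of such a sequence is below. It is illustrative
only and not part of the patch; the hugetlbfs mount point, file name and
2MB huge page size are assumptions that depend on system configuration.

/*
 * Illustrative sketch only: populate a MAP_SHARED hugetlb mapping, zap its
 * page tables with madvise(MADV_DONTNEED) (supported for hugetlb since
 * commit 90e7e7f5ef3f), then fault on the same vma again.  Assumes a
 * hugetlbfs mount at /dev/hugepages and at least one free 2MB huge page.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed huge page size */

int main(void)
{
	int fd = open("/dev/hugepages/madv-test", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ftruncate(fd, HPAGE_SIZE)) {
		perror("ftruncate");
		return 1;
	}

	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 1, HPAGE_SIZE);		/* populate the huge page */
	madvise(p, HPAGE_SIZE, MADV_DONTNEED);	/* zap page tables; vma stays */
	memset(p, 2, HPAGE_SIZE);		/* fault again on the same vma */

	munmap(p, HPAGE_SIZE);
	close(fd);
	unlink("/dev/hugepages/madv-test");
	return 0;
}

The madvise call is the step that reaches __unmap_hugepage_range_final
(__unmap_hugetlb_page_range after this patch) on a vma that is not going
away; the final memset then faults on that same vma, which is why the
vma_lock must be kept rather than deleted in this case.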