On 08/30/22 10:02, Miaohe Lin wrote:
> On 2022/8/25 1:57, Mike Kravetz wrote:
> > The new hugetlb vma lock (rw semaphore) is used to address this race:
> > 
> > Faulting thread                                 Unsharing thread
> > ...                                                  ...
> > ptep = huge_pte_offset()
> >       or
> > ptep = huge_pte_alloc()
> > ...
> >                                                 i_mmap_lock_write
> > lock page table
> > ptep invalid   <------------------------        huge_pmd_unshare()
> > Could be in a previously                        unlock_page_table
> > sharing process or worse                        i_mmap_unlock_write
> > ...
> > 
> > The vma_lock is used as follows:
> > - During fault processing.  The lock is acquired in read mode before
> >   doing a page table lock and allocation (huge_pte_alloc).  The lock is
> >   held until code is finished with the page table entry (ptep).
> > - The lock must be held in write mode whenever huge_pmd_unshare is
> >   called.
> > 
> > Lock ordering issues come into play when unmapping a page from all
> > vmas mapping the page.  The i_mmap_rwsem must be held to search for the
> > vmas, and the vma lock must be held before calling unmap which will
> > call huge_pmd_unshare.  This is done today in:
> > - try_to_migrate_one and try_to_unmap_ for page migration and memory
> >   error handling.  In these routines we 'try' to obtain the vma lock and
> >   fail to unmap if unsuccessful.  Calling routines already deal with the
> >   failure of unmapping.
> > - hugetlb_vmdelete_list for truncation and hole punch.  This routine
> >   also tries to acquire the vma lock.  If it fails, it skips the
> >   unmapping.  However, we can not have file truncation or hole punch
> >   fail because of contention.  After hugetlb_vmdelete_list, truncation
> >   and hole punch call remove_inode_hugepages.  remove_inode_hugepages
> >   checks for mapped pages and calls hugetlb_unmap_file_page to unmap them.
> >   hugetlb_unmap_file_page is designed to drop locks and reacquire in the
> >   correct order to guarantee unmap success.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > ---
> >  fs/hugetlbfs/inode.c |  46 +++++++++++++++++++
> >  mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
> >  mm/memory.c          |   2 +
> >  mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
> >  mm/userfaultfd.c     |   9 +++-
> >  5 files changed, 214 insertions(+), 45 deletions(-)
> > 
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index b93d131b0cb5..52d9b390389b 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  					struct folio *folio, pgoff_t index)
> >  {
> >  	struct rb_root_cached *root = &mapping->i_mmap;
> > +	unsigned long skipped_vm_start;
> > +	struct mm_struct *skipped_mm;
> >  	struct page *page = &folio->page;
> >  	struct vm_area_struct *vma;
> >  	unsigned long v_start;
> > @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  	end = ((index + 1) * pages_per_huge_page(h));
> >  
> >  	i_mmap_lock_write(mapping);
> > +retry:
> > +	skipped_mm = NULL;
> >  
> >  	vma_interval_tree_foreach(vma, root, start, end - 1) {
> >  		v_start = vma_offset_start(vma, start);
> > @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >  			continue;
> >  
> > +		if (!hugetlb_vma_trylock_write(vma)) {
> > +			/*
> > +			 * If we can not get vma lock, we need to drop
> > +			 * immap_sema and take locks in order.
> > +			 */
> > +			skipped_vm_start = vma->vm_start;
> > +			skipped_mm = vma->vm_mm;
> > +			/* grab mm-struct as we will be dropping i_mmap_sema */
> > +			mmgrab(skipped_mm);
> > +			break;
> > +		}
> > +
> >  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >  				NULL, ZAP_FLAG_DROP_MARKER);
> > +		hugetlb_vma_unlock_write(vma);
> >  	}
> >  
> >  	i_mmap_unlock_write(mapping);
> > +
> > +	if (skipped_mm) {
> > +		mmap_read_lock(skipped_mm);
> > +		vma = find_vma(skipped_mm, skipped_vm_start);
> > +		if (!vma || !is_vm_hugetlb_page(vma) ||
> > +		    vma->vm_file->f_mapping != mapping ||
> > +		    vma->vm_start != skipped_vm_start) {
> 
> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
> 

Yes, that is missing.  I will add it here.

> > +			mmap_read_unlock(skipped_mm);
> > +			mmdrop(skipped_mm);
> > +			goto retry;
> > +		}
> > +
> 
> IMHO, the above check is not enough. Think about the below scene:
> 
> CPU 1                                   CPU 2
> hugetlb_unmap_file_folio                exit_mmap
>   mmap_read_lock(skipped_mm);             mmap_read_lock(mm);
>   check vma is wanted.
>                                           unmap_vmas
>   mmap_read_unlock(skipped_mm);           mmap_read_unlock
>                                           mmap_write_lock(mm);
>                                           free_pgtables
>                                           remove_vma
>                                             hugetlb_vma_lock_free
>   vma, hugetlb_vma_lock is still *used after free*
>                                           mmap_write_unlock(mm);
> So we should check mm->mm_users == 0 to fix the above issue. Or am I missing something?

In the retry case, we are OK because we go back and look up the vma again.  Right?

After taking mmap_read_lock, the vma can not go away until we mmap_read_unlock.
Before that, we do the following:

> > +			hugetlb_vma_lock_write(vma);
> > +			i_mmap_lock_write(mapping);

IIUC, the vma can not go away while we hold i_mmap_lock_write.  So, after this we can:

> > +			mmap_read_unlock(skipped_mm);
> > +			mmdrop(skipped_mm);

We continue to hold i_mmap_lock_write as we goto retry.

I could be missing something as well.  This was how I intended to keep the vma
valid while dropping and acquiring locks.

> > +
> > +			v_start = vma_offset_start(vma, start);
> > +			v_end = vma_offset_end(vma, end);
> > +			unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> > +					NULL, ZAP_FLAG_DROP_MARKER);
> > +			hugetlb_vma_unlock_write(vma);
> > +
> > +			goto retry;
> 
> Should there be a cond_resched() here in case this function takes a really long time?
> 

I think we will at most retry once.

> > +	}
> >  }
> >  
> >  static void
> > @@ -474,11 +516,15 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> >  		unsigned long v_start;
> >  		unsigned long v_end;
> >  
> > +		if (!hugetlb_vma_trylock_write(vma))
> > +			continue;
> > +
> >  		v_start = vma_offset_start(vma, start);
> >  		v_end = vma_offset_end(vma, end);
> >  
> >  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >  				NULL, zap_flags);
> > +		hugetlb_vma_unlock_write(vma);
> >  	}
> 
> unmap_hugepage_range is not called under hugetlb_vma_lock in unmap_ref_private since it's a private vma?
> Add a comment to avoid future confusion?
> 
> > }

Sure, I will add a comment before hugetlb_vma_lock.

> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6fb0bff2c7ee..5912c2b97ddf 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4801,6 +4801,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  		mmu_notifier_invalidate_range_start(&range);
> >  		mmap_assert_write_locked(src);
> >  		raw_write_seqcount_begin(&src->write_protect_seq);
> > +	} else {
> > +		/*
> > +		 * For shared mappings the vma lock must be held before
> > +		 * calling huge_pte_offset in the src vma.
> > +		 * Otherwise, the
> 
> s/huge_pte_offset/huge_pte_alloc/, i.e. huge_pte_alloc could return a shared pmd, not huge_pte_offset, which
> might lead to confusion. But this is really trivial...

Actually, it is huge_pte_offset.  While looking up ptes in the source vma, we
do not want to race with other threads in the source process which could be
doing a huge_pmd_unshare.  Otherwise, the returned pte could be invalid.

FYI - Most of this code is now 'dead' because of bcd51a3c679d "Lazy page table
copies in fork()".  We will not copy shared mappings at fork time.

> 
> Except for the above comments, this patch looks good to me.
> Thank you!

Thank you!  Thank you!  For looking at this series and all your comments.

I hope to send out v2 next week.
-- 
Mike Kravetz
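
For readers less familiar with the locking pattern discussed above, here is a
minimal userspace sketch of the same trylock/back-off/retry idea that
hugetlb_unmap_file_folio() relies on.  It uses pthreads rather than kernel
locks, and all names in it (scan_lock, object_lock, object_is_valid, do_unmap)
are hypothetical stand-ins, not kernel interfaces.

	/*
	 * Sketch: scan_lock plays the role of i_mmap_rwsem, object_lock the
	 * role of the per-vma hugetlb lock.  The documented order is
	 * object_lock first, then scan_lock; while scanning we already hold
	 * scan_lock, so we may only trylock object_lock there.
	 */
	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>

	static pthread_mutex_t scan_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t object_lock = PTHREAD_MUTEX_INITIALIZER;

	static bool object_is_valid(void)
	{
		/* stand-in for re-looking-up the object after dropping locks */
		return true;
	}

	static void do_unmap(void)
	{
		puts("unmapping with both locks held");
	}

	static void unmap_with_lock_ordering(void)
	{
	retry:
		pthread_mutex_lock(&scan_lock);

		if (pthread_mutex_trylock(&object_lock)) {
			/*
			 * Blocking here while holding scan_lock could deadlock
			 * against a thread that takes object_lock first.  Drop
			 * scan_lock and take both locks in the correct order.
			 */
			pthread_mutex_unlock(&scan_lock);

			pthread_mutex_lock(&object_lock);
			pthread_mutex_lock(&scan_lock);

			/* the object may have changed while no locks were held */
			if (!object_is_valid()) {
				pthread_mutex_unlock(&scan_lock);
				pthread_mutex_unlock(&object_lock);
				goto retry;
			}
		}

		do_unmap();

		pthread_mutex_unlock(&object_lock);
		pthread_mutex_unlock(&scan_lock);
	}

	int main(void)
	{
		unmap_with_lock_ordering();
		return 0;
	}

The point is that the fast path only ever trylocks the inner lock while the
scan lock is held; the slow path gives up the scan lock, takes both locks in
the documented order, revalidates what it found, and retries.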