On 2022/9/14 8:50, Mike Kravetz wrote:
> On 09/13/22 10:14, Miaohe Lin wrote:
>> On 2022/9/13 7:02, Mike Kravetz wrote:
>>> On 09/05/22 11:08, Miaohe Lin wrote:
>>>> On 2022/9/3 7:07, Mike Kravetz wrote:
>>>>> On 08/30/22 10:02, Miaohe Lin wrote:
>>>>>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>>>>>> The new hugetlb vma lock (rw semaphore) is used to address this race:
>>>>>>>
>>>>>>> Faulting thread                                 Unsharing thread
>>>>>>> ...                                                  ...
>>>>>>> ptep = huge_pte_offset()
>>>>>>>       or
>>>>>>> ptep = huge_pte_alloc()
>>>>>>> ...
>>>>>>>                                                 i_mmap_lock_write
>>>>>>>                                                 lock page table
>>>>>>> ptep invalid   <------------------------        huge_pmd_unshare()
>>>>>>> Could be in a previously                        unlock_page_table
>>>>>>> sharing process or worse                        i_mmap_unlock_write
>>>>>>> ...
>>>>>>>
>>>>>>> The vma_lock is used as follows:
>>>>>>> - During fault processing. the lock is acquired in read mode before
>>>>>>>   doing a page table lock and allocation (huge_pte_alloc). The lock is
>>>>>>>   held until code is finished with the page table entry (ptep).
>>>>>>> - The lock must be held in write mode whenever huge_pmd_unshare is
>>>>>>>   called.
>>>>>>>
>>>>>>> Lock ordering issues come into play when unmapping a page from all
>>>>>>> vmas mapping the page. The i_mmap_rwsem must be held to search for the
>>>>>>> vmas, and the vma lock must be held before calling unmap which will
>>>>>>> call huge_pmd_unshare. This is done today in:
>>>>>>> - try_to_migrate_one and try_to_unmap_ for page migration and memory
>>>>>>>   error handling. In these routines we 'try' to obtain the vma lock and
>>>>>>>   fail to unmap if unsuccessful. Calling routines already deal with the
>>>>>>>   failure of unmapping.
>>>>>>> - hugetlb_vmdelete_list for truncation and hole punch. This routine
>>>>>>>   also tries to acquire the vma lock. If it fails, it skips the
>>>>>>>   unmapping. However, we can not have file truncation or hole punch
>>>>>>>   fail because of contention.
>>>>>>>   After hugetlb_vmdelete_list, truncation
>>>>>>>   and hole punch call remove_inode_hugepages. remove_inode_hugepages
>>>>>>>   check for mapped pages and call hugetlb_unmap_file_page to unmap them.
>>>>>>>   hugetlb_unmap_file_page is designed to drop locks and reacquire in the
>>>>>>>   correct order to guarantee unmap success.
>>>>>>>
>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>>>>>>> ---
>>>>>>>  fs/hugetlbfs/inode.c |  46 +++++++++++++++++++
>>>>>>>  mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
>>>>>>>  mm/memory.c          |   2 +
>>>>>>>  mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
>>>>>>>  mm/userfaultfd.c     |   9 +++-
>>>>>>>  5 files changed, 214 insertions(+), 45 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>>>> index b93d131b0cb5..52d9b390389b 100644
>>>>>>> --- a/fs/hugetlbfs/inode.c
>>>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>>>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>>  					struct folio *folio, pgoff_t index)
>>>>>>>  {
>>>>>>>  	struct rb_root_cached *root = &mapping->i_mmap;
>>>>>>> +	unsigned long skipped_vm_start;
>>>>>>> +	struct mm_struct *skipped_mm;
>>>>>>>  	struct page *page = &folio->page;
>>>>>>>  	struct vm_area_struct *vma;
>>>>>>>  	unsigned long v_start;
>>>>>>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>>  	end = ((index + 1) * pages_per_huge_page(h));
>>>>>>>
>>>>>>>  	i_mmap_lock_write(mapping);
>>>>>>> +retry:
>>>>>>> +	skipped_mm = NULL;
>>>>>>>
>>>>>>>  	vma_interval_tree_foreach(vma, root, start, end - 1) {
>>>>>>>  		v_start = vma_offset_start(vma, start);
>>>>>>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>>  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>>>>>>>  			continue;
>>>>>>>
>>>>>>> +		if (!hugetlb_vma_trylock_write(vma)) {
>>>>>>> +			/*
>>>>>>> +			 * If we can not get vma lock, we need to drop
>>>>>>> +			 * immap_sema and take locks in order.
>>>>>>> +			 */
>>>>>>> +			skipped_vm_start = vma->vm_start;
>>>>>>> +			skipped_mm = vma->vm_mm;
>>>>>>> +			/* grab mm-struct as we will be dropping i_mmap_sema */
>>>>>>> +			mmgrab(skipped_mm);
>>>>>>> +			break;
>>>>>>> +		}
>>>>>>> +
>>>>>>>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>>>>>>  					NULL, ZAP_FLAG_DROP_MARKER);
>>>>>>> +		hugetlb_vma_unlock_write(vma);
>>>>>>>  	}
>>>>>>>
>>>>>>>  	i_mmap_unlock_write(mapping);
>>>>>>> +
>>>>>>> +	if (skipped_mm) {
>>>>>>> +		mmap_read_lock(skipped_mm);
>>>>>>> +		vma = find_vma(skipped_mm, skipped_vm_start);
>>>>>>> +		if (!vma || !is_vm_hugetlb_page(vma) ||
>>>>>>> +					vma->vm_file->f_mapping != mapping ||
>>>>>>> +					vma->vm_start != skipped_vm_start) {
>>>>>>
>>>>>> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
>>>>>>
>>>>>
>>>>> Yes, that is missing. I will add here.
>>>>>
>>>>>>> +			mmap_read_unlock(skipped_mm);
>>>>>>> +			mmdrop(skipped_mm);
>>>>>>> +			goto retry;
>>>>>>> +		}
>>>>>>> +
>>>>>>
>>>>>> IMHO, above check is not enough. Think about the below scene:
>>>>>>
>>>>>> CPU 1                                  CPU 2
>>>>>> hugetlb_unmap_file_folio               exit_mmap
>>>>>>   mmap_read_lock(skipped_mm);            mmap_read_lock(mm);
>>>>>>   check vma is wanted.
>>>>>>                                          unmap_vmas
>>>>>>   mmap_read_unlock(skipped_mm);          mmap_read_unlock
>>>>>>                                          mmap_write_lock(mm);
>>>>>>                                          free_pgtables
>>>>>>                                          remove_vma
>>>>>>                                            hugetlb_vma_lock_free
>>>>>>   vma, hugetlb_vma_lock is still *used after free*
>>>>>>                                          mmap_write_unlock(mm);
>>>>>> So we should check mm->mm_users == 0 to fix the above issue. Or am I miss something?
>>>>>
>>>>> In the retry case, we are OK because go back and look up the vma again. Right?
>>>>>
>>>>> After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
>>>>> Before that, we do the following:
>>>>>
>>>>>>> +		hugetlb_vma_lock_write(vma);
>>>>>>> +		i_mmap_lock_write(mapping);
>>>>>
>>>>> IIUC, vma can not go away while we hold i_mmap_lock_write. So, after this we
>>>>
>>>> I think you're right.
>>>> free_pgtables() can't complete its work as unlink_file_vma() will be
>>>> blocked on i_mmap_rwsem of mapping. Sorry for reporting such nonexistent race.
>>>>
>>>>> can.
>>>>>
>>>>>>> +		mmap_read_unlock(skipped_mm);
>>>>>>> +		mmdrop(skipped_mm);
>>>>>
>>>>> We continue to hold i_mmap_lock_write as we goto retry.
>>>>>
>>>>> I could be missing something as well. This was how I intended to keep
>>>>> vma valid while dropping and acquiring locks.
>>>>
>>>> Thanks for your clarifying.
>>>>
>>>
>>> Well, that was all correct 'in theory' but not in practice. I did not take
>>> into account the inode lock that is taken at the beginning of truncate (or
>>> hole punch). In other code paths, we take inode lock after mmap_lock. So,
>>> taking mmap_lock here is not allowed.
>>
>> Considering the Lock ordering in mm/filemap.c:
>>
>>  *  ->i_rwsem
>>  *    ->invalidate_lock		(acquired by fs in truncate path)
>>  *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
>>
>>  *  ->i_rwsem			(generic_perform_write)
>>  *    ->mmap_lock		(fault_in_readable->do_page_fault)
>>
>> It seems inode_lock is taken before the mmap_lock?
>
> Hmmmm? I can't find a sequence where inode_lock is taken after mmap_lock.
> lockdep was complaining about taking mmap_lock after i_rwsem in the above code.
> I assumed there was such a sequence somewhere. Might need to go back and get
> another trace/warning.

Sorry, I'm somewhat confused. Take generic_file_write_iter() as an example:

generic_file_write_iter
  inode_lock(inode); -- *inode lock is held here*
  __generic_file_write_iter
    generic_perform_write
      fault_in_iov_iter_readable -- *may cause page fault and thus take mmap_lock*
  inode_unlock(inode);

This is the documented example in the mm/filemap.c. So we should take inode_lock before taking mmap_lock.
Or this is out-dated ? And above example needs a fix?

>
> In any case, I think the scheme below is much cleaner. Doing another round of
> benchmarking before sending.

That should be a good alternative. Thanks for your work.
:)

Thanks,
Miaohe Lin

>
>>> I came up with another way to make this work. As discussed above, we need to
>>> drop the i_mmap lock before acquiring the vma_lock. However, once we drop
>>> i_mmap, the vma could go away. My solution is to make the 'vma_lock' be a
>>> ref counted structure that can live on after the vma is freed. Therefore,
>>> this code can take a reference while under i_mmap then drop i_mmap and wait
>>> on the vma_lock. Of course, once it acquires the vma_lock it needs to check
>>> and make sure the vma still exists. It may sound complicated, but I think
>>> it is a bit simpler than the code here. A new series will be out soon.
>>>
>
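
For readers following the thread, the ref-counted vma_lock idea quoted above can be illustrated with a small userspace sketch. This is not the kernel code and not the follow-up series: every name here (toy_vma, toy_vma_lock, toy_demo and so on) is hypothetical, the rw semaphore is replaced by a trivial atomic flag, and the demo is single-threaded. It only shows the lifetime rule: pin the lock object while the vma is known to be valid, let the vma go away, then acquire the lock and check whether the vma still exists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these names come from the patch. */
struct toy_vma;

struct toy_vma_lock {
	atomic_int refs;	/* one reference for the vma, one per pinned user */
	atomic_bool held;	/* stand-in for the rw semaphore */
	struct toy_vma *vma;	/* set to NULL once the vma has been freed */
};

struct toy_vma {
	struct toy_vma_lock *lock;
};

static int toy_locks_freed;	/* lets the demo observe when the lock is freed */

static void toy_lock_get(struct toy_vma_lock *l)
{
	atomic_fetch_add(&l->refs, 1);
}

static void toy_lock_put(struct toy_vma_lock *l)
{
	/* atomic_fetch_sub returns the old value: 1 means we held the last ref */
	if (atomic_fetch_sub(&l->refs, 1) == 1) {
		free(l);
		toy_locks_freed++;
	}
}

static void toy_lock_acquire(struct toy_vma_lock *l)
{
	while (atomic_exchange(&l->held, true))
		;	/* spin; a real lock would sleep */
}

static void toy_lock_release(struct toy_vma_lock *l)
{
	atomic_store(&l->held, false);
}

static struct toy_vma *toy_vma_alloc(void)
{
	struct toy_vma *vma = malloc(sizeof(*vma));
	struct toy_vma_lock *l = malloc(sizeof(*l));

	assert(vma && l);
	atomic_init(&l->refs, 1);	/* the vma's own reference */
	atomic_init(&l->held, false);
	l->vma = vma;
	vma->lock = l;
	return vma;
}

/* Freeing the vma detaches it from the lock but only drops one reference. */
static void toy_vma_free(struct toy_vma *vma)
{
	struct toy_vma_lock *l = vma->lock;

	toy_lock_acquire(l);
	l->vma = NULL;
	toy_lock_release(l);
	toy_lock_put(l);
	free(vma);
}

/* Caller pinned the lock earlier; returns true if the vma still exists. */
static bool toy_unmap_with_pinned_lock(struct toy_vma_lock *l)
{
	bool alive;

	toy_lock_acquire(l);
	alive = (l->vma != NULL);	/* only unmap when this is true */
	toy_lock_release(l);
	toy_lock_put(l);		/* drop the pin */
	return alive;
}

/* Returns 0 when the lock outlives the vma and is freed by the last put. */
static int toy_demo(void)
{
	int start = toy_locks_freed;
	struct toy_vma *vma = toy_vma_alloc();
	struct toy_vma_lock *l = vma->lock;
	int freed_after_vma_free;
	bool alive;

	toy_lock_get(l);	/* pin, as if still under i_mmap */
	toy_vma_free(vma);	/* vma goes away after "i_mmap" is dropped */
	freed_after_vma_free = toy_locks_freed - start;
	alive = toy_unmap_with_pinned_lock(l);

	return (freed_after_vma_free == 0 && !alive &&
		toy_locks_freed - start == 1) ? 0 : 1;
}
```

In this sketch the waiter never dereferences the vma pointer it saw earlier; it only touches the pinned lock object and re-reads the back-pointer under the lock, which is the check the quoted text describes as "make sure the vma still exists".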