On 09/13/22 10:14, Miaohe Lin wrote:
> On 2022/9/13 7:02, Mike Kravetz wrote:
> > On 09/05/22 11:08, Miaohe Lin wrote:
> >> On 2022/9/3 7:07, Mike Kravetz wrote:
> >>> On 08/30/22 10:02, Miaohe Lin wrote:
> >>>> On 2022/8/25 1:57, Mike Kravetz wrote:
> >>>>> The new hugetlb vma lock (rw semaphore) is used to address this race:
> >>>>>
> >>>>> Faulting thread                                 Unsharing thread
> >>>>> ...                                             ...
> >>>>> ptep = huge_pte_offset()
> >>>>>       or
> >>>>> ptep = huge_pte_alloc()
> >>>>> ...
> >>>>>                                                 i_mmap_lock_write
> >>>>>                                                 lock page table
> >>>>> ptep invalid   <------------------------       huge_pmd_unshare()
> >>>>> Could be in a previously                       unlock_page_table
> >>>>> sharing process or worse                       i_mmap_unlock_write
> >>>>> ...
> >>>>>
> >>>>> The vma_lock is used as follows:
> >>>>> - During fault processing, the lock is acquired in read mode before
> >>>>>   doing a page table lock and allocation (huge_pte_alloc).  The lock is
> >>>>>   held until code is finished with the page table entry (ptep).
> >>>>> - The lock must be held in write mode whenever huge_pmd_unshare is
> >>>>>   called.
> >>>>>
> >>>>> Lock ordering issues come into play when unmapping a page from all
> >>>>> vmas mapping the page.  The i_mmap_rwsem must be held to search for
> >>>>> the vmas, and the vma lock must be held before calling unmap which
> >>>>> will call huge_pmd_unshare.  This is done today in:
> >>>>> - try_to_migrate_one and try_to_unmap_one for page migration and
> >>>>>   memory error handling.  In these routines we 'try' to obtain the
> >>>>>   vma lock and fail to unmap if unsuccessful.  Calling routines
> >>>>>   already deal with the failure of unmapping.
> >>>>> - hugetlb_vmdelete_list for truncation and hole punch.  This routine
> >>>>>   also tries to acquire the vma lock.  If it fails, it skips the
> >>>>>   unmapping.  However, we can not have file truncation or hole punch
> >>>>>   fail because of contention.  After hugetlb_vmdelete_list, truncation
> >>>>>   and hole punch call remove_inode_hugepages.  remove_inode_hugepages
> >>>>>   checks for mapped pages and calls hugetlb_unmap_file_folio to unmap
> >>>>>   them.  hugetlb_unmap_file_folio is designed to drop locks and
> >>>>>   reacquire them in the correct order to guarantee unmap success.
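For readers following along: the fault-side rule above boils down to the
pattern sketched below.  This is only an illustration built on the
hugetlb_vma_lock_read()/hugetlb_vma_unlock_read() helpers this series
introduces; it is not the actual hugetlb_fault() code, and the real
fault processing is elided.

	/*
	 * Sketch of the fault path rule: take the vma lock in read
	 * mode before looking up/allocating the ptep, and hold it for
	 * as long as the ptep is used.
	 */
	hugetlb_vma_lock_read(vma);
	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
	if (!ptep) {
		hugetlb_vma_unlock_read(vma);
		return VM_FAULT_OOM;
	}
	/* ... fault processing that dereferences ptep ... */
	hugetlb_vma_unlock_read(vma);
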
> >>>>>
> >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> >>>>> ---
> >>>>>  fs/hugetlbfs/inode.c |  46 +++++++++++++++++++
> >>>>>  mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
> >>>>>  mm/memory.c          |   2 +
> >>>>>  mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
> >>>>>  mm/userfaultfd.c     |   9 +++-
> >>>>>  5 files changed, 214 insertions(+), 45 deletions(-)
> >>>>>
> >>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> >>>>> index b93d131b0cb5..52d9b390389b 100644
> >>>>> --- a/fs/hugetlbfs/inode.c
> >>>>> +++ b/fs/hugetlbfs/inode.c
> >>>>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>>  					struct folio *folio, pgoff_t index)
> >>>>>  {
> >>>>>  	struct rb_root_cached *root = &mapping->i_mmap;
> >>>>> +	unsigned long skipped_vm_start;
> >>>>> +	struct mm_struct *skipped_mm;
> >>>>>  	struct page *page = &folio->page;
> >>>>>  	struct vm_area_struct *vma;
> >>>>>  	unsigned long v_start;
> >>>>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>>  	end = ((index + 1) * pages_per_huge_page(h));
> >>>>>
> >>>>>  	i_mmap_lock_write(mapping);
> >>>>> +retry:
> >>>>> +	skipped_mm = NULL;
> >>>>>
> >>>>>  	vma_interval_tree_foreach(vma, root, start, end - 1) {
> >>>>>  		v_start = vma_offset_start(vma, start);
> >>>>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>>  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >>>>>  			continue;
> >>>>>
> >>>>> +		if (!hugetlb_vma_trylock_write(vma)) {
> >>>>> +			/*
> >>>>> +			 * If we can not get the vma lock, we need to drop
> >>>>> +			 * i_mmap_sema and take the locks in order.
> >>>>> +			 */
> >>>>> +			skipped_vm_start = vma->vm_start;
> >>>>> +			skipped_mm = vma->vm_mm;
> >>>>> +			/* grab mm-struct as we will be dropping i_mmap_sema */
> >>>>> +			mmgrab(skipped_mm);
> >>>>> +			break;
> >>>>> +		}
> >>>>> +
> >>>>>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >>>>>  				     NULL, ZAP_FLAG_DROP_MARKER);
> >>>>> +		hugetlb_vma_unlock_write(vma);
> >>>>>  	}
> >>>>>
> >>>>>  	i_mmap_unlock_write(mapping);
> >>>>> +
> >>>>> +	if (skipped_mm) {
> >>>>> +		mmap_read_lock(skipped_mm);
> >>>>> +		vma = find_vma(skipped_mm, skipped_vm_start);
> >>>>> +		if (!vma || !is_vm_hugetlb_page(vma) ||
> >>>>> +		    vma->vm_file->f_mapping != mapping ||
> >>>>> +		    vma->vm_start != skipped_vm_start) {
> >>>>
> >>>> i_mmap_lock_write(mapping) is missing here? The retry logic will do
> >>>> i_mmap_unlock_write(mapping) anyway.
> >>>>
> >>>
> >>> Yes, that is missing.  I will add it here.
> >>>
> >>>>> +			mmap_read_unlock(skipped_mm);
> >>>>> +			mmdrop(skipped_mm);
> >>>>> +			goto retry;
> >>>>> +		}
> >>>>> +
> >>>>
> >>>> IMHO, the above check is not enough. Think about the below scene:
> >>>>
> >>>> CPU 1					CPU 2
> >>>> hugetlb_unmap_file_folio		exit_mmap
> >>>>   mmap_read_lock(skipped_mm);		  mmap_read_lock(mm);
> >>>>   check vma is wanted.
> >>>> 					  unmap_vmas
> >>>>   mmap_read_unlock(skipped_mm);	  mmap_read_unlock
> >>>> 					  mmap_write_lock(mm);
> >>>> 					  free_pgtables
> >>>> 					  remove_vma
> >>>> 					    hugetlb_vma_lock_free
> >>>>   vma, hugetlb_vma_lock is still *used after free*
> >>>> 					  mmap_write_unlock(mm);
> >>>>
> >>>> So we should check mm->mm_users == 0 to fix the above issue. Or am I
> >>>> missing something?
> >>>
> >>> In the retry case, we are OK because we go back and look up the vma
> >>> again.  Right?
> >>>
> >>> After taking mmap_read_lock, the vma can not go away until we
> >>> mmap_read_unlock.
> >>> Before that, we do the following:
> >>>
> >>>>> +		hugetlb_vma_lock_write(vma);
> >>>>> +		i_mmap_lock_write(mapping);
> >>>
> >>> IIUC, the vma can not go away while we hold i_mmap_lock_write.  So,
> >>> after this we
> >>
> >> I think you're right. free_pgtables() can't complete its work as
> >> unlink_file_vma() will be blocked on the i_mmap_rwsem of mapping.
> >> Sorry for reporting such a nonexistent race.
> >>
> >>> can.
> >>>
> >>>>> +		mmap_read_unlock(skipped_mm);
> >>>>> +		mmdrop(skipped_mm);
> >>>
> >>> We continue to hold i_mmap_lock_write as we goto retry.
> >>>
> >>> I could be missing something as well.  This was how I intended to keep
> >>> the vma valid while dropping and acquiring locks.
> >>
> >> Thanks for clarifying.
> >>
> >
> > Well, that was all correct 'in theory' but not in practice.  I did not
> > take into account the inode lock that is taken at the beginning of
> > truncate (or hole punch).  In other code paths, we take the inode lock
> > after mmap_lock.  So, taking mmap_lock here is not allowed.
>
> Considering the lock ordering in mm/filemap.c:
>
>  * ->i_rwsem
>  *   ->invalidate_lock		(acquired by fs in truncate path)
>  *     ->i_mmap_rwsem		(truncate->unmap_mapping_range)
>
>  * ->i_rwsem			(generic_perform_write)
>  *   ->mmap_lock		(fault_in_readable->do_page_fault)
>
> It seems inode_lock is taken before the mmap_lock?

Hmmmm?  I can't find a sequence where inode_lock is taken after
mmap_lock.  lockdep was complaining about taking mmap_lock after i_rwsem
in the above code, so I assumed there was such a sequence somewhere.
Might need to go back and get another trace/warning.

In any case, I think the scheme below is much cleaner.  Doing another
round of benchmarking before sending.

> > I came up with another way to make this work.  As discussed above, we
> > need to drop the i_mmap lock before acquiring the vma_lock.  However,
> > once we drop i_mmap, the vma could go away.  My solution is to make the
> > 'vma_lock' a ref counted structure that can live on after the vma is
> > freed.  Therefore, this code can take a reference while under i_mmap,
> > then drop i_mmap and wait on the vma_lock.  Of course, once it acquires
> > the vma_lock it needs to check and make sure the vma still exists.  It
> > may sound complicated, but I think it is a bit simpler than the code
> > here.  A new series will be out soon.
> >
-- 
Mike Kravetz
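
To make the ref counted vma_lock idea above concrete, here is a rough
sketch of what such a structure and the drop-and-reacquire dance might
look like.  All names and details are illustrative guesses ahead of the
new series, not the actual implementation:

#include <linux/fs.h>
#include <linux/kref.h>
#include <linux/mm.h>
#include <linux/rwsem.h>
#include <linux/slab.h>

struct hugetlb_vma_lock {
	struct kref refs;		/* lets the lock outlive its vma */
	struct rw_semaphore rw_sema;	/* the rw semaphore itself */
	struct vm_area_struct *vma;	/* cleared when the vma is freed */
};

static void hugetlb_vma_lock_release(struct kref *kref)
{
	struct hugetlb_vma_lock *vma_lock =
		container_of(kref, struct hugetlb_vma_lock, refs);

	kfree(vma_lock);
}

/*
 * Hypothetical helper: called with i_mmap_rwsem held after a failed
 * trylock.  Pin the lock structure (not the vma), drop i_mmap_rwsem,
 * and block on the lock in write mode.  Returns true with the lock
 * held if the vma still exists.
 */
static bool hugetlb_vma_lock_wait(struct hugetlb_vma_lock *vma_lock,
				  struct address_space *mapping)
{
	kref_get(&vma_lock->refs);
	i_mmap_unlock_write(mapping);

	down_write(&vma_lock->rw_sema);
	if (!vma_lock->vma) {
		/* vma went away while we slept */
		up_write(&vma_lock->rw_sema);
		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
		return false;
	}
	return true;	/* caller unlocks and drops the reference */
}

The key point of such a design is that kref_put, not the vma teardown
path, decides when the structure is actually freed, so a waiter that
loses the race only ever touches memory it holds a reference to.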