On 10/25/18 5:42 PM, Naoya Horiguchi wrote:
> Hi Mike,
> 
> On Tue, Oct 23, 2018 at 09:50:53PM -0700, Mike Kravetz wrote:
>> Now, anopther task truncates the hugetlbfs file. As part of truncation,
>> it unmaps everyone who has the file mapped. If a task has a shared pmd
>> in this range, huge_pmd_unshhare will be called. If this is not the last
> 
> (sorry, nitpicking ..) a few typos ("anopther" and "unshhare").

Hi Naoya,

Thanks for looking at the patch. I put this together somewhat quickly
before traveling and unfortunately made several typos. I wanted to
provide adequate documentation to help understand the changes.

>> user sharing the pmd, huge_pmd_unshare will clear pud pointing to the
>> pmd. For the task in the middle of the page fault, the ptep returned by
>> huge_pte_alloc points to another task's page table or worse. This leads
>> to bad things such as incorrect page map/reference counts or invalid
>> memory references.
>>
>> i_mmap_rwsem is currently used for pmd sharing synchronization. It is also
>> held during unmap and whenever a call to huge_pmd_unshare is possible. It
>> is only acquired in write mode. Expand and modify the use of i_mmap_rwsem
>> as follows:
>> - i_mmap_rwsem is held in write mode for the duration of truncate
>>   processing.
>> - i_mmap_rwsem is held in write mode whenever huge_pmd_share is called.
> 
> I guess you mean huge_pmd_unshare here, right?
> 

Correct, i_mmap_rwsem is held in write mode whenever huge_pmd_unshare
is called.

>> - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
>>   Today that is only via huge_pte_alloc.
>> - i_mmap_rwsem is held in read mode after huge_pte_alloc, until the caller
>>   is finished with the returned ptep.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
[...]
>> @@ -505,8 +512,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>>  	i_mmap_lock_write(mapping);
>>  	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
>>  		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
>> -	i_mmap_unlock_write(mapping);
>>  	remove_inode_hugepages(inode, offset, LLONG_MAX);
>> +	i_mmap_unlock_write(mapping);
> 
> I just have an impression that hugetlbfs_punch_hole() could have the
> similar race and extending lock range there could be an improvement,
> although I might miss something as always.
> 

You are correct. The hole punch routine (hugetlbfs_punch_hole) should
continue to hold i_mmap_rwsem in write mode until after calling
remove_inode_hugepages.
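
Something like the following (untested, and with the context lines
written from memory, so the exact placement may be slightly off) is
what I have in mind for hugetlbfs_punch_hole:

@@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		i_mmap_lock_write(mapping);
 		if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 			hugetlb_vmdelete_list(&mapping->i_mmap,
 						hole_start >> PAGE_SHIFT,
 						hole_end >> PAGE_SHIFT);
-		i_mmap_unlock_write(mapping);
 		remove_inode_hugepages(inode, hole_start, hole_end);
+		i_mmap_unlock_write(mapping);
 		inode_unlock(inode);

As in the hugetlb_vmtruncate change, the idea is simply to keep
i_mmap_rwsem held in write mode across remove_inode_hugepages so a
racing fault can not find a page table page that hole punch is in
the middle of unsharing and freeing.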

>>  	return 0;
>>  }
>>  
>> @@ -624,7 +631,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>  		/* addr is the offset within the file (zero based) */
>>  		addr = index * hpage_size;
>>  
>> -		/* mutex taken here, fault path and hole punch */
>> +		/*
>> +		 * fault mutex taken here, protects against fault path
>> +		 * and hole punch. inode_lock previously taken protects
>> +		 * against truncation.
>> +		 */
>>  		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
>>  						index, addr);
>>  		mutex_lock(&hugetlb_fault_mutex_table[hash]);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 7b5c0ad9a6bd..e9da3eee262f 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3252,18 +3252,33 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>>  
>>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
>>  		spinlock_t *src_ptl, *dst_ptl;
>> +		struct vm_area_struct *dst_vma;
>> +		struct address_space *mapping;
>> +
>>  		src_pte = huge_pte_offset(src, addr, sz);
>>  		if (!src_pte)
>>  			continue;
>> +
>> +		/*
>> +		 * i_mmap_rwsem must be held to call huge_pte_alloc.
>> +		 * Continue to hold until finished with dst_pte, otherwise
>> +		 * it could go away if part of a shared pmd.
>> +		 */
>> +		dst_vma = find_vma(dst, addr);
>> +		mapping = dst_vma->vm_file->f_mapping;
> 
> If vma->vm_file->f_mapping gives the same mapping, you may omit the find_vma()?
> 

Thanks. You are correct. 'dst_vma' should be the same as vma as it is
a copy, so the find_vma() call is unnecessary.

>> +		i_mmap_lock_read(mapping);
>>  		dst_pte = huge_pte_alloc(dst, addr, sz);
>>  		if (!dst_pte) {
>> +			i_mmap_unlock_read(mapping);
>>  			ret = -ENOMEM;
>>  			break;
>>  		}
>>  
>>  		/* If the pagetables are shared don't copy or take references */
>> -		if (dst_pte == src_pte)
>> +		if (dst_pte == src_pte) {
>> +			i_mmap_unlock_read(mapping);
>>  			continue;
>> +		}
>>  
>>  		dst_ptl = huge_pte_lock(h, dst, dst_pte);
>>  		src_ptl = huge_pte_lockptr(h, src, src_pte);
> 
> [...]
> 
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1e79fac3186b..db49e734dda8 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1347,6 +1347,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  	bool ret = true;
>>  	unsigned long start = address, end;
>>  	enum ttu_flags flags = (enum ttu_flags)arg;
>> +	bool pmd_sharing_possible = false;
>>  
>>  	/* munlock has nothing to gain from examining un-locked vmas */
>>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>> @@ -1376,8 +1377,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  	 * accordingly.
>>  	 */
>>  	adjust_range_if_pmd_sharing_possible(vma, &start, &end);
>> +	if ((end - start) > (PAGE_SIZE << compound_order(page)))
>> +		pmd_sharing_possible = true;
> 
> Maybe the similar check is done in adjust_range_if_pmd_sharing_possible()
> as the function name claims, so does it make more sense to get this bool
> value via the return value?

Yes, that makes sense. This use of adjust_range_if_pmd_sharing_possible
would be the only place a return value is used.

Thanks for your comments!

I am concerned about the use of any huge pte pointers when the page
table lock or i_mmap_rwsem is not held. There may be more instances of
this we need to protect. For example, huge_pte_offset at the beginning
of hugetlb_fault() is still called without any synchronization. I
think we may need to acquire i_mmap_rwsem before this call. I'm trying
to think of other areas that may be of concern.

-- 
Mike Kravetz
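
P.S. To make the hugetlb_fault() concern above more concrete, the
direction I am thinking of is roughly the following. This is a
completely untested sketch: it glosses over the existing
migration/hwpoison checks done on an existing pte, and the ordering
with respect to the hugetlb fault mutex still needs thought.

	/*
	 * Acquire i_mmap_rwsem in read mode before walking or
	 * allocating page tables, so that a shared pmd can not be
	 * unshared (and its page table page freed) underneath us.
	 */
	mapping = vma->vm_file->f_mapping;
	i_mmap_lock_read(mapping);

	ptep = huge_pte_offset(mm, address & huge_page_mask(h),
				huge_page_size(h));
	if (!ptep) {
		ptep = huge_pte_alloc(mm, address & huge_page_mask(h),
					huge_page_size(h));
		if (!ptep) {
			i_mmap_unlock_read(mapping);
			return VM_FAULT_OOM;
		}
	}

	/*
	 * ... remainder of fault processing, continuing to hold
	 * i_mmap_rwsem until we are finished with ptep ...
	 */
	i_mmap_unlock_read(mapping);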