Hi Mike,

On Tue, Oct 23, 2018 at 09:50:53PM -0700, Mike Kravetz wrote:
> hugetlbfs does not correctly handle page faults racing with truncation.
> In addition, shared pmds can cause additional issues.
> 
> Without pmd sharing, issues can occur as follows:
> A huegtlbfs file is mmap(MAP_SHARED) with a size of 4 pages. At
> mmap time, 4 huge pages are reserved for the file/mapping. So,
> the global reserve count is 4. In addition, since this is a shared
> mapping an entry for 4 pages is added to the file's reserve map.
> The first 3 of the 4 pages are faulted into the file. As a result,
> the global reserve count is now 1.
> 
> Task A starts to fault in the last page (routines hugetlb_fault,
> hugetlb_no_page). It allocates a huge page (alloc_huge_page).
> The reserve map indicates there is a reserved page, so this is
> used and the global reserve count goes to 0.
> 
> Now, task B truncates the file to size 0. It starts by setting
> inode size to 0(hugetlb_vmtruncate). It then unmaps all mapping
> of the file (hugetlb_vmdelete_list). Since task A's page table
> lock is not held at the time, truncation is not blocked. Truncation
> removes the 3 pages from the file (remove_inode_hugepages). When
> cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
> the reserve map was for 4 pages. However, it has only freed 3 pages.
> So it assumes there is still (4 - 3) 1 reserved pages. It then
> decrements the global reserve count by 1 and it goes negative.
> 
> Task A then continues the page fault process and adds it's newly
> acquired page to the page cache. Note that the index of this page
> is beyond the size of the truncated file (0). The page fault process
> then notices the file has been truncated and exits. However, the
> page is left in the cache associated with the file.
> 
> Now, if the file is immediately deleted the truncate code runs again.
> It will find and free the one page associated with the file. When
> cleaning up reserves, it notices the reserve map is empty. Yet, one
> page freed. So, the global reserve count is decremented by (0 - 1) -1.
> This returns the global count to 0 as it should be. But, it is
> possible for someone else to mmap this file/range before it is deleted.
> If this happens, a reserve map entry for the allocated page is created
> and the reserved page is forever leaked.
> 
> With pmd sharing, the situation is even worse. Consider the following:
> A task processes a page fault on a shared hugetlbfs file and calls
> huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
> shared pmd.
> 
> Now, anopther task truncates the hugetlbfs file. As part of truncation,
> it unmaps everyone who has the file mapped. If a task has a shared pmd
> in this range, huge_pmd_unshhare will be called. If this is not the last

(sorry, nitpicking ..) a few typos ("anopther" and "unshhare").

> user sharing the pmd, huge_pmd_unshare will clear pud pointing to the
> pmd. For the task in the middle of the page fault, the ptep returned by
> huge_pte_alloc points to another task's page table or worse. This leads
> to bad things such as incorrect page map/reference counts or invalid
> memory references.
> 
> i_mmap_rwsem is currently used for pmd sharing synchronization. It is also
> held during unmap and whenever a call to huge_pmd_unshare is possible. It
> is only acquired in write mode. Expand and modify the use of i_mmap_rwsem
> as follows:
> - i_mmap_rwsem is held in write mode for the duration of truncate
>   processing.
> - i_mmap_rwsem is held in write mode whenever huge_pmd_share is called.

I guess you mean huge_pmd_unshare here, right?

> - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
>   Today that is only via huge_pte_alloc.
> - i_mmap_rwsem is held in read mode after huge_pte_alloc, until the caller
>   is finished with the returned ptep.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
>  fs/hugetlbfs/inode.c | 21 ++++++++++----
>  mm/hugetlb.c         | 65 +++++++++++++++++++++++++++++++++-----------
>  mm/rmap.c            | 10 +++++++
>  mm/userfaultfd.c     | 11 ++++++--
>  4 files changed, 84 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 32920a10100e..6ee97622a231 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -426,10 +426,16 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>  			u32 hash;
> 
>  			index = page->index;
> -			hash = hugetlb_fault_mutex_hash(h, current->mm,
> +			/*
> +			 * No need to take fault mutex for truncation as we
> +			 * are synchronized via i_mmap_rwsem.
> +			 */
> +			if (!truncate_op) {
> +				hash = hugetlb_fault_mutex_hash(h, current->mm,
> 							&pseudo_vma,
> 							mapping, index, 0);
> -			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +				mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +			}
> 
>  			/*
>  			 * If page is mapped, it was faulted in after being
> @@ -470,7 +476,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>  			}
> 
>  			unlock_page(page);
> -			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +			if (!truncate_op)
> +				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  		}
>  		huge_pagevec_release(&pvec);
>  		cond_resched();
> @@ -505,8 +512,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>  	i_mmap_lock_write(mapping);
>  	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
>  		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
> -	i_mmap_unlock_write(mapping);
>  	remove_inode_hugepages(inode, offset, LLONG_MAX);
> +	i_mmap_unlock_write(mapping);

I have the impression that hugetlbfs_punch_hole() could have a similar race,
and that extending the lock range there could be an improvement as well,
although I might be missing something as always.

>  	return 0;
>  }
> 
> @@ -624,7 +631,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		/* addr is the offset within the file (zero based) */
>  		addr = index * hpage_size;
> 
> -		/* mutex taken here, fault path and hole punch */
> +		/*
> +		 * fault mutex taken here, protects against fault path
> +		 * and hole punch. inode_lock previously taken protects
> +		 * against truncation.
> +		 */
>  		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
> 						index, addr);
>  		mutex_lock(&hugetlb_fault_mutex_table[hash]);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7b5c0ad9a6bd..e9da3eee262f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3252,18 +3252,33 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> 
>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
>  		spinlock_t *src_ptl, *dst_ptl;
> +		struct vm_area_struct *dst_vma;
> +		struct address_space *mapping;
> +
>  		src_pte = huge_pte_offset(src, addr, sz);
>  		if (!src_pte)
>  			continue;
> +
> +		/*
> +		 * i_mmap_rwsem must be held to call huge_pte_alloc.
> +		 * Continue to hold until finished with dst_pte, otherwise
> +		 * it could go away if part of a shared pmd.
> +		 */
> +		dst_vma = find_vma(dst, addr);
> +		mapping = dst_vma->vm_file->f_mapping;

If vma->vm_file->f_mapping gives the same mapping, you may omit the find_vma()?
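Something like the following untested fragment is what I have in mind; it
assumes the vma passed to copy_hugetlb_page_range() and the child's new vma
point at the same struct file, and therefore the same address_space:

	/*
	 * Untested sketch: take the mapping from the vma we already have,
	 * so the dst_vma local and the find_vma() call are not needed.
	 */
	mapping = vma->vm_file->f_mapping;
	i_mmap_lock_read(mapping);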
> +		i_mmap_lock_read(mapping);
>  		dst_pte = huge_pte_alloc(dst, addr, sz);
>  		if (!dst_pte) {
> +			i_mmap_unlock_read(mapping);
>  			ret = -ENOMEM;
>  			break;
>  		}
> 
>  		/* If the pagetables are shared don't copy or take references */
> -		if (dst_pte == src_pte)
> +		if (dst_pte == src_pte) {
> +			i_mmap_unlock_read(mapping);
>  			continue;
> +		}
> 
>  		dst_ptl = huge_pte_lock(h, dst, dst_pte);
>  		src_ptl = huge_pte_lockptr(h, src, src_pte);

[...]

> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1e79fac3186b..db49e734dda8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1347,6 +1347,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	bool ret = true;
>  	unsigned long start = address, end;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
> +	bool pmd_sharing_possible = false;
> 
>  	/* munlock has nothing to gain from examining un-locked vmas */
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> @@ -1376,8 +1377,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		 * accordingly.
>  		 */
>  		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
> +		if ((end - start) > (PAGE_SIZE << compound_order(page)))
> +			pmd_sharing_possible = true;

A similar check is presumably already done inside
adjust_range_if_pmd_sharing_possible(), as the function name suggests, so
would it make more sense to get this bool value via its return value?
(I put an untested sketch of what I mean at the end of this mail.)

Thanks,
Naoya Horiguchi
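P.S. Here is a rough, untested sketch of the interface change I was thinking
of above. The bool return value is only my assumption (the helper currently
returns void); the internal PUD-range logic would stay as it is:

	/*
	 * Sketch: return true when *start/*end were adjusted because pmd
	 * sharing is possible for this vma and range.
	 */
	bool adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
				unsigned long *start, unsigned long *end);

and then try_to_unmap_one() could simply do:

	pmd_sharing_possible =
		adjust_range_if_pmd_sharing_possible(vma, &start, &end);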