Most hugetlb fault handling code checks for faults beyond i_size.  While
there are early checks in the code paths, the most difficult to handle
are those discovered after taking the page table lock.  At this point,
we have possibly allocated a page and consumed associated reservations
and possibly added the page to the page cache.

When discovering a fault beyond i_size, be sure to:
- Remove the page from the page cache, else it will sit there until the
  file is removed.
- Do not restore any reservation for the page consumed.  Otherwise,
  there will be an outstanding reservation for an offset beyond the
  end of file.

The 'truncation' code in remove_inode_hugepages must deal with fault
code potentially removing a page/folio from the cache after the page
was returned by filemap_get_folios and before locking the page.  This
can be discovered by a change in folio_mapping() after taking the folio
lock.  In addition, this code must deal with fault code potentially
consuming and returning reservations.  To synchronize this,
remove_inode_hugepages will now take the fault mutex for ALL indices in
the hole or truncated range.  In this way, it KNOWS fault code has
finished with the page/index OR fault code will see the updated file
size.

Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
---
 fs/hugetlbfs/inode.c | 88 ++++++++++++++++++++++++++++++--------------
 mm/hugetlb.c         | 39 +++++++++++++++-----
 2 files changed, 90 insertions(+), 37 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a878c672cf6d..31bd4325fce5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -443,11 +443,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  * truncation is indicated by end of range being LLONG_MAX
  * In this case, we first scan the range and release found pages.
  * After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- * maps and global counts.  Page faults can not race with truncation
- * in this routine.  hugetlb_no_page() prevents page faults in the
- * truncated range.  It checks i_size before allocation, and again after
- * with the page table lock for the page held.  The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.  Page faults can race with truncation.
+ * During faults, hugetlb_no_page() checks i_size before page allocation,
+ * and again after obtaining page table lock.  It will 'back out'
+ * allocations in the truncated range.
  * hole punch is indicated if end is not LLONG_MAX
  * In the hole punch case we scan the range and release found pages.
  * Only when releasing a page is the associated region/reserve map
@@ -456,27 +455,46 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  * This is indicated if we find a mapped page.
  * Note: If the passed end of range value is beyond the end of file, but
  * not LLONG_MAX this routine still performs a hole punch operation.
+ *
+ * Since page faults can race with this routine, care must be taken as both
+ * modify huge page reservation data.  To somewhat synchronize these
+ * operations, the hugetlb fault mutex is taken for EVERY index in the range
+ * to be hole punched or truncated.  In this way, we KNOW fault code will
+ * either have completed backout operations under the mutex, or fault code
+ * will see the updated file size and not allocate a page for offsets
+ * beyond truncated size.
+ * The parameter 'lm_end' indicates the offset of the end of hole or file
+ * before truncation.  For hole punch lm_end == lend.
  */
 static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
-					loff_t lend)
+					loff_t lend, loff_t lm_end)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
 	const pgoff_t end = lend >> huge_page_shift(h);
+	pgoff_t m_end = lm_end >> huge_page_shift(h);
+	pgoff_t m_start, m_index;
 	struct folio_batch fbatch;
 	pgoff_t next, index;
 	int i, freed = 0;
+	u32 hash;
 	bool truncate_op = (lend == LLONG_MAX);
 
 	folio_batch_init(&fbatch);
-	next = start;
+	next = m_start = start;
 	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
 		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
 			struct folio *folio = fbatch.folios[i];
-			u32 hash = 0;
 
 			index = folio->index;
+			/* Take fault mutex for missing folios before index */
+			for (m_index = m_start; m_index < index; m_index++) {
+				hash = hugetlb_fault_mutex_hash(mapping,
+								m_index);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			}
+			m_start = index + 1;
+
 			hash = hugetlb_fault_mutex_hash(mapping, index);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -485,13 +503,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * unmapped in caller.  Unmap (again) now after taking
 			 * the fault mutex.  The mutex will prevent faults
 			 * until we finish removing the folio.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
 			 */
 			if (unlikely(folio_mapped(folio))) {
-				BUG_ON(truncate_op);
-
 				i_mmap_lock_write(mapping);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
@@ -502,20 +515,30 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			folio_lock(folio);
 			/*
-			 * We must free the huge page and remove from page
-			 * cache BEFORE removing the region/reserve map
-			 * (hugetlb_unreserve_pages).  In rare out of memory
-			 * conditions, removal of the region/reserve map could
-			 * fail.  Correspondingly, the subpool and global
-			 * reserve usage count can need to be adjusted.
+			 * After locking page, make sure mapping is the same.
+			 * We could have raced with page fault populate and
+			 * backout code.
 			 */
-			VM_BUG_ON(HPageRestoreReserve(&folio->page));
-			hugetlb_delete_from_page_cache(&folio->page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
+			if (folio_mapping(folio) == mapping) {
+				/*
+				 * We must free the folio and remove from
+				 * page cache BEFORE removing the region/
+				 * reserve map (hugetlb_unreserve_pages).  In
+				 * rare out of memory conditions, removal of
+				 * the region/reserve map could fail.
+				 * Correspondingly, the subpool and global
+				 * reserve usage count can need to be adjusted.
+				 */
+				VM_BUG_ON(HPageRestoreReserve(&folio->page));
+				hugetlb_delete_from_page_cache(&folio->page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(
+					    hugetlb_unreserve_pages(inode,
 						index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
+						hugetlb_fix_reserve_counts(
+								inode);
+				}
 			}
 
 			folio_unlock(folio);
@@ -525,6 +548,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 		cond_resched();
 	}
 
+	/* Take fault mutex for missing folios at end of range */
+	for (m_index = m_start; m_index < m_end; m_index++) {
+		hash = hugetlb_fault_mutex_hash(mapping, m_index);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	}
+
 	if (truncate_op)
 		(void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
 }
@@ -532,8 +562,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
+	loff_t prev_size = i_size_read(inode);
 
-	remove_inode_hugepages(inode, 0, LLONG_MAX);
+	remove_inode_hugepages(inode, 0, LLONG_MAX, prev_size);
 
 	/*
 	 * Get the resv_map from the address space embedded in the inode.
@@ -553,6 +584,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
+	loff_t prev_size = i_size_read(inode);
 
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
@@ -563,7 +595,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
-	remove_inode_hugepages(inode, offset, LLONG_MAX);
+	remove_inode_hugepages(inode, offset, LLONG_MAX, prev_size);
 }
 
 static void hugetlbfs_zero_partial_page(struct hstate *h,
@@ -635,7 +667,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 	/* Remove full pages from the file. */
 	if (hole_end > hole_start)
-		remove_inode_hugepages(inode, hole_start, hole_end);
+		remove_inode_hugepages(inode, hole_start, hole_end, hole_end);
 
 	inode_unlock(inode);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a9f320c676e4..25f644a3a981 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5491,6 +5491,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool beyond_i_size = false;
+	bool reserve_alloc = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5548,6 +5550,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (HPageRestoreReserve(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(page, mapping, idx);
@@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
+	if (idx >= size) {
+		beyond_i_size = true;
 		goto backout;
+	}
 
 	ret = 0;
 	/* If pte changed from under us, retry */
@@ -5652,10 +5658,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
+	if (new_page) {
+		if (new_pagecache_page)
+			hugetlb_delete_from_page_cache(page);
+
+		/*
+		 * If reserve was consumed, make sure flag is set so that it
+		 * will be restored in free_huge_page().
+		 */
+		if (reserve_alloc)
+			SetHPageRestoreReserve(page);
+
+		/*
+		 * Do not restore reserve map entries beyond i_size.
+		 * Otherwise, there will be leaks when the file is removed.
+		 */
+		if (!beyond_i_size)
+			restore_reserve_on_error(h, vma, haddr, page);
+	}
 	unlock_page(page);
-	/* restore reserve for newly allocated pages not in page cache */
-	if (new_page && !new_pagecache_page)
-		restore_reserve_on_error(h, vma, haddr, page);
 	put_page(page);
 	goto out;
 }
@@ -5975,15 +5996,15 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * Recheck the i_size after holding PT lock to make sure not
 	 * to leave any page mapped (as page_mapped()) beyond the end
 	 * of the i_size (remove_inode_hugepages() is strict about
-	 * enforcing that). If we bail out here, we'll also leave a
-	 * page in the radix tree in the vm_shared case beyond the end
-	 * of the i_size, but remove_inode_hugepages() will take care
-	 * of it as soon as we drop the hugetlb_fault_mutex_table.
+	 * enforcing that). If we bail out here, remove the page
+	 * added to the radix tree.
 	 */
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	ret = -EFAULT;
-	if (idx >= size)
+	if (idx >= size) {
+		hugetlb_delete_from_page_cache(page);
 		goto out_release_unlock;
+	}
 
 	ret = -EEXIST;
 	/*
-- 
2.35.3