On 08/24/22 10:57, Mike Kravetz wrote:
> When page fault code needs to allocate and instantiate a new hugetlb
> page (hugetlb_no_page), it checks early to determine if the fault is
> beyond i_size. When discovered early, it is easy to abort the fault and
> return an error. However, it becomes much more difficult to handle when
> discovered later after allocating the page and consuming reservations
> and adding to the page cache. Backing out changes in such instances
> becomes difficult and error prone.
>
> Instead of trying to catch and back out all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults. The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch). Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
>   fault. After fault code takes mutex, it will notice fault beyond
>   i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
>   checks in fault code. fault code will add page beyond i_size.
>   When truncate code takes mutex for page/index, it will remove the
>   page.
> - truncate updates i_size, but fault code obtains mutex first. If
>   fault code sees updated i_size it will abort early. If fault code
>   does not see updated i_size, it will add page beyond i_size and
>   truncate code will remove page when it obtains fault mutex.
>
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups. For indices not returned in
> the bulk lookup, it will need to lookup individual folios to check for
> races with page fault.
>
<snip>
>  /*
>   * remove_inode_hugepages handles two distinct cases: truncation and hole
>   * punch. There are subtle differences in operation for each case.
> @@ -418,11 +507,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>   * truncation is indicated by end of range being LLONG_MAX
>   *	In this case, we first scan the range and release found pages.
>   *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
> - *	maps and global counts. Page faults can not race with truncation
> - *	in this routine. hugetlb_no_page() prevents page faults in the
> - *	truncated range. It checks i_size before allocation, and again after
> - *	with the page table lock for the page held. The same lock must be
> - *	acquired to unmap a page.
> + *	maps and global counts. Page faults can race with truncation.
> + *	During faults, hugetlb_no_page() checks i_size before page allocation,
> + *	and again after obtaining page table lock. It will 'back out'
> + *	allocations in the truncated range.
>   * hole punch is indicated if end is not LLONG_MAX
>   *	In the hole punch case we scan the range and release found pages.
>   *	Only when releasing a page is the associated region/reserve map
> @@ -431,75 +519,69 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>   *	This is indicated if we find a mapped page.
>   * Note: If the passed end of range value is beyond the end of file, but
>   * not LLONG_MAX this routine still performs a hole punch operation.
> + *
> + * Since page faults can race with this routine, care must be taken as both
> + * modify huge page reservation data. To somewhat synchronize these operations
> + * the hugetlb fault mutex is taken for EVERY index in the range to be hole
> + * punched or truncated. In this way, we KNOW either:
> + * - fault code has added a page beyond i_size, and we will remove here
> + * - fault code will see updated i_size and not add a page beyond
> + * The parameter 'lm__end' indicates the offset of the end of hole or file
> + * before truncation. For hole punch lm_end == lend.
>   */
>  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> -					loff_t lend)
> +					loff_t lend, loff_t lm_end)
>  {
>  	struct hstate *h = hstate_inode(inode);
>  	struct address_space *mapping = &inode->i_data;
>  	const pgoff_t start = lstart >> huge_page_shift(h);
>  	const pgoff_t end = lend >> huge_page_shift(h);
> +	pgoff_t m_end = lm_end >> huge_page_shift(h);
> +	pgoff_t m_start, m_index;
>  	struct folio_batch fbatch;
> +	struct folio *folio;
>  	pgoff_t next, index;
> -	int i, freed = 0;
> +	unsigned int i;
> +	long freed = 0;
> +	u32 hash;
>  	bool truncate_op = (lend == LLONG_MAX);
>
>  	folio_batch_init(&fbatch);
> -	next = start;
> +	next = m_start = start;
>  	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
>  		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> -			struct folio *folio = fbatch.folios[i];
> -			u32 hash = 0;
> +			folio = fbatch.folios[i];
>
>  			index = folio->index;
> -			hash = hugetlb_fault_mutex_hash(mapping, index);
> -			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> -
>  			/*
> -			 * If folio is mapped, it was faulted in after being
> -			 * unmapped in caller. Unmap (again) now after taking
> -			 * the fault mutex. The mutex will prevent faults
> -			 * until we finish removing the folio.
> -			 *
> -			 * This race can only happen in the hole punch case.
> -			 * Getting here in a truncate operation is a bug.
> +			 * Take fault mutex for missing folios before index,
> +			 * while checking folios that might have been added
> +			 * due to a race with fault code.
>  			 */
> -			if (unlikely(folio_mapped(folio))) {
> -				BUG_ON(truncate_op);
> -
> -				i_mmap_lock_write(mapping);
> -				hugetlb_vmdelete_list(&mapping->i_mmap,
> -					index * pages_per_huge_page(h),
> -					(index + 1) * pages_per_huge_page(h),
> -					ZAP_FLAG_DROP_MARKER);
> -				i_mmap_unlock_write(mapping);
> -			}
> +			freed += fault_lock_inode_indicies(h, inode, mapping,
> +					m_start, m_index, truncate_op);

This should be 'index' instead of 'm_index' as discovered here:
https://lore.kernel.org/linux-mm/CA+G9fYsHVdu0toduQqk6vsR8Z8mOVzZ9-_p3O5fjQ5mOpSxsDA@xxxxxxxxxxxxxx/
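
To make the intent of that call site a bit clearer, here is a rough sketch
of the corrected inner loop, using the local variables declared in the
quoted version of remove_inode_hugepages. It is illustrative only, not the
literal patch: the removal of the folio itself is elided, I am assuming the
helper treats its end argument as exclusive, and the way m_start advances
(and the handling of the remaining range up to m_end after the loop) is my
reading of code not shown in the quoted hunk.

	next = m_start = start;
	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
			folio = fbatch.folios[i];
			index = folio->index;

			/*
			 * Take the fault mutex for each index in
			 * [m_start, index) that the bulk lookup skipped,
			 * checking for folios a racing fault may have
			 * added there.  Note 'index', not 'm_index'.
			 */
			freed += fault_lock_inode_indicies(h, inode, mapping,
						m_start, index, truncate_op);

			/*
			 * Take the fault mutex for 'index' itself and
			 * remove the folio (elided here), then advance
			 * the start of the next gap of missing indices.
			 */
			m_start = index + 1;
		}
		folio_batch_release(&fbatch);
	}
	/* presumably the tail up to m_end is covered the same way here */

--
Mike Kravetz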