On Wed, Nov 25, 2020 at 3:09 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Wed, 25 Nov 2020 02:32:34 +0000 Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > On Tue, Nov 17, 2020 at 11:43:02PM +0000, Matthew Wilcox wrote: > > > On Tue, Nov 17, 2020 at 07:15:13PM +0000, Matthew Wilcox wrote: > > > > I find both of these functions exceptionally confusing. Does this > > > > make it easier to understand? > > > > > > Never mind, this is buggy. I'll send something better tomorrow. > > > > That took a week, not a day. *sigh*. At least this is shorter. > > > > commit 1a02863ce04fd325922d6c3db6d01e18d55f966b > > Author: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx> > > Date: Tue Nov 17 10:45:18 2020 -0500 > > > > fix mm-truncateshmem-handle-truncates-that-split-thps.patch > > > > That's a big patch. Big enough to put > mm-truncateshmem-handle-truncates-that-split-thps.patch back into > unreviewed state, methinks. And big enough to have a changelog! > > Below is the folded-together result for reviewers, please. I have not reviewed the changes at all, but have been testing. Responding hastily in gmail, which will probably garble the result (sorry): because I think you may be working towards another mmotm, and there's one little fix definitely needed, but the machine I usually mail patches from is in a different hang (running with this patch) that I need to examine before rebooting - but probably not something that the average user will ever encounter. In general, this series behaves a lot better than it did nine days ago: LTP results on shmem huge pages match how they were before the series, only one of xfstests fails which did not fail before (generic/539 - not yet analysed, may be of no importance), and until that hang there were no problems seen in running my tmpfs swapping loads. Though I did see generic/476 stuck in shmem_undo_range() on one machine, and will need to reproduce and investigate that. The little fix definitely needed was shown by generic/083: each fsstress waiting for page lock, happens even without forcing huge pages. See below... > Is the changelog still accurate and complete? > > > From: "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx> > Subject: mm/truncate,shmem: handle truncates that split THPs > > Handle THP splitting in the parts of the truncation functions which > already handle partial pages. Factor all that code out into a new > function called truncate_inode_partial_page(). > > We lose the easy 'bail out' path if a truncate or hole punch is entirely > within a single page. We can add some more complex logic to restore the > optimisation if it proves to be worthwhile. > > [willy@xxxxxxxxxxxxx: fix] > Link: https://lkml.kernel.org/r/20201125023234.GH4327@xxxxxxxxxxxxxxxxxxxx > Link: https://lkml.kernel.org/r/20201112212641.27837-16-willy@xxxxxxxxxxxxx > Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx> > Reviewed-by: Jan Kara <jack@xxxxxxx> > Reviewed-by: William Kucharski <william.kucharski@xxxxxxxxxx> > Cc: Christoph Hellwig <hch@xxxxxx> > Cc: Dave Chinner <dchinner@xxxxxxxxxx> > Cc: Hugh Dickins <hughd@xxxxxxxxxx> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx> > Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> > Cc: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > --- > > mm/internal.h | 2 > mm/shmem.c | 137 +++++++++++++---------------------------- > mm/truncate.c | 157 +++++++++++++++++++++++------------------------- > 3 files changed, 122 insertions(+), 174 deletions(-) > > --- a/mm/internal.h~mm-truncateshmem-handle-truncates-that-split-thps > +++ a/mm/internal.h > @@ -623,4 +623,6 @@ struct migration_target_control { > gfp_t gfp_mask; > }; > > +pgoff_t truncate_inode_partial_page(struct address_space *mapping, > + struct page *page, loff_t start, loff_t end); > #endif /* __MM_INTERNAL_H */ > --- a/mm/shmem.c~mm-truncateshmem-handle-truncates-that-split-thps > +++ a/mm/shmem.c > @@ -858,32 +858,6 @@ void shmem_unlock_mapping(struct address > } > > /* > - * Check whether a hole-punch or truncation needs to split a huge page, > - * returning true if no split was required, or the split has been successful. > - * > - * Eviction (or truncation to 0 size) should never need to split a huge page; > - * but in rare cases might do so, if shmem_undo_range() failed to trylock on > - * head, and then succeeded to trylock on tail. > - * > - * A split can only succeed when there are no additional references on the > - * huge page: so the split below relies upon find_get_entries() having stopped > - * when it found a subpage of the huge page, without getting further references. > - */ > -static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end) > -{ > - if (!PageTransCompound(page)) > - return true; > - > - /* Just proceed to delete a huge page wholly within the range punched */ > - if (PageHead(page) && > - page->index >= start && page->index + HPAGE_PMD_NR <= end) > - return true; > - > - /* Try to split huge page, so we can truly punch the hole or truncate */ > - return split_huge_page(page) >= 0; > -} > - > -/* > * Remove range of pages and swap entries from page cache, and free them. > * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. > */ > @@ -892,26 +866,33 @@ static void shmem_undo_range(struct inod > { > struct address_space *mapping = inode->i_mapping; > struct shmem_inode_info *info = SHMEM_I(inode); > - pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT; > - pgoff_t end = (lend + 1) >> PAGE_SHIFT; > - unsigned int partial_start = lstart & (PAGE_SIZE - 1); > - unsigned int partial_end = (lend + 1) & (PAGE_SIZE - 1); > + pgoff_t start, end; > struct pagevec pvec; > pgoff_t indices[PAGEVEC_SIZE]; > + struct page *page; > long nr_swaps_freed = 0; > pgoff_t index; > int i; > > - if (lend == -1) > - end = -1; /* unsigned, so actually very big */ > + page = NULL; > + start = lstart >> PAGE_SHIFT; > + shmem_getpage(inode, start, &page, SGP_READ); > + if (page) { > + page = thp_head(page); > + set_page_dirty(page); > + start = truncate_inode_partial_page(mapping, page, lstart, > + lend); > + } > + > + /* 'end' includes a partial page */ > + end = lend / PAGE_SIZE; > > pagevec_init(&pvec); > index = start; > while (index < end && find_lock_entries(mapping, index, end - 1, > - &pvec, indices)) { > + &pvec, indices)) { > for (i = 0; i < pagevec_count(&pvec); i++) { > - struct page *page = pvec.pages[i]; > - > + page = pvec.pages[i]; > index = indices[i]; > > if (xa_is_value(page)) { > @@ -921,8 +902,6 @@ static void shmem_undo_range(struct inod > index, page); > continue; > } > - index += thp_nr_pages(page) - 1; > - > if (!unfalloc || !PageUptodate(page)) > truncate_inode_page(mapping, page); > unlock_page(page); > @@ -933,90 +912,60 @@ static void shmem_undo_range(struct inod > index++; > } > > - if (partial_start) { > - struct page *page = NULL; > - shmem_getpage(inode, start - 1, &page, SGP_READ); > - if (page) { > - unsigned int top = PAGE_SIZE; > - if (start > end) { > - top = partial_end; > - partial_end = 0; > - } > - zero_user_segment(page, partial_start, top); > - set_page_dirty(page); > - unlock_page(page); > - put_page(page); > - } > - } > - if (partial_end) { > - struct page *page = NULL; > - shmem_getpage(inode, end, &page, SGP_READ); > - if (page) { > - zero_user_segment(page, 0, partial_end); > - set_page_dirty(page); > - unlock_page(page); > - put_page(page); > - } > - } > - if (start >= end) > - return; > - > index = start; > - while (index < end) { > + while (index <= end) { > cond_resched(); > > - if (!find_get_entries(mapping, index, end - 1, &pvec, > - indices)) { > + if (!find_get_entries(mapping, index, end, &pvec, indices)) { > /* If all gone or hole-punch or unfalloc, we're done */ > - if (index == start || end != -1) > + if (index == start || lend != (loff_t)-1) > break; > /* But if truncating, restart to make sure all gone */ > index = start; > continue; > } > + > for (i = 0; i < pagevec_count(&pvec); i++) { > - struct page *page = pvec.pages[i]; > + page = pvec.pages[i]; > > - index = indices[i]; > if (xa_is_value(page)) { > if (unfalloc) > continue; > - if (shmem_free_swap(mapping, index, page)) { > - /* Swap was replaced by page: retry */ > - index--; > - break; > + index = indices[i]; > + if (index == end) { > + /* Partial page swapped out? */ > + shmem_getpage(inode, end, &page, > + SGP_READ); > + } else { > + if (shmem_free_swap(mapping, index, > + page)) { > + /* Swap replaced: retry */ > + break; > + } > + nr_swaps_freed++; > + continue; > } > - nr_swaps_freed++; > - continue; > + } else { > + lock_page(page); > } > > - lock_page(page); > - > if (!unfalloc || !PageUptodate(page)) { > if (page_mapping(page) != mapping) { > /* Page was replaced by swap: retry */ > unlock_page(page); > - index--; > + put_page(page); > break; > } > VM_BUG_ON_PAGE(PageWriteback(page), page); > - if (shmem_punch_compound(page, start, end)) > - truncate_inode_page(mapping, page); > - else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { > - /* Wipe the page and don't get stuck */ > - clear_highpage(page); > - flush_dcache_page(page); > - set_page_dirty(page); > - if (index < > - round_up(start, HPAGE_PMD_NR)) > - start = index + 1; > - } > + index = truncate_inode_partial_page(mapping, > + page, lstart, lend); > + if (index > end) > + end = indices[i] - 1; > } > - unlock_page(page); The fix needed is here: instead of deleting that unlock_page(page) line, it needs to be } else { unlock_page(page); } > } > + index = indices[i - 1] + 1; > pagevec_remove_exceptionals(&pvec); > - pagevec_release(&pvec); > - index++; > + pagevec_reinit(&pvec); > } > > spin_lock_irq(&info->lock); > --- a/mm/truncate.c~mm-truncateshmem-handle-truncates-that-split-thps > +++ a/mm/truncate.c > @@ -225,6 +225,63 @@ int truncate_inode_page(struct address_s > } > > /* > + * Handle partial (transparent) pages. If the page is entirely within > + * the range, we discard it. If not, we split the page if it's a THP > + * and zero the part of the page that's within the [start, end] range. > + * split_page_range() will discard any of the subpages which now lie > + * beyond i_size, and the caller will discard pages which lie within a > + * newly created hole. > + * > + * Return: The index after the current page. > + */ > +pgoff_t truncate_inode_partial_page(struct address_space *mapping, > + struct page *page, loff_t start, loff_t end) > +{ > + loff_t pos = page_offset(page); > + pgoff_t next_index = page->index + thp_nr_pages(page); > + unsigned int offset, length; > + > + if (pos < start) > + offset = start - pos; > + else > + offset = 0; > + length = thp_size(page); > + if (pos + length <= (u64)end) > + length = length - offset; > + else > + length = end + 1 - pos - offset; > + > + wait_on_page_writeback(page); > + if (length == thp_size(page)) > + goto truncate; > + > + cleancache_invalidate_page(page->mapping, page); > + if (page_has_private(page)) > + do_invalidatepage(page, offset, length); > + if (!PageTransHuge(page)) > + goto zero; > + page += offset / PAGE_SIZE; > + if (split_huge_page(page) < 0) { > + page -= offset / PAGE_SIZE; > + goto zero; > + } > + next_index = page->index + 1; > + offset %= PAGE_SIZE; > + if (offset == 0 && length >= PAGE_SIZE) > + goto truncate; > + length = PAGE_SIZE - offset; > +zero: > + zero_user(page, offset, length); > + goto out; > +truncate: > + truncate_inode_page(mapping, page); > +out: > + unlock_page(page); > + put_page(page); > + return next_index; > +} > + > +/* > * Used to get rid of pages on hardware memory corruption. > */ > int generic_error_remove_page(struct address_space *mapping, struct page *page) > @@ -275,10 +332,6 @@ int invalidate_inode_page(struct page *p > * The first pass will remove most pages, so the search cost of the second pass > * is low. > * > - * We pass down the cache-hot hint to the page freeing code. Even if the > - * mapping is large, it is probably the case that the final pages are the most > - * recently touched, and freeing happens in ascending file offset order. > - * > * Note that since ->invalidatepage() accepts range to invalidate > * truncate_inode_pages_range is able to handle cases where lend + 1 is not > * page aligned properly. > @@ -286,38 +339,24 @@ int invalidate_inode_page(struct page *p > void truncate_inode_pages_range(struct address_space *mapping, > loff_t lstart, loff_t lend) > { > - pgoff_t start; /* inclusive */ > - pgoff_t end; /* exclusive */ > - unsigned int partial_start; /* inclusive */ > - unsigned int partial_end; /* exclusive */ > + pgoff_t start, end; > struct pagevec pvec; > pgoff_t indices[PAGEVEC_SIZE]; > pgoff_t index; > int i; > + struct page * page; > > if (mapping->nrpages == 0 && mapping->nrexceptional == 0) > goto out; > > - /* Offsets within partial pages */ > - partial_start = lstart & (PAGE_SIZE - 1); > - partial_end = (lend + 1) & (PAGE_SIZE - 1); > + start = lstart >> PAGE_SHIFT; > + page = find_lock_head(mapping, start); > + if (page) > + start = truncate_inode_partial_page(mapping, page, lstart, > + lend); > > - /* > - * 'start' and 'end' always covers the range of pages to be fully > - * truncated. Partial pages are covered with 'partial_start' at the > - * start of the range and 'partial_end' at the end of the range. > - * Note that 'end' is exclusive while 'lend' is inclusive. > - */ > - start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT; > - if (lend == -1) > - /* > - * lend == -1 indicates end-of-file so we have to set 'end' > - * to the highest possible pgoff_t and since the type is > - * unsigned we're using -1. > - */ > - end = -1; > - else > - end = (lend + 1) >> PAGE_SHIFT; > + /* 'end' includes a partial page */ > + end = lend / PAGE_SIZE; > > pagevec_init(&pvec); > index = start; > @@ -334,50 +373,11 @@ void truncate_inode_pages_range(struct a > cond_resched(); > } > > - if (partial_start) { > - struct page *page = find_lock_page(mapping, start - 1); > - if (page) { > - unsigned int top = PAGE_SIZE; > - if (start > end) { > - /* Truncation within a single page */ > - top = partial_end; > - partial_end = 0; > - } > - wait_on_page_writeback(page); > - zero_user_segment(page, partial_start, top); > - cleancache_invalidate_page(mapping, page); > - if (page_has_private(page)) > - do_invalidatepage(page, partial_start, > - top - partial_start); > - unlock_page(page); > - put_page(page); > - } > - } > - if (partial_end) { > - struct page *page = find_lock_page(mapping, end); > - if (page) { > - wait_on_page_writeback(page); > - zero_user_segment(page, 0, partial_end); > - cleancache_invalidate_page(mapping, page); > - if (page_has_private(page)) > - do_invalidatepage(page, 0, > - partial_end); > - unlock_page(page); > - put_page(page); > - } > - } > - /* > - * If the truncation happened within a single page no pages > - * will be released, just zeroed, so we can bail out now. > - */ > - if (start >= end) > - goto out; > - > index = start; > - for ( ; ; ) { > + while (index <= end) { > cond_resched(); > - if (!find_get_entries(mapping, index, end - 1, &pvec, > - indices)) { > + > + if (!find_get_entries(mapping, index, end, &pvec, indices)) { > /* If all gone from start onwards, we're done */ > if (index == start) > break; > @@ -387,23 +387,20 @@ void truncate_inode_pages_range(struct a > } > > for (i = 0; i < pagevec_count(&pvec); i++) { > - struct page *page = pvec.pages[i]; > - > - /* We rely upon deletion not changing page->index */ > - index = indices[i]; > - > + page = pvec.pages[i]; > if (xa_is_value(page)) > continue; > > lock_page(page); > - WARN_ON(page_to_index(page) != index); > - wait_on_page_writeback(page); > - truncate_inode_page(mapping, page); > - unlock_page(page); > + index = truncate_inode_partial_page(mapping, page, > + lstart, lend); > + /* Couldn't split a THP? */ > + if (index > end) > + end = indices[i] - 1; > } > + index = indices[i - 1] + 1; > truncate_exceptional_pvec_entries(mapping, &pvec, indices); > - pagevec_release(&pvec); > - index++; > + pagevec_reinit(&pvec); > } > > out: > _ >