On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@xxxxxxxxxxxx>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved
> by simply not replacing present pages with hpage when iterating over
> the target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that the collapse
> would break. Similar to hpage, there is no reason that the present
> pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> a fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>
> ---
>  mm/khugepaged.c | 79 ++++++++++++++++++-------------------------------
>  1 file changed, 29 insertions(+), 50 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7679551e9540..a19aa140fd52 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>   *
>   * Basic scheme is simple, details are more complex:
>   *  - allocate and lock a new huge page;
> - *  - scan page cache replacing old pages with the new one
> + *  - scan page cache, locking old pages
>   *    + swap/gup in pages if necessary;
> - *    + keep old pages around in case rollback is required;
> + *  - copy data to new page
> + *  - handle shmem holes
> + *    + re-validate that holes weren't filled by someone else
> + *    + check for userfaultfd

PS: some of the changes may belong to previous patch here, but not
necessary to repost only for this, just in case there'll be a new one.

>   *  - finalize updates to the page cache;
>   *  - if replacing succeeds:
> - *    + copy data over;
> - *    + free old pages;
>   *    + unlock huge page;
> + *    + free old pages;
>   *  - if replacing failed;
> - *    + put all pages back and unfreeze them;
> - *    + restore gaps in the page cache;
> + *    + unlock old pages
>   *    + unlock and free huge page;
>   */
>  static int collapse_file(struct mm_struct *mm, unsigned long addr,
> @@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  	} while (1);
>
> -	/*
> -	 * At this point the hpage is locked and not up-to-date.
> -	 * It's safe to insert it into the page cache, because nobody would
> -	 * be able to map it or use it in another way until we unlock it.
> -	 */
> -
>  	xas_set(&xas, start);
>  	for (index = start; index < end; index++) {
>  		page = xas_next(&xas);
> @@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		VM_BUG_ON_PAGE(page != xas_load(&xas), page);
>
>  		/*
> -		 * The page is expected to have page_count() == 3:
> +		 * We control three references to the page:
>  		 *  - we hold a pin on it;
>  		 *  - one reference from page cache;
>  		 *  - one from isolate_lru_page;
> +		 * If those are the only references, then any new usage of the
> +		 * page will have to fetch it from the page cache. That requires
> +		 * locking the page to handle truncate, so any new usage will be
> +		 * blocked until we unlock page after collapse/during rollback.
>  		 */
> -		if (!page_ref_freeze(page, 3)) {
> +		if (page_count(page) != 3) {
>  			result = SCAN_PAGE_COUNT;
>  			xas_unlock_irq(&xas);
>  			putback_lru_page(page);

Personally I don't see anything wrong with this change to resolve the
deadlock.  E.g. a fast-gup race right before unmapping the page tables
seems fine, since we'll just bail out with >3 refcounts (or fast-gup
bails out by checking pte changes).  Either way looks fine here.

So far it looks good to me, but that may not mean much per the history
of what I can overlook.  It'll always be good to hear from Hugh and
others.

-- 
Peter Xu