On Tue, 4 Apr 2023, Peter Xu wrote: > On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote: > > From: David Stevens <stevensd@xxxxxxxxxxxx> > > > > Make sure that collapse_file doesn't interfere with checking the > > uptodate flag in the page cache by only inserting hpage into the page > > cache after it has been updated and marked uptodate. This is achieved by > > simply not replacing present pages with hpage when iterating over the > > target range. > > > > The present pages are already locked, so replacing them with the locked > > hpage before the collapse is finalized is unnecessary. However, it is > > necessary to stop freezing the present pages after validating them, > > since leaving long-term frozen pages in the page cache can lead to > > deadlocks. Simply checking the reference count is sufficient to ensure > > that there are no long-term references hanging around that would the > > collapse would break. Similar to hpage, there is no reason that the > > present pages actually need to be frozen in addition to being locked. > > > > This fixes a race where folio_seek_hole_data would mistake hpage for > > an fallocated but unwritten page. This race is visible to userspace via > > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes > > a similar race where pages could temporarily disappear from mincore. > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx> ... > > Personally I don't see anything wrong with this change to resolve the dead > lock. E.g. fast gup race right before unmapping the pgtables seems fine, > since we'll just bail out with >3 refcounts (or fast-gup bails out by > checking pte changes). Either way looks fine here. > > So far it looks good to me, but that may not mean much per the history on > what I can overlook. It'll be always good to hear from Hugh and others. I'm uneasy about it, and haven't let it sink in for long enough: but haven't spotted anything wrong with it, nor experienced any trouble. I would have much preferred David to stick with the current scheme, and fix up seek_hole_data, and be less concerned with the mincore transients: this patch makes a significant change that is difficult to be sure of. I was dubious about the unfrozen "page_count(page) != 3" check (where another task can grab a reference an instant later), but perhaps it does serve a purpose, since we hold the page lock there: excludes concurrent shmem reads which grab but drop page lock before copying (though it's not clear that those do actually need excluding). I had thought shmem was peculiar in relying on page lock while writing, but turn out to be quite wrong about that: most filesystems rely on page lock while writing, though I'm not sure whether that's true of all (and it doesn't matter while collapse of non-shmem file is only permitted on read-only). We shall see. Hugh