Hi,

On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> From: David Stevens <stevensd@xxxxxxxxxxxx>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that would the
> collapse would break. Similar to hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> an fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>

I noticed that recently MADV_COLLAPSE stopped being able to collapse a
binary's executable code, always failing with EAGAIN. I bisected it down
to a2e17cc2efc7 - this commit.
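For context, a minimal sketch of the kind of call being made. It is not a standalone repro of the regression. It rounds a range (for example, a mapped text segment) inward to huge page boundaries and calls madvise(MADV_COLLAPSE), returning the errno on failure. Assumptions: 2 MiB PMD-sized huge pages as on x86-64; the helper names align_to_hpage and collapse_aligned_range are hypothetical; and MADV_COLLAPSE falls back to the value 25 from the Linux 6.1+ uapi headers if libc does not define it.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* assumption: value from Linux >= 6.1 uapi headers */
#endif

/* Assumption: 2 MiB PMD-sized huge pages, as on x86-64. */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: shrink [start, end) to the largest sub-range whose
 * bounds are HPAGE_SIZE-aligned. Returns 0 if no whole huge page fits.
 */
static int align_to_hpage(uintptr_t start, uintptr_t end,
                          uintptr_t *astart, uintptr_t *aend)
{
    uintptr_t s = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1); /* round up */
    uintptr_t e = end & ~(HPAGE_SIZE - 1);                      /* round down */

    if (e <= s)
        return 0;
    *astart = s;
    *aend = e;
    return 1;
}

/*
 * Hypothetical helper: ask the kernel to collapse the aligned part of
 * [start, start + len) into huge pages. Returns 0 on success, the errno
 * from madvise() on failure (EAGAIN in the report above), or EINVAL
 * without calling madvise() if the range holds no whole huge page.
 */
static int collapse_aligned_range(void *start, size_t len)
{
    uintptr_t s, e;

    if (!align_to_hpage((uintptr_t)start, (uintptr_t)start + len, &s, &e))
        return EINVAL;
    if (madvise((void *)s, e - s, MADV_COLLAPSE) != 0)
        return errno;
    return 0;
}
```

Rounding inward rather than outward keeps the request inside the caller's mapping, so only huge pages that lie entirely within the range are candidates for collapse.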
Using perf trace -e 'huge_memory:*' -a, I see

  1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)

for every attempt at doing madvise(MADV_COLLAPSE).

I'm sad about that, because MADV_COLLAPSE was the first thing that allowed
using huge pages for executable code that wasn't entirely completely gross.

I don't yet have a standalone repro, but can write one if that's helpful.

Greetings,

Andres Freund