On Wed, Feb 08, 2023 at 05:12:06PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 08:39:19AM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 08, 2023 at 02:53:33PM +0000, Matthew Wilcox (Oracle) wrote:
> > > XFS doesn't actually need to be holding the XFS_MMAPLOCK_SHARED
> > > to do this, any more than it needs the XFS_MMAPLOCK_SHARED for a
> > > read() that hits in the page cache.
> > 
> > Hmm.  From commit cd647d5651c0 ("xfs: use MMAPLOCK around
> > filemap_map_pages()"):
> > 
> >   The page faultround path ->map_pages is implemented in XFS via
> >   filemap_map_pages(). This function checks that pages found in page
> >   cache lookups have not raced with truncate based invalidation by
> >   checking page->mapping is correct and page->index is within EOF.
> > 
> >   However, we've known for a long time that this is not sufficient to
> >   protect against races with invalidations done by operations that do
> >   not change EOF. e.g. hole punching and other fallocate() based
> >   direct extent manipulations. The way we protect against these
> >   races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> >   lock so they serialise against fallocate and truncate before calling
> >   into the filemap function that processes the fault.
> > 
> >   Do the same for XFS's ->map_pages implementation to close this
> >   potential data corruption issue.
> > 
> > How do we prevent faultaround from racing with fallocate and reflink
> > calls that operate below EOF?
> 
> I don't understand the commit message.  It'd be nice to have an example
> of what's insufficient about the protection.

When this change was made, "insufficient protection" was a reference to
the rather well known fact we'd been bugging MM developers about for
well over a decade (i.e. since before ->page_mkwrite existed): the
unlocked page invalidation detection hack used everywhere in the page
cache code was broken for page invalidation within EOF, i.e.
that cannot be correctly detected by (page->mapping == NULL &&
page->index > EOF) checks.

This was a long standing problem, so after a decade of being ignored,
the MMAPLOCK was added to XFS to serialise invalidation against page
fault based operations. At the time, page faults could instantiate page
cache pages whilst invalidation operations like
truncate_pagecache_range() were running, and hence page faults could be
instantiating and mapping pages over the very range we were trying to
invalidate.

We were also finding niche syscalls that caused data corruption due to
invalidation races (e.g. see xfs_file_fadvise(), which avoids readahead
vs hole punch races arising from the fadvise(WILLNEED) and readahead()
syscalls), so I did an audit to look for any other interfaces that
could race with invalidation. ->map_pages() being called from within
the page fault code and having only the broken page->index based check
for invalidation looked suspect and potentially broken. Hence I slapped
the MMAPLOCK around it to stop it from running while an XFS driven page
cache invalidation operation was in progress.

We work on the principle that when it comes to data corruption vectors,
it is far better to err on the side of safety than it is to play fast
and loose. Fault-around is a performance optimisation, and taking a
rwsem in shared mode is not a major increase in overhead for that path,
so there was little risk of regression in adding serialisation just in
case there was an as-yet-unknown data corruption vector in that path.

Keep in mind this was written before the mm code handled page cache
instantiation serialisation sanely via the mapping->invalidation_lock.
The mapping->invalidation_lock solves the same issues in a slightly
different way, and it may well be that the different implementation
means we no longer need to take it in all the places we originally
placed the MMAPLOCK in XFS.

> If XFS really needs it,
> it can trylock the semaphore and return 0 if it fails, falling back to
> the ->fault path.
> But I don't think XFS actually needs it.
> 
> The ->map_pages path trylocks the folio, checks the folio->mapping,
> checks uptodate, then checks beyond EOF (not relevant to hole punch).
> Then it takes the page table lock and puts the page(s) into the page
> tables, unlocks the folio and moves on to the next folio.
> 
> The hole-punch path, like the truncate path, takes the folio lock,
> unmaps the folio (which will take the page table lock) and removes
> it from the page cache.
> 
> So what's the race?

Hole punch is a multi-folio operation, so while we are invalidating one
folio, another folio in the range we've already invalidated can be
instantiated and mapped, leaving mapped, up-to-date pages over a range
we *require* the page cache to be empty over.

The original MMAPLOCK could not prevent the instantiation of new page
cache pages while an invalidation was running, hence we had to block
any page fault operation that instantiated pages into the page cache,
or operated on the page cache in any way, while an invalidation was
being run.

The mapping->invalidation_lock solved this specific aspect of the
problem, so it's entirely possible that we no longer have to care about
taking the MMAPLOCK for filemap_map_pages(). But I don't know that for
certain, I haven't had any time to investigate it in any detail, and
when it comes to data corruption vectors I'm not going to change
serialisation mechanisms without a decent amount of investigation. I
could never convince myself there wasn't a problem, hence the comment
in the commit message:

  "Do the same for XFS's ->map_pages implementation to close this
  potential data corruption issue."

Hence if you can explain to me how filemap_map_pages() can avoid racing
against invalidation without holding the mapping->invalidation_lock,
and without potentially leaving stale data in the page cache over the
invalidated range (this isn't an XFS specific issue!), then I don't see
a problem with removing the MMAPLOCK from this path.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx