On Fri, Feb 10, 2023 at 08:53:58AM +1100, Dave Chinner wrote: > On Thu, Feb 09, 2023 at 02:44:20AM +0000, Matthew Wilcox wrote: > > On Thu, Feb 09, 2023 at 08:53:11AM +1100, Dave Chinner wrote: > > > > If XFS really needs it, > > > > it can trylock the semaphore and return 0 if it fails, falling back to > > > > the ->fault path. But I don't think XFS actually needs it. > > > > > > > > The ->map_pages path trylocks the folio, checks the folio->mapping, > > > > checks uptodate, then checks beyond EOF (not relevant to hole punch). > > > > Then it takes the page table lock and puts the page(s) into the page > > > > tables, unlocks the folio and moves on to the next folio. > > > > > > > > The hole-punch path, like the truncate path, takes the folio lock, > > > > unmaps the folio (which will take the page table lock) and removes > > > > it from the page cache. > > > > > > > > So what's the race? > > > > > > Hole punch is a multi-folio operation, so while we are operating on > > > invalidating one folio, another folio in the range we've already > > > invalidated could be instantiated and mapped, leaving mapped > > > up-to-date pages over a range we *require* the page cache to empty. > > > > Nope. ->map_pages is defined to _not_ instantiate new pages. > > If there are uptodate pages in the page cache, they can be mapped, but > > missing pages will be skipped, and left to ->fault to bring in. > > Sure, but *at the time this change was made* other operations could > instantiate pages whilst an invalidate was running, and then > ->map_pages could also find them and map them whilst that > invalidation was still running. i.e. the race conditions that > existed before the mapping->invalidate_lock was introduced (ie. we > couldn't intercept read page faults instantiating pages in the page > cache at all) didn't require ->map_pages to instantiate the page for > it to be able to expose incorrect data to userspace when page faults > raced with an ongoing invalidation operation. > > While this may not be able to happen now if everything is using the > mapping->invalidate_lock correctly (because read faults are now > intercepted before they can instatiate new page cache pages), it > doesn't mean it wasn't possible in the past..... Sorry, still not getting it. Here's the scenario I think you're talking about. We have three threads (probably in different tasks or they may end up getting synchronized on the page table locks). Thread 1 is calling FALLOC_FL_PUNCH_HOLE over a nice wide range. Thread 2 has the file mmaped and takes a read page fault. Thread 3 also has the file mmaped and also takes a read page fault. Thread 2 calls filemap_map_pages and finds the pages gone. It proceeds to call xfs_filemap_fault() which calls filemap_fault() without taking any XFS locks. filemap_fault() kicks off some readahead which allocates some pages & puts them in the page cache. It calls into xfs_vm_readahead() which calls iomap_readahead() without taking any XFS locks. iomap_readahead() will then call back into xfs_read_iomap_begin() which takes the XFS_ILOCK_SHARED. Since thread 1 is holding XFS_IOLOCK_EXCL, I presume thread 2 will block at this point until thread 1 is done. At this point, the page is still not uptodate, so thread 3 will not map the page if it finds it in >map_pages. Or have I misunderstood XFS inode locking? Entirely possible, it seems quite complicated. Nevertheless, it seems to me that if there's locking that's missing, there's ample opportunities for XFS to take those missing locks in the (slow) fault path, and not take them in the (fast) map_pages path.