On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > The page faultround path ->map_pages is implemented in XFS via > filemap_map_pages(). This function checks that pages found in page > cache lookups have not raced with truncate based invalidation by > checking page->mapping is correct and page->index is within EOF. > > However, we've known for a long time that this is not sufficient to > protect against races with invalidations done by operations that do > not change EOF. e.g. hole punching and other fallocate() based > direct extent manipulations. The way we protect against these > races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED > lock so they serialise against fallocate and truncate before calling > into the filemap function that processes the fault. > > Do the same for XFS's ->map_pages implementation to close this > potential data corruption issue. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/xfs_file.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 7b05f8fd7b3d..4b185a907432 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1266,10 +1266,23 @@ xfs_filemap_pfn_mkwrite( > return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); > } > > +static void > +xfs_filemap_map_pages( > + struct vm_fault *vmf, > + pgoff_t start_pgoff, > + pgoff_t end_pgoff) > +{ > + struct inode *inode = file_inode(vmf->vma->vm_file); > + > + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > + filemap_map_pages(vmf, start_pgoff, end_pgoff); > + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > +} > + > static const struct vm_operations_struct xfs_file_vm_ops = { > .fault = xfs_filemap_fault, > .huge_fault = xfs_filemap_huge_fault, > - .map_pages = filemap_map_pages, > + .map_pages = xfs_filemap_map_pages, > .page_mkwrite = xfs_filemap_page_mkwrite, > .pfn_mkwrite = xfs_filemap_pfn_mkwrite, > }; > -- > 2.26.2.761.g0e0b3e54be > It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix zonefs does not support hole punching, so it may not need to use mmap_sem at all. It is interesting to look at how this bug came to be duplicated in so many filesystems, because there are lessons to be learned. Commit f1820361f83d ("mm: implement ->map_pages for page cache") added to ->map_pages() operation and its commit message said: "...It should be safe to use filemap_map_pages() for ->map_pages() if filesystem use filemap_fault() for ->fault()." At the time, all of the aforementioned filesystems used filemap_fault() for ->fault(). But since then, ext4, xfs, f2fs and just recently gfs2 have added a filesystem ->fault() operation. orangefs has added vm_operations since and zonefs was added since, probably copying the mmap_sem handling from ext4. Both have a filesystem ->fault() operation. It was surprising for me to see that some of the filesystem developers signed on the added ->fault() operations are not strangers to mm. The recent gfs2 change was even reviewed by an established mm developer [1]. So what can we learn from this case study? How could we fix the interface to avoid repeating the same mistake in the future? Thanks, Amir. [1] https://lore.kernel.org/linux-fsdevel/20200703113801.GD25523@xxxxxxxxxxxxxxxxxxxx/