Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 17 Sep 2020 11:44:54 +1000

On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> > On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > The page faultround path ->map_pages is implemented in XFS via
> > > filemap_map_pages(). This function checks that pages found in page
> > > cache lookups have not raced with truncate based invalidation by
> > > checking page->mapping is correct and page->index is within EOF.
> > >
> > > However, we've known for a long time that this is not sufficient to
> > > protect against races with invalidations done by operations that do
> > > not change EOF. e.g. hole punching and other fallocate() based
> > > direct extent manipulations. The way we protect against these
> > > races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> > > lock so they serialise against fallocate and truncate before calling
> > > into the filemap function that processes the fault.
> > >
> > > Do the same for XFS's ->map_pages implementation to close this
> > > potential data corruption issue.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/xfs_file.c | 15 ++++++++++++++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 7b05f8fd7b3d..4b185a907432 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1266,10 +1266,23 @@ xfs_filemap_pfn_mkwrite(
> > >         return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
> > >  }
> > >
> > > +static void
> > > +xfs_filemap_map_pages(
> > > +       struct vm_fault         *vmf,
> > > +       pgoff_t                 start_pgoff,
> > > +       pgoff_t                 end_pgoff)
> > > +{
> > > +       struct inode            *inode = file_inode(vmf->vma->vm_file);
> > > +
> > > +       xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > > +       filemap_map_pages(vmf, start_pgoff, end_pgoff);
> > > +       xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > > +}
> > > +
> > >  static const struct vm_operations_struct xfs_file_vm_ops = {
> > >         .fault          = xfs_filemap_fault,
> > >         .huge_fault     = xfs_filemap_huge_fault,
> > > -       .map_pages      = filemap_map_pages,
> > > +       .map_pages      = xfs_filemap_map_pages,
> > >         .page_mkwrite   = xfs_filemap_page_mkwrite,
> > >         .pfn_mkwrite    = xfs_filemap_pfn_mkwrite,
> > >  };
> > > --
> > > 2.26.2.761.g0e0b3e54be
> > >
> > 
> > It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
> > 
> > zonefs does not support hole punching, so it may not need to use
> > mmap_sem at all.
> 
> So I've written an ext4 fix for this but before actually posting it Nikolay
> working on btrfs fix asked why exactly is filemap_map_pages() actually a
> problem and I think he's right it actually isn't a problem. The thing is:
> filemap_map_pages() never does any page mapping or IO. It only creates PTEs
> for uptodate pages that are already in page cache.

So....

P0					p1

hole punch starts
  takes XFS_MMAPLOCK_EXCL
  truncate_pagecache_range()
    unmap_mapping_range(start, end)
      <clears ptes>
					<read fault>
					do_fault_around()
					  ->map_pages
					    filemap_map_pages()
					      page mapping valid,
					      page is up to date
					      maps PTEs
					<fault done>
    truncate_inode_pages_range()
      truncate_cleanup_page(page)
        invalidates page
      delete_from_page_cache_batch(page)
        frees page
					<pte now points to a freed page>

That doesn't seem good to me.

Sure, maybe the page hasn't been freed back to the free lists
because of elevated refcounts. But it's been released by the
filesystem and not longer in the page cache so nothing good can come
of this situation...

AFAICT, this race condition exists for the truncate case as well
as filemap_map_pages() doesn't have a "page beyond inode size" check
in it. However, exposure here is very limited in the truncate case
because truncate_setsize()->truncate_pagecache() zaps the PTEs
again after invalidating the page cache.

Either way, adding the XFS_MMAPLOCK_SHARED around
filemap_map_pages() avoids the race condition for fallocate and
truncate operations for XFS...

> As such it is a rather
> different beast compared to the fault handler from fs POV and does not need
> protection from hole punching (current serialization on page lock and
> checking of page->mapping is enough).
> That being said I agree this is subtle and the moment someone adds e.g. a
> readahead call into filemap_map_pages() we have a real problem. I'm not
> sure how to prevent this risk...

Subtle, yes. So subtle, in fact, I fail to see any reason why the
above race cannot occur as there's no obvious serialisation in the
page cache between PTE zapping and page invalidation to prevent a
fault from coming in an re-establishing the PTEs before invalidation
occurs.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx