Re: [PATCH] iomap: Return zero in case of unsuccessful pagecache invalidation before DIO

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Fri, 29 May 2020 04:31:16 -0700

On Fri, May 29, 2020 at 11:55:33AM +0100, Filipe Manana wrote:
> On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> >
> > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > >
> > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > because pages could be locked as a part of the extent. Return zero
> >
> > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > them off.
> 
> Yes, it will be confusing even for someone more familiar with btrfs.
> The changelog could be more detailed to make it clear what's happening and why.
> 
> So what happens:
> 
> 1) iomap_dio_rw() calls filemap_write_and_wait_range().
>     That starts delalloc for all dirty pages in the range and then
> waits for writeback to complete.
>     This is enough for most filesystems at least (if not all except btrfs).
> 
> 2) However, in btrfs once writeback finishes, a job is queued to run
> on a dedicated workqueue, to execute the function
> btrfs_finish_ordered_io().
>     So that job will be run after filemap_write_and_wait_range() returns.
>     That function locks the file range (using a btrfs specific data
> structure), does a bunch of things (updating several btrees), and then
> unlocks the file range.
> 
> 3) While iomap calls invalidate_inode_pages2_range(), which ends up
> calling the btrfs callback btfs_releasepage(),
>     btrfs_finish_ordered_io() is running and has the file range locked
> (this is what Goldwyn means by "pages could be locked", which is
> confusing because it's not about any locked struct page).
> 
> 4) Because the file range is locked, btrfs_releasepage() returns 0
> (page can't be released), this happens in the helper function
> try_release_extent_state().
>     Any page in that range is not dirty nor under writeback anymore
> and, in fact, btrfs_finished_ordered_io() doesn't do anything with the
> pages, it's only updating metadata.
> 
>     btrfs_releasepage() in this case could release the pages, but
> there are other contextes where the file range is locked, the pages
> are still not dirty and not under writeback, where this would not be
> safe to do.

Isn't this the bug, though?  Rather than returning "page can't be
released", shouldn't ->releasepage sleep on the extent state, at least
if the GFP mask indicates you can sleep?

> 5) So because of that invalidate_inode_pages2_range() returns
> non-zero, the iomap code prints that warning message and then proceeds
> with doing a direct IO write anyway.
> 
> What happens currently in btrfs, before Goldwyn's patchset:
> 
> 1) generic_file_direct_write() also calls filemap_write_and_wait_range().
> 2) After that it calls invalidate_inode_pages2_range() too, but if
> that returns non-zero, it doesn't print any warning and falls back to
> a buffered write.
> 
> So Goldwyn here is effectively adding that behaviour from
> generic_file_direct_write() to iomap.
> 
> Thanks.
> 
> >
> > > in case a page cache invalidation is unsuccessful so filesystems can
> > > fallback to buffered I/O. This is similar to
> > > generic_file_direct_write().
> > >
> > > This takes care of the following invalidation warning during btrfs
> > > mixed buffered and direct I/O using iomap_dio_rw():
> > >
> > > Page cache invalidation failure on direct I/O.  Possible data
> > > corruption due to collision with buffered I/O!
> > >
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
> > >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index e4addfc58107..215315be6233 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >        */
> > >       ret = invalidate_inode_pages2_range(mapping,
> > >                       pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > -     if (ret)
> > > -             dio_warn_stale_pagecache(iocb->ki_filp);
> > > -     ret = 0;
> > > +     /*
> > > +      * If a page can not be invalidated, return 0 to fall back
> > > +      * to buffered write.
> > > +      */
> > > +     if (ret) {
> > > +             if (ret == -EBUSY)
> > > +                     ret = 0;
> > > +             goto out_free_dio;
> >
> > XFS doesn't fall back to buffered io when directio fails, which means
> > this will cause a regression there.
> >
> > Granted mixing write types is bogus...
> >
> > --D
> >
> > > +     }
> > >
> > >       if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > >           !inode->i_sb->s_dio_done_wq) {
> > >
> > > --
> > > Goldwyn
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”