Re: [PATCH] iomap: Return zero in case of unsuccessful pagecache invalidation before DIO

Filipe Manana <fdmanana@xxxxxxxxx> · Fri, 29 May 2020 11:55:33 +0100

On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
>
> On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> >
> > Filesystems such as btrfs are unable to guarantee page invalidation
> > because pages could be locked as a part of the extent. Return zero
>
> Locked for what?  filemap_write_and_wait_range should have just cleaned
> them off.

Yes, it will be confusing even for someone more familiar with btrfs.
The changelog could be more detailed to make it clear what's happening and why.

So what happens:

1) iomap_dio_rw() calls filemap_write_and_wait_range().
    That starts delalloc for all dirty pages in the range and then
waits for writeback to complete.
    This is enough for most filesystems at least (if not all except btrfs).

2) However, in btrfs once writeback finishes, a job is queued to run
on a dedicated workqueue, to execute the function
btrfs_finish_ordered_io().
    So that job will be run after filemap_write_and_wait_range() returns.
    That function locks the file range (using a btrfs specific data
structure), does a bunch of things (updating several btrees), and then
unlocks the file range.

3) While iomap calls invalidate_inode_pages2_range(), which ends up
calling the btrfs callback btfs_releasepage(),
    btrfs_finish_ordered_io() is running and has the file range locked
(this is what Goldwyn means by "pages could be locked", which is
confusing because it's not about any locked struct page).

4) Because the file range is locked, btrfs_releasepage() returns 0
(page can't be released), this happens in the helper function
try_release_extent_state().
    Any page in that range is not dirty nor under writeback anymore
and, in fact, btrfs_finished_ordered_io() doesn't do anything with the
pages, it's only updating metadata.

    btrfs_releasepage() in this case could release the pages, but
there are other contextes where the file range is locked, the pages
are still not dirty and not under writeback, where this would not be
safe to do.

5) So because of that invalidate_inode_pages2_range() returns
non-zero, the iomap code prints that warning message and then proceeds
with doing a direct IO write anyway.

What happens currently in btrfs, before Goldwyn's patchset:

1) generic_file_direct_write() also calls filemap_write_and_wait_range().
2) After that it calls invalidate_inode_pages2_range() too, but if
that returns non-zero, it doesn't print any warning and falls back to
a buffered write.

So Goldwyn here is effectively adding that behaviour from
generic_file_direct_write() to iomap.

Thanks.

>
> > in case a page cache invalidation is unsuccessful so filesystems can
> > fallback to buffered I/O. This is similar to
> > generic_file_direct_write().
> >
> > This takes care of the following invalidation warning during btrfs
> > mixed buffered and direct I/O using iomap_dio_rw():
> >
> > Page cache invalidation failure on direct I/O.  Possible data
> > corruption due to collision with buffered I/O!
> >
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index e4addfc58107..215315be6233 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >        */
> >       ret = invalidate_inode_pages2_range(mapping,
> >                       pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > -     if (ret)
> > -             dio_warn_stale_pagecache(iocb->ki_filp);
> > -     ret = 0;
> > +     /*
> > +      * If a page can not be invalidated, return 0 to fall back
> > +      * to buffered write.
> > +      */
> > +     if (ret) {
> > +             if (ret == -EBUSY)
> > +                     ret = 0;
> > +             goto out_free_dio;
>
> XFS doesn't fall back to buffered io when directio fails, which means
> this will cause a regression there.
>
> Granted mixing write types is bogus...
>
> --D
>
> > +     }
> >
> >       if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> >           !inode->i_sb->s_dio_done_wq) {
> >
> > --
> > Goldwyn

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”