Re: [PATCH] iomap: Return zero in case of unsuccessful pagecache invalidation before DIO

Goldwyn Rodrigues <rgoldwyn@xxxxxxx> · Fri, 29 May 2020 07:45:10 -0500

On 12:50 29/05, Filipe Manana wrote:
> On Fri, May 29, 2020 at 12:31 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, May 29, 2020 at 11:55:33AM +0100, Filipe Manana wrote:
> > > On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > > >
> > > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > > because pages could be locked as a part of the extent. Return zero
> > > >
> > > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > > them off.
> > >
> > > Yes, it will be confusing even for someone more familiar with btrfs.
> > > The changelog could be more detailed to make it clear what's happening and why.
> > >
> > > So what happens:
> > >
> > > 1) iomap_dio_rw() calls filemap_write_and_wait_range().
> > >     That starts delalloc for all dirty pages in the range and then
> > > waits for writeback to complete.
> > >     This is enough for most filesystems at least (if not all except btrfs).
> > >
> > > 2) However, in btrfs once writeback finishes, a job is queued to run
> > > on a dedicated workqueue, to execute the function
> > > btrfs_finish_ordered_io().
> > >     So that job will be run after filemap_write_and_wait_range() returns.
> > >     That function locks the file range (using a btrfs specific data
> > > structure), does a bunch of things (updating several btrees), and then
> > > unlocks the file range.
> > >
> > > 3) While iomap calls invalidate_inode_pages2_range(), which ends up
> > > calling the btrfs callback btfs_releasepage(),
> > >     btrfs_finish_ordered_io() is running and has the file range locked
> > > (this is what Goldwyn means by "pages could be locked", which is
> > > confusing because it's not about any locked struct page).
> > >
> > > 4) Because the file range is locked, btrfs_releasepage() returns 0
> > > (page can't be released), this happens in the helper function
> > > try_release_extent_state().
> > >     Any page in that range is not dirty nor under writeback anymore
> > > and, in fact, btrfs_finished_ordered_io() doesn't do anything with the
> > > pages, it's only updating metadata.
> > >
> > >     btrfs_releasepage() in this case could release the pages, but
> > > there are other contextes where the file range is locked, the pages
> > > are still not dirty and not under writeback, where this would not be
> > > safe to do.
> >
> > Isn't this the bug, though?  Rather than returning "page can't be
> > released", shouldn't ->releasepage sleep on the extent state, at least
> > if the GFP mask indicates you can sleep?
> 
> Goldwyn mentioned in another thread that he had tried that with the
> following patch:
> 
> https://patchwork.kernel.org/patch/11275063/
> 
> But he mentioned it didn't work though, caused some locking problems.
> I don't know the details and I haven't tried the patchset yet.
> Goldwyn?
> 

Yes, direct I/O would wait for extent bits to be unlocked forever and hang.
I think it was against an fsync call, but I don't remember. In the light
of new developments, I would pursue this further. This should be valid
even in the current (before iomap patches) source.

-- 
Goldwyn