On Fri, May 29, 2020 at 12:31 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, May 29, 2020 at 11:55:33AM +0100, Filipe Manana wrote:
> > On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> > >
> > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > >
> > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > because pages could be locked as a part of the extent. Return zero
> > >
> > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > them off.
> >
> > Yes, it will be confusing even for someone more familiar with btrfs.
> > The changelog could be more detailed to make it clear what's happening and why.
> >
> > So what happens:
> >
> > 1) iomap_dio_rw() calls filemap_write_and_wait_range().
> >     That starts delalloc for all dirty pages in the range and then
> >     waits for writeback to complete.
> >     This is enough for most filesystems at least (if not all except btrfs).
> >
> > 2) However, in btrfs once writeback finishes, a job is queued to run
> >     on a dedicated workqueue, to execute the function
> >     btrfs_finish_ordered_io().
> >     So that job will be run after filemap_write_and_wait_range() returns.
> >     That function locks the file range (using a btrfs specific data
> >     structure), does a bunch of things (updating several btrees), and then
> >     unlocks the file range.
> >
> > 3) While iomap calls invalidate_inode_pages2_range(), which ends up
> >     calling the btrfs callback btrfs_releasepage(),
> >     btrfs_finish_ordered_io() is running and has the file range locked
> >     (this is what Goldwyn means by "pages could be locked", which is
> >     confusing because it's not about any locked struct page).
> >
> > 4) Because the file range is locked, btrfs_releasepage() returns 0
> >     (page can't be released), this happens in the helper function
> >     try_release_extent_state().
> >     Any page in that range is not dirty nor under writeback anymore
> >     and, in fact, btrfs_finish_ordered_io() doesn't do anything with the
> >     pages, it's only updating metadata.
> >
> >     btrfs_releasepage() in this case could release the pages, but
> >     there are other contexts where the file range is locked, the pages
> >     are still not dirty and not under writeback, where this would not be
> >     safe to do.
>
> Isn't this the bug, though?  Rather than returning "page can't be
> released", shouldn't ->releasepage sleep on the extent state, at least
> if the GFP mask indicates you can sleep?

Goldwyn mentioned in another thread that he had tried that with the
following patch:

https://patchwork.kernel.org/patch/11275063/

But he mentioned that it didn't work and caused some locking problems.
I don't know the details and I haven't tried the patchset yet.
Goldwyn?

>
> > 5) So because of that invalidate_inode_pages2_range() returns
> >     non-zero, the iomap code prints that warning message and then proceeds
> >     with doing a direct IO write anyway.
> >
> > What happens currently in btrfs, before Goldwyn's patchset:
> >
> > 1) generic_file_direct_write() also calls filemap_write_and_wait_range().
> > 2) After that it calls invalidate_inode_pages2_range() too, but if
> >     that returns non-zero, it doesn't print any warning and falls back to
> >     a buffered write.
> >
> > So Goldwyn here is effectively adding that behaviour from
> > generic_file_direct_write() to iomap.
> >
> > Thanks.
> >
> > >
> > > > in case a page cache invalidation is unsuccessful so filesystems can
> > > > fallback to buffered I/O. This is similar to
> > > > generic_file_direct_write().
> > > >
> > > > This takes care of the following invalidation warning during btrfs
> > > > mixed buffered and direct I/O using iomap_dio_rw():
> > > >
> > > > Page cache invalidation failure on direct I/O.  Possible data
> > > > corruption due to collision with buffered I/O!
> > > >
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
> > > >
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index e4addfc58107..215315be6233 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > >                */
> > > >               ret = invalidate_inode_pages2_range(mapping,
> > > >                               pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -             if (ret)
> > > > -                     dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -             ret = 0;
> > > > +             /*
> > > > +              * If a page can not be invalidated, return 0 to fall back
> > > > +              * to buffered write.
> > > > +              */
> > > > +             if (ret) {
> > > > +                     if (ret == -EBUSY)
> > > > +                             ret = 0;
> > > > +                     goto out_free_dio;
> > >
> > > XFS doesn't fall back to buffered io when directio fails, which means
> > > this will cause a regression there.
> > >
> > > Granted mixing write types is bogus...
> > >
> > > --D
> > >
> > > > +             }
> > > >
> > > >               if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > >                   !inode->i_sb->s_dio_done_wq) {
> > > >
> > > > --
> > > > Goldwyn
> > > >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
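
P.S. To make the behaviour change in the quoted hunk easier to compare,
here is a rough userspace sketch of the two decision paths. It is not
kernel code: enum dio_outcome and the two helper functions are
hypothetical names invented for illustration, and inval_ret only stands
in for the return value of invalidate_inode_pages2_range() (0 on
success, a negative errno such as -EBUSY otherwise).

```c
#include <errno.h>

/* Hypothetical outcomes of the direct write path; not a kernel type. */
enum dio_outcome {
	DIO_PROCEED,		/* invalidation worked, do the direct write */
	DIO_PROCEED_WITH_WARN,	/* warn about stale page cache, write anyway */
	DIO_FALLBACK_BUFFERED,	/* return 0 so the caller can use buffered I/O */
	DIO_ERROR		/* bail out and propagate the error */
};

/*
 * Models the removed lines of the hunk: any invalidation failure only
 * triggers the warning, then ret is reset to 0 and the direct write
 * goes ahead.
 */
static enum dio_outcome iomap_before_patch(int inval_ret)
{
	return inval_ret ? DIO_PROCEED_WITH_WARN : DIO_PROCEED;
}

/*
 * Models the added lines of the hunk: -EBUSY is turned into a return
 * value of 0 (no bytes written), which a filesystem can treat as a cue
 * to fall back to a buffered write; any other error is returned as is.
 */
static enum dio_outcome iomap_after_patch(int inval_ret)
{
	if (inval_ret == 0)
		return DIO_PROCEED;
	if (inval_ret == -EBUSY)
		return DIO_FALLBACK_BUFFERED;
	return DIO_ERROR;
}
```

With inval_ret == -EBUSY (the btrfs case in step 4), the first helper
yields DIO_PROCEED_WITH_WARN and the second DIO_FALLBACK_BUFFERED. The
sketch does not model Darrick's concern: whether the caller actually
falls back when it sees 0 is up to each filesystem.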