On Fri, May 29, 2020 at 11:55:33AM +0100, Filipe Manana wrote:
> On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> >
> > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > >
> > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > because pages could be locked as a part of the extent. Return zero
> >
> > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > them off.
>
> Yes, it will be confusing even for someone more familiar with btrfs.
> The changelog could be more detailed to make it clear what's happening and why.
>
> So what happens:
>
> 1) iomap_dio_rw() calls filemap_write_and_wait_range().
>     That starts delalloc for all dirty pages in the range and then
>     waits for writeback to complete.
>     This is enough for most filesystems at least (if not all except btrfs).
>
> 2) However, in btrfs once writeback finishes, a job is queued to run
>     on a dedicated workqueue, to execute the function
>     btrfs_finish_ordered_io().
>     So that job will be run after filemap_write_and_wait_range() returns.
>     That function locks the file range (using a btrfs specific data
>     structure), does a bunch of things (updating several btrees), and then
>     unlocks the file range.
>
> 3) While iomap calls invalidate_inode_pages2_range(), which ends up
>     calling the btrfs callback btrfs_releasepage(),
>     btrfs_finish_ordered_io() is running and has the file range locked
>     (this is what Goldwyn means by "pages could be locked", which is
>     confusing because it's not about any locked struct page).
>
> 4) Because the file range is locked, btrfs_releasepage() returns 0
>     (page can't be released), this happens in the helper function
>     try_release_extent_state().
>     Any page in that range is not dirty nor under writeback anymore
>     and, in fact, btrfs_finish_ordered_io() doesn't do anything with the
>     pages, it's only updating metadata.
>
>     btrfs_releasepage() in this case could release the pages, but
>     there are other contexts where the file range is locked, the pages
>     are still not dirty and not under writeback, where this would not be
>     safe to do.

Isn't this the bug, though?  Rather than returning "page can't be
released", shouldn't ->releasepage sleep on the extent state, at least
if the GFP mask indicates you can sleep?

> 5) So because of that invalidate_inode_pages2_range() returns
>     non-zero, the iomap code prints that warning message and then proceeds
>     with doing a direct IO write anyway.
>
> What happens currently in btrfs, before Goldwyn's patchset:
>
> 1) generic_file_direct_write() also calls filemap_write_and_wait_range().
> 2) After that it calls invalidate_inode_pages2_range() too, but if
>     that returns non-zero, it doesn't print any warning and falls back to
>     a buffered write.
>
> So Goldwyn here is effectively adding that behaviour from
> generic_file_direct_write() to iomap.
>
> Thanks.
>
> >
> > > in case a page cache invalidation is unsuccessful so filesystems can
> > > fallback to buffered I/O. This is similar to
> > > generic_file_direct_write().
> > >
> > > This takes care of the following invalidation warning during btrfs
> > > mixed buffered and direct I/O using iomap_dio_rw():
> > >
> > > Page cache invalidation failure on direct I/O.  Possible data
> > > corruption due to collision with buffered I/O!
> > >
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
> > >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index e4addfc58107..215315be6233 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >                */
> > >               ret = invalidate_inode_pages2_range(mapping,
> > >                               pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > -             if (ret)
> > > -                     dio_warn_stale_pagecache(iocb->ki_filp);
> > > -             ret = 0;
> > > +             /*
> > > +              * If a page can not be invalidated, return 0 to fall back
> > > +              * to buffered write.
> > > +              */
> > > +             if (ret) {
> > > +                     if (ret == -EBUSY)
> > > +                             ret = 0;
> > > +                     goto out_free_dio;
> >
> > XFS doesn't fall back to buffered io when directio fails, which means
> > this will cause a regression there.
> >
> > Granted mixing write types is bogus...
> >
> > --D
> >
> > > +             }
> > >
> > >               if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > >                   !inode->i_sb->s_dio_done_wq) {
> > >
> > > --
> > > Goldwyn
> >
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”
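
To make the ->releasepage question above concrete, here is a rough
user-space sketch of the behaviour I have in mind.  It is not kernel code
and not what btrfs does today: struct model_range, MODEL_GFP_CAN_SLEEP,
model_wait_for_unlock() and model_releasepage() are all hypothetical names
made up for illustration, and the "wait" is simulated by a loop rather than
a real sleep.  In the kernel the check would presumably be something along
the lines of gfpflags_allow_blocking(), and the wait would be on the btrfs
extent state.

```c
#include <stdbool.h>

/* Hypothetical stand-in for "this GFP mask allows the caller to block". */
#define MODEL_GFP_CAN_SLEEP 0x1u

/* Hypothetical model of a locked file range; not the real extent_state. */
struct model_range {
	bool locked;        /* range held, e.g. by ordered-extent completion */
	int pending_steps;  /* work the holder still has to do before unlocking */
};

/* Simulates sleeping until the holder finishes and drops the range lock. */
static void model_wait_for_unlock(struct model_range *r)
{
	while (r->pending_steps > 0)
		r->pending_steps--;  /* the holder makes progress while we "sleep" */
	r->locked = false;
}

/*
 * Returns 1 if the page may be released, 0 if it may not.
 *
 * With a non-blocking mask this mirrors the behaviour described in the
 * thread (a locked range means "can't release").  With a blocking mask it
 * waits for the range to be unlocked instead of failing.
 */
static int model_releasepage(struct model_range *r, unsigned int gfp_mask)
{
	if (r->locked) {
		if (!(gfp_mask & MODEL_GFP_CAN_SLEEP))
			return 0;
		model_wait_for_unlock(r);
	}
	return 1;
}
```

With something like this, invalidate_inode_pages2_range() would only see
a failure from a locked range when the caller really cannot block.  It does
not address the other contexts Filipe mentions where releasing is unsafe
even though the pages are clean; those would still need their own check.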