On Fri, Jan 11, 2019 at 07:30:31AM -0500, Brian Foster wrote:
> The writeback code caches the current extent mapping across multiple
> xfs_do_writepage() calls to avoid repeated lookups for sequential
> pages backed by the same extent. This is known to be slightly racy
> with extent fork changes in certain difficult to reproduce
> scenarios. The cached extent is trimmed to within EOF to help avoid
> the most common vector for this problem via speculative
> preallocation management, but this is a band-aid that does not
> address the fundamental problem.
>
> Now that we have an xfs_ifork sequence counter mechanism used to
> facilitate COW writeback, we can use the same mechanism to validate
> consistency between the data fork and cached writeback mappings. On
> its face, this is somewhat of a big hammer approach because any
> change to the data fork invalidates any mapping currently cached by
> a writeback in progress regardless of whether the data fork change
> overlaps with the range under writeback. In practice, however, the
> impact of this approach is minimal in most cases.
>
> First, data fork changes (delayed allocations) caused by sustained
> sequential buffered writes are amortized across speculative
> preallocations. This means that a cached mapping won't be
> invalidated by each buffered write of a common file copy workload,
> but rather only on less frequent allocation events. Second, the
> extent tree is always entirely in-core so an additional lookup of a
> usable extent mostly costs a shared ilock cycle and in-memory tree
> lookup. This means that a cached mapping reval is relatively cheap
> compared to the I/O itself. Third, spurious invalidations don't
> impact ioend construction. This means that even if the same extent
> is revalidated multiple times across multiple writepage instances,
> we still construct and submit the same size ioend (and bio) if the
> blocks are physically contiguous.
>
> Update struct xfs_writepage_ctx with a new field to hold the
> sequence number of the data fork associated with the currently
> cached mapping. Check the wpc seqno against the data fork when the
> mapping is validated and reestablish the mapping whenever the fork
> has changed since the mapping was cached. This ensures that
> writeback always uses a valid extent mapping and thus prevents lost
> writebacks and stale delalloc block problems.
>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_aops.c  | 8 ++++++--
>  fs/xfs/xfs_iomap.c | 4 ++--
>  2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d9048bcea49c..33a1be5df99f 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -29,6 +29,7 @@
>  struct xfs_writepage_ctx {
>  	struct xfs_bmbt_irec	imap;
>  	unsigned int		io_type;
> +	unsigned int		data_seq;
>  	unsigned int		cow_seq;
>  	struct xfs_ioend	*ioend;
>  };
> @@ -347,7 +348,8 @@ xfs_map_blocks(
>  	 * out that ensures that we always see the current value.
>  	 */
>  	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> +		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
> +		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
>  	if (imap_valid &&
>  	    (!xfs_inode_has_cow_data(ip) ||
>  	     wpc->io_type == XFS_IO_COW ||

I suspect this next "if (imap_valid) ..." logic needs to be updated,
too, i.e. the next line is checking that the cow_seq has not changed.
I think wrapping this up in a helper (again!) might make more sense:

static bool
xfs_imap_valid(
	struct xfs_inode	*ip,
	struct xfs_writepage_ctx *wpc,
	xfs_fileoff_t		offset_fsb)
{
	if (offset_fsb < wpc->imap.br_startoff)
		return false;
	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
		return false;
	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;
	if (!xfs_inode_has_cow_data(ip))
		return true;
	if (wpc->io_type != XFS_IO_COW)
		return true;
	if (wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
		return false;
	return true;
}

and then put the shutdown check before we check the map for validity
(i.e. don't continue to write to the cached map after a shutdown has
been triggered):

	if (XFS_FORCED_SHUTDOWN(mp))
		return -EIO;

	if (xfs_imap_valid(ip, wpc, offset_fsb))
		return 0;

> @@ -417,6 +419,7 @@ xfs_map_blocks(
>  	 */
>  	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
>  		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
> +	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>
>  	if (imap.br_startoff > offset_fsb) {
> @@ -454,7 +457,8 @@ xfs_map_blocks(
>  	return 0;
>  allocate_blocks:
>  	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
> -			&wpc->cow_seq);
> +			whichfork == XFS_COW_FORK ?
> +					&wpc->cow_seq : &wpc->data_seq);
>  	if (error)
>  		return error;
>  	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 27c93b5f029d..0401e33d4e8f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
>  	int		whichfork,
>  	xfs_off_t	offset,
>  	xfs_bmbt_irec_t	*imap,
> -	unsigned int	*cow_seq)
> +	unsigned int	*seq)
>  {
>  	xfs_mount_t	*mp = ip->i_mount;
>  	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> @@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
>  			goto error0;
>
>  		if (whichfork == XFS_COW_FORK)
> -			*cow_seq = READ_ONCE(ifp->if_seq);
> +			*seq = READ_ONCE(ifp->if_seq);
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	}

One of the things that limits xfs_iomap_write_allocate() efficiency
is the mitigations for races against truncate, i.e. the huge comment
that starts:

		/*
		 * it is possible that the extents have changed since
		 * we did the read call as we dropped the ilock for a
		 * while. We have to be careful about truncates or hole
		 * punchs here - we are not allowed to allocate
		 * non-delalloc blocks here.
		....

Now that we can detect that the extents have changed in the data
fork, we can go back to allocating multiple extents per
xfs_bmapi_write() call by doing a sequence number check after we
lock the inode. If the sequence number does not match what was
passed in or returned from the previous loop, we return -EAGAIN.
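Roughly like this, say (an untested sketch only; exactly where it
slots in after we retake the ILOCK in the allocation loop is a
detail, and it reuses the function's existing *seq argument, ifp
and trans_cancel unwind):

		xfs_ilock(ip, XFS_ILOCK_EXCL);
		xfs_trans_ijoin(tp, ip, 0);

		/*
		 * *seq was sampled from the fork before we last
		 * dropped the ILOCK (or updated by the previous
		 * xfs_bmapi_write() pass). If it no longer matches,
		 * the extent list has changed underneath us, so back
		 * out and let the caller revalidate its cached map
		 * rather than trimming the allocation down to a
		 * single extent to dodge truncate/hole punch races.
		 */
		if (*seq != READ_ONCE(ifp->if_seq)) {
			error = -EAGAIN;
			goto trans_cancel;
		}

With a check like that in place, xfs_bmapi_write() could convert as
many extents as it likes in one pass, because stale extent state is
now detected rather than worked around.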
Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
handled correctly by xfs_map_blocks() anymore, i.e. it just returns
the error, which can lead to discarding the page rather than
checking to see if there was a valid map allocated. I think there's
some followup work here (another patch series). :/

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx