Re: [PATCH] xfs: split up the xfs_reflink_end_cow work into smaller transactions

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 3 Dec 2018 15:19:29 -0800

On Mon, Dec 03, 2018 at 10:13:12AM -0500, Brian Foster wrote:
> On Fri, Nov 30, 2018 at 11:19:10AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > In xfs_reflink_end_cow, we allocate a single transaction for the entire
> > end_cow operation and then loop the CoW fork mappings to move them to
> > the data fork.  This design fails on a heavily fragmented filesystem
> > where an inode's data fork has exactly one more extent than would fit in
> > an extents-format fork, because the unmap can collapse the data fork
> > into extents format (freeing the bmbt block) but the remap can expand
> > the data fork back into a (newly allocated) bmbt block.  If the number
> > of extents we end up remapping is large, we can overflow the block
> > reservation because we reserved blocks assuming that we were adding
> > mappings into an already-cleared area of the data fork.
> > 
> > Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
> > and the data fork can hold at most 7 extents before needing to convert
> > to btree format; and that blocks A-P are discontiguous single-block
> > extents:
> > 
> >    0......7
> > D: ABCDEFGH
> > C: IJKLMNOP
> > 
> > When a write to file blocks 0-7 completes, we must remap I-P into the
> > data fork.  We start by removing H from the btree-format data fork.  Now
> > we have 7 extents, so we convert the fork to extents format, freeing the
> > bmbt block.   We then move P into the data fork and it now has 8 extents
> > again.  We must convert the data fork back to btree format, requiring a
> > block allocation.  If we repeat this sequence for blocks 6-5-4-3-2-1-0,
> > we'll need a total of 8 block allocations to remap all 8 blocks.  We
> > reserved only enough blocks to handle one btree split (5 blocks on a 4k
> > block filesystem), which means we overflow the block reservation.
> > 
> > To fix this issue, create a separate helper function to remap a single
> > extent, and change _reflink_end_cow to call it in a tight loop over the
> > entire range we're completing.  As a side effect this also removes the
> > size restrictions on how many extents we can end_cow at a time, though
> > nobody ever hit that.  It is not reasonable to reserve N blocks to remap
> > N blocks.
> > 
> > Note that this can be reproduced after ~320 million fsx ops while
> > running generic/938 (long soak directio fsx exerciser):
> > 
> > XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
> > <machine registers snipped>
> > Call Trace:
> >  xfs_trans_dup+0x211/0x250 [xfs]
> >  xfs_trans_roll+0x6d/0x180 [xfs]
> >  xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
> >  xfs_defer_finish_noroll+0xdf/0x740 [xfs]
> >  xfs_defer_finish+0x13/0x70 [xfs]
> >  xfs_reflink_end_cow+0x2c6/0x680 [xfs]
> >  xfs_dio_write_end_io+0x115/0x220 [xfs]
> >  iomap_dio_complete+0x3f/0x130
> >  iomap_dio_rw+0x3c3/0x420
> >  xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
> >  xfs_file_write_iter+0x8b/0xc0 [xfs]
> >  __vfs_write+0x193/0x1f0
> >  vfs_write+0xba/0x1c0
> >  ksys_write+0x52/0xc0
> >  do_syscall_64+0x50/0x160
> >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_reflink.c |  202 ++++++++++++++++++++++++++------------------------
> >  1 file changed, 105 insertions(+), 97 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 322a852ce284..af0b2da76adf 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -623,53 +623,41 @@ xfs_reflink_cancel_cow_range(
> >  }
> >  
> >  /*
> > - * Remap parts of a file's data fork after a successful CoW.
> > + * Remap part of the CoW fork into the data fork.
> > + *
> > + * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
> > + * into the data fork; this function will remap what it can (at the end of the
> > + * range) and update @end_fsb appropriately.  Each remap gets its own
> > + * transaction because we can end up merging and splitting bmbt blocks for
> > + * every remap operation and we'd like to keep the block reservation
> > + * requirements as low as possible.
> >   */
> > -int
> > -xfs_reflink_end_cow(
> > -	struct xfs_inode		*ip,
> > -	xfs_off_t			offset,
> > -	xfs_off_t			count)
> > +STATIC int
> > +xfs_reflink_end_cow_extent(
> > +	struct xfs_inode	*ip,
> > +	xfs_fileoff_t		offset_fsb,
> > +	xfs_fileoff_t		*end_fsb)
> >  {
> > -	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > -	struct xfs_bmbt_irec		got, del;
> > -	struct xfs_trans		*tp;
> > -	xfs_fileoff_t			offset_fsb;
> > -	xfs_fileoff_t			end_fsb;
> > -	int				error;
> > -	unsigned int			resblks;
> > -	xfs_filblks_t			rlen;
> > -	struct xfs_iext_cursor		icur;
> > -
> > -	trace_xfs_reflink_end_cow(ip, offset, count);
> > +	struct xfs_bmbt_irec	got, del;
> > +	struct xfs_iext_cursor	icur;
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_trans	*tp;
> > +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > +	xfs_filblks_t		rlen;
> > +	unsigned int		resblks;
> > +	int			error;
> >  
> >  	/* No COW extents?  That's easy! */
> > -	if (ifp->if_bytes == 0)
> > +	if (ifp->if_bytes == 0) {
> > +		*end_fsb = offset_fsb;
> >  		return 0;
> > -
> > -	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > -	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> > -
> > -	/*
> > -	 * Start a rolling transaction to switch the mappings.  We're
> > -	 * unlikely ever to have to remap 16T worth of single-block
> > -	 * extents, so just cap the worst case extent count to 2^32-1.
> > -	 * Stick a warning in just in case, and avoid 64-bit division.
> > -	 */
> > -	BUILD_BUG_ON(MAX_RW_COUNT > UINT_MAX);
> > -	if (end_fsb - offset_fsb > UINT_MAX) {
> > -		error = -EFSCORRUPTED;
> > -		xfs_force_shutdown(ip->i_mount, SHUTDOWN_CORRUPT_INCORE);
> > -		ASSERT(0);
> > -		goto out;
> >  	}
> > -	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
> > -			(unsigned int)(end_fsb - offset_fsb),
> > -			XFS_DATA_FORK);
> > -	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> > -			resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
> > +
> > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> > +			XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
> >  	if (error)
> > -		goto out;
> > +		goto out_err;
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	xfs_trans_ijoin(tp, ip, 0);
> > @@ -679,80 +667,100 @@ xfs_reflink_end_cow(
> >  	 * left by the time I/O completes for the loser of the race.  In that
> >  	 * case we are done.
> >  	 */
> > -	if (!xfs_iext_lookup_extent_before(ip, ifp, &end_fsb, &icur, &got))
> > +	if (!xfs_iext_lookup_extent_before(ip, ifp, end_fsb, &icur, &got) ||
> > +	    got.br_startoff + got.br_blockcount <= offset_fsb) {
> > +		*end_fsb = offset_fsb;
> >  		goto out_cancel;
> > +	}
> >  
> > -	/* Walk backwards until we're out of the I/O range... */
> > -	while (got.br_startoff + got.br_blockcount > offset_fsb) {
> > -		del = got;
> > -		xfs_trim_extent(&del, offset_fsb, end_fsb - offset_fsb);
> > -
> > -		/* Extent delete may have bumped ext forward */
> > -		if (!del.br_blockcount)
> > -			goto prev_extent;
> > +	/*
> > +	 * Structure copy @got into @del, then trim @del to the range that we
> > +	 * were asked to remap.  We preserve @got for the eventual CoW fork
> > +	 * deletion; from now on @del represents the mapping that we're
> > +	 * actually remapping.
> > +	 */
> > +	del = got;
> > +	xfs_trim_extent(&del, offset_fsb, *end_fsb - offset_fsb);
> >  
> > -		/*
> > -		 * Only remap real extent that contain data.  With AIO
> > -		 * speculatively preallocations can leak into the range we
> > -		 * are called upon, and we need to skip them.
> > -		 */
> > -		if (!xfs_bmap_is_real_extent(&got))
> > -			goto prev_extent;
> > +	ASSERT(del.br_blockcount > 0);
> >  
> > -		/* Unmap the old blocks in the data fork. */
> > -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > -		rlen = del.br_blockcount;
> > -		error = __xfs_bunmapi(tp, ip, del.br_startoff, &rlen, 0, 1);
> > -		if (error)
> > -			goto out_cancel;
> > +	/*
> > +	 * Only remap real extents that contain data.  With AIO, speculative
> > +	 * preallocations can leak into the range we are called upon, and we
> > +	 * need to skip them.
> > +	 */
> > +	if (!xfs_bmap_is_real_extent(&got)) {
> > +		*end_fsb = del.br_startoff;
> > +		goto out_cancel;
> > +	}
> >  
> > -		/* Trim the extent to whatever got unmapped. */
> > -		if (rlen) {
> > -			xfs_trim_extent(&del, del.br_startoff + rlen,
> > -				del.br_blockcount - rlen);
> > -		}
> > -		trace_xfs_reflink_cow_remap(ip, &del);
> > +	/* Unmap the old blocks in the data fork. */
> > +	rlen = del.br_blockcount;
> > +	error = __xfs_bunmapi(tp, ip, del.br_startoff, &rlen, 0, 1);
> > +	if (error)
> > +		goto out_cancel;
> >  
> > -		/* Free the CoW orphan record. */
> > -		error = xfs_refcount_free_cow_extent(tp, del.br_startblock,
> > -				del.br_blockcount);
> > -		if (error)
> > -			goto out_cancel;
> > +	/* Trim the extent to whatever got unmapped. */
> > +	xfs_trim_extent(&del, del.br_startoff + rlen, del.br_blockcount - rlen);
> > +	trace_xfs_reflink_cow_remap(ip, &del);
> >  
> > -		/* Map the new blocks into the data fork. */
> > -		error = xfs_bmap_map_extent(tp, ip, &del);
> > -		if (error)
> > -			goto out_cancel;
> > +	/* Free the CoW orphan record. */
> > +	error = xfs_refcount_free_cow_extent(tp, del.br_startblock,
> > +			del.br_blockcount);
> > +	if (error)
> > +		goto out_cancel;
> >  
> > -		/* Charge this new data fork mapping to the on-disk quota. */
> > -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_DELBCOUNT,
> > -				(long)del.br_blockcount);
> > +	/* Map the new blocks into the data fork. */
> > +	error = xfs_bmap_map_extent(tp, ip, &del);
> > +	if (error)
> > +		goto out_cancel;
> >  
> > -		/* Remove the mapping from the CoW fork. */
> > -		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> > +	/* Charge this new data fork mapping to the on-disk quota. */
> > +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_DELBCOUNT,
> > +			(long)del.br_blockcount);
> >  
> > -		error = xfs_defer_finish(&tp);
> > -		if (error)
> > -			goto out_cancel;
> > -		if (!xfs_iext_get_extent(ifp, &icur, &got))
> > -			break;
> > -		continue;
> > -prev_extent:
> > -		if (!xfs_iext_prev_extent(ifp, &icur, &got))
> > -			break;
> > -	}
> > +	/* Remove the mapping from the CoW fork. */
> > +	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> >  
> >  	error = xfs_trans_commit(tp);
> > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	if (error)
> > -		goto out;
> > -	return 0;
> > +		goto out_unlock;
> >  
> > +	/* Update the caller about how much progress we made. */
> > +	*end_fsb = del.br_startoff;
> > +	goto out_unlock;
> >  out_cancel:
> >  	xfs_trans_cancel(tp);
> > +out_unlock:
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -out:
> > -	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> > +out_err:
> > +	return error;
> > +}
> > +
> 
> Nit, but the error handling/labels could be cleaned up a bit here. The
> ilock can be transferred to the transaction to eliminate that label, the
> trans alloc and commit sites can then just return directly to eliminate
> out_err, and we're left with a single out_cancel label (and no longer
> need a goto in the common return path).

Ok, will do.

> > +/*
> > + * Remap parts of a file's data fork after a successful CoW.
> > + */
> > +int
> > +xfs_reflink_end_cow(
> > +	struct xfs_inode		*ip,
> > +	xfs_off_t			offset,
> > +	xfs_off_t			count)
> > +{
> > +	xfs_fileoff_t			offset_fsb;
> > +	xfs_fileoff_t			end_fsb;
> > +	int				error = 0;
> > +
> > +	trace_xfs_reflink_end_cow(ip, offset, count);
> > +
> > +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> > +
> > +	/* Walk backwards until we're out of the I/O range... */
> > +	while (end_fsb > offset_fsb && !error)
> > +		error = xfs_reflink_end_cow_extent(ip, offset_fsb, &end_fsb);
> > +
> 
> The high level change seems reasonable provided we've thought through
> whether it's Ok to cycle the ILOCK around this operation. For example,
> maybe this is not possible, but I'm wondering if the end I/O processing
> could technically race with something crazy across a lock cycle like a
> truncate followed by another reflink -> COW writeback that lands in the
> same target range of the file.

I think the answer to this is that reflink always flushes both ranges
and truncates the pagecache (with iolock/mmaplock held) before starting
the remapping, and pagecache truncation waits for PageWriteback pages,
which means that we shouldn't end up reflinking into a range that's
undergoing writeback.

As far as reflink + directio writes go, reflink also waits for all dio
to finish so there shouldn't be a race there, and the dio write will
flush dirty pages and invalidate the pagecache before it starts the dio,
so I think we're (mostly) protected from races there, insofar as we
don't 100% support buffered and dio writes to the same parts of a
file....

> I guess we'd only remap COW blocks if they are "real," but that
> conversion looks like it happens on I/O submission, so what if that I/O
> happens to fail? Could we end up with racing success/fail COW I/O
> completions somehow or another?

...so I think that leaves only the potential for a race between two
directio writes to the same region where one of them succeeds and the
other fails.  I /thought/ that there could be a race there, but I
noticed that on dio error we no longer _reflink_cancel_cow, we simply
don't _reflink_remap_cow.  So I think we're ok there too.  If both both
dio writes succeed then one or the other of them will do the remapping.

(I'm not sure how we define the expected outcome of simultaneous dio
writes to the same part of a file, but that seems rather suspect to
me...)

--D

> (Conversely, if the ilock approach is safe here then a quick note in the
> commit log as to why would be useful.)

> Brian
> 
> > +	if (error)
> > +		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> >  	return error;
> >  }
> >