On Tue, Dec 04, 2018 at 10:08:33AM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > In xfs_reflink_end_cow, we allocate a single transaction for the entire > end_cow operation and then loop the CoW fork mappings to move them to > the data fork. This design fails on a heavily fragmented filesystem > where an inode's data fork has exactly one more extent than would fit in > an extents-format fork, because the unmap can collapse the data fork > into extents format (freeing the bmbt block) but the remap can expand > the data fork back into a (newly allocated) bmbt block. If the number > of extents we end up remapping is large, we can overflow the block > reservation because we reserved blocks assuming that we were adding > mappings into an already-cleared area of the data fork. > > Let's say we have 8 extents in the data fork, 8 extents in the CoW fork, > and the data fork can hold at most 7 extents before needing to convert > to btree format; and that blocks A-P are discontiguous single-block > extents: > > 0......7 > D: ABCDEFGH > C: IJKLMNOP > > When a write to file blocks 0-7 completes, we must remap I-P into the > data fork. We start by removing H from the btree-format data fork. Now > we have 7 extents, so we convert the fork to extents format, freeing the > bmbt block. We then move P into the data fork and it now has 8 extents > again. We must convert the data fork back to btree format, requiring a > block allocation. If we repeat this sequence for blocks 6-5-4-3-2-1-0, > we'll need a total of 8 block allocations to remap all 8 blocks. We > reserved only enough blocks to handle one btree split (5 blocks on a 4k > block filesystem), which means we overflow the block reservation. > > To fix this issue, create a separate helper function to remap a single > extent, and change _reflink_end_cow to call it in a tight loop over the > entire range we're completing. As a side effect this also removes the > size restrictions on how many extents we can end_cow at a time, though > nobody ever hit that. It is not reasonable to reserve N blocks to remap > N blocks. > > Note that this can be reproduced after ~320 million fsx ops while > running generic/938 (long soak directio fsx exerciser): > > XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116 > <machine registers snipped> > Call Trace: > xfs_trans_dup+0x211/0x250 [xfs] > xfs_trans_roll+0x6d/0x180 [xfs] > xfs_defer_trans_roll+0x10c/0x3b0 [xfs] > xfs_defer_finish_noroll+0xdf/0x740 [xfs] > xfs_defer_finish+0x13/0x70 [xfs] > xfs_reflink_end_cow+0x2c6/0x680 [xfs] > xfs_dio_write_end_io+0x115/0x220 [xfs] > iomap_dio_complete+0x3f/0x130 > iomap_dio_rw+0x3c3/0x420 > xfs_file_dio_aio_write+0x132/0x3c0 [xfs] > xfs_file_write_iter+0x8b/0xc0 [xfs] > __vfs_write+0x193/0x1f0 > vfs_write+0xba/0x1c0 > ksys_write+0x52/0xc0 > do_syscall_64+0x50/0x160 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > --- Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> > fs/xfs/xfs_reflink.c | 232 ++++++++++++++++++++++++++++++-------------------- > 1 file changed, 138 insertions(+), 94 deletions(-) > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c > index 322a852ce284..c5b4fa004ca4 100644 > --- a/fs/xfs/xfs_reflink.c > +++ b/fs/xfs/xfs_reflink.c > @@ -623,54 +623,47 @@ xfs_reflink_cancel_cow_range( > } > > /* > - * Remap parts of a file's data fork after a successful CoW. > + * Remap part of the CoW fork into the data fork. > + * > + * We aim to remap the range starting at @offset_fsb and ending at @end_fsb > + * into the data fork; this function will remap what it can (at the end of the > + * range) and update @end_fsb appropriately. Each remap gets its own > + * transaction because we can end up merging and splitting bmbt blocks for > + * every remap operation and we'd like to keep the block reservation > + * requirements as low as possible. > */ > -int > -xfs_reflink_end_cow( > - struct xfs_inode *ip, > - xfs_off_t offset, > - xfs_off_t count) > +STATIC int > +xfs_reflink_end_cow_extent( > + struct xfs_inode *ip, > + xfs_fileoff_t offset_fsb, > + xfs_fileoff_t *end_fsb) > { > - struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK); > - struct xfs_bmbt_irec got, del; > - struct xfs_trans *tp; > - xfs_fileoff_t offset_fsb; > - xfs_fileoff_t end_fsb; > - int error; > - unsigned int resblks; > - xfs_filblks_t rlen; > - struct xfs_iext_cursor icur; > - > - trace_xfs_reflink_end_cow(ip, offset, count); > + struct xfs_bmbt_irec got, del; > + struct xfs_iext_cursor icur; > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_trans *tp; > + struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK); > + xfs_filblks_t rlen; > + unsigned int resblks; > + int error; > > /* No COW extents? That's easy! */ > - if (ifp->if_bytes == 0) > + if (ifp->if_bytes == 0) { > + *end_fsb = offset_fsb; > return 0; > + } > > - offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset); > - end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count); > + resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, > + XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp); > + if (error) > + return error; > > /* > - * Start a rolling transaction to switch the mappings. We're > - * unlikely ever to have to remap 16T worth of single-block > - * extents, so just cap the worst case extent count to 2^32-1. > - * Stick a warning in just in case, and avoid 64-bit division. > + * Lock the inode. We have to ijoin without automatic unlock because > + * the lead transaction is the refcountbt record deletion; the data > + * fork update follows as a deferred log item. > */ > - BUILD_BUG_ON(MAX_RW_COUNT > UINT_MAX); > - if (end_fsb - offset_fsb > UINT_MAX) { > - error = -EFSCORRUPTED; > - xfs_force_shutdown(ip->i_mount, SHUTDOWN_CORRUPT_INCORE); > - ASSERT(0); > - goto out; > - } > - resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount, > - (unsigned int)(end_fsb - offset_fsb), > - XFS_DATA_FORK); > - error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write, > - resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp); > - if (error) > - goto out; > - > xfs_ilock(ip, XFS_ILOCK_EXCL); > xfs_trans_ijoin(tp, ip, 0); > > @@ -679,80 +672,131 @@ xfs_reflink_end_cow( > * left by the time I/O completes for the loser of the race. In that > * case we are done. > */ > - if (!xfs_iext_lookup_extent_before(ip, ifp, &end_fsb, &icur, &got)) > + if (!xfs_iext_lookup_extent_before(ip, ifp, end_fsb, &icur, &got) || > + got.br_startoff + got.br_blockcount <= offset_fsb) { > + *end_fsb = offset_fsb; > goto out_cancel; > + } > > - /* Walk backwards until we're out of the I/O range... */ > - while (got.br_startoff + got.br_blockcount > offset_fsb) { > - del = got; > - xfs_trim_extent(&del, offset_fsb, end_fsb - offset_fsb); > - > - /* Extent delete may have bumped ext forward */ > - if (!del.br_blockcount) > - goto prev_extent; > + /* > + * Structure copy @got into @del, then trim @del to the range that we > + * were asked to remap. We preserve @got for the eventual CoW fork > + * deletion; from now on @del represents the mapping that we're > + * actually remapping. > + */ > + del = got; > + xfs_trim_extent(&del, offset_fsb, *end_fsb - offset_fsb); > > - /* > - * Only remap real extent that contain data. With AIO > - * speculatively preallocations can leak into the range we > - * are called upon, and we need to skip them. > - */ > - if (!xfs_bmap_is_real_extent(&got)) > - goto prev_extent; > + ASSERT(del.br_blockcount > 0); > > - /* Unmap the old blocks in the data fork. */ > - ASSERT(tp->t_firstblock == NULLFSBLOCK); > - rlen = del.br_blockcount; > - error = __xfs_bunmapi(tp, ip, del.br_startoff, &rlen, 0, 1); > - if (error) > - goto out_cancel; > + /* > + * Only remap real extents that contain data. With AIO, speculative > + * preallocations can leak into the range we are called upon, and we > + * need to skip them. > + */ > + if (!xfs_bmap_is_real_extent(&got)) { > + *end_fsb = del.br_startoff; > + goto out_cancel; > + } > > - /* Trim the extent to whatever got unmapped. */ > - if (rlen) { > - xfs_trim_extent(&del, del.br_startoff + rlen, > - del.br_blockcount - rlen); > - } > - trace_xfs_reflink_cow_remap(ip, &del); > + /* Unmap the old blocks in the data fork. */ > + rlen = del.br_blockcount; > + error = __xfs_bunmapi(tp, ip, del.br_startoff, &rlen, 0, 1); > + if (error) > + goto out_cancel; > > - /* Free the CoW orphan record. */ > - error = xfs_refcount_free_cow_extent(tp, del.br_startblock, > - del.br_blockcount); > - if (error) > - goto out_cancel; > + /* Trim the extent to whatever got unmapped. */ > + xfs_trim_extent(&del, del.br_startoff + rlen, del.br_blockcount - rlen); > + trace_xfs_reflink_cow_remap(ip, &del); > > - /* Map the new blocks into the data fork. */ > - error = xfs_bmap_map_extent(tp, ip, &del); > - if (error) > - goto out_cancel; > + /* Free the CoW orphan record. */ > + error = xfs_refcount_free_cow_extent(tp, del.br_startblock, > + del.br_blockcount); > + if (error) > + goto out_cancel; > > - /* Charge this new data fork mapping to the on-disk quota. */ > - xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_DELBCOUNT, > - (long)del.br_blockcount); > + /* Map the new blocks into the data fork. */ > + error = xfs_bmap_map_extent(tp, ip, &del); > + if (error) > + goto out_cancel; > > - /* Remove the mapping from the CoW fork. */ > - xfs_bmap_del_extent_cow(ip, &icur, &got, &del); > + /* Charge this new data fork mapping to the on-disk quota. */ > + xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_DELBCOUNT, > + (long)del.br_blockcount); > > - error = xfs_defer_finish(&tp); > - if (error) > - goto out_cancel; > - if (!xfs_iext_get_extent(ifp, &icur, &got)) > - break; > - continue; > -prev_extent: > - if (!xfs_iext_prev_extent(ifp, &icur, &got)) > - break; > - } > + /* Remove the mapping from the CoW fork. */ > + xfs_bmap_del_extent_cow(ip, &icur, &got, &del); > > error = xfs_trans_commit(tp); > xfs_iunlock(ip, XFS_ILOCK_EXCL); > if (error) > - goto out; > + return error; > + > + /* Update the caller about how much progress we made. */ > + *end_fsb = del.br_startoff; > return 0; > > out_cancel: > xfs_trans_cancel(tp); > xfs_iunlock(ip, XFS_ILOCK_EXCL); > -out: > - trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_); > + return error; > +} > + > +/* > + * Remap parts of a file's data fork after a successful CoW. > + */ > +int > +xfs_reflink_end_cow( > + struct xfs_inode *ip, > + xfs_off_t offset, > + xfs_off_t count) > +{ > + xfs_fileoff_t offset_fsb; > + xfs_fileoff_t end_fsb; > + int error = 0; > + > + trace_xfs_reflink_end_cow(ip, offset, count); > + > + offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset); > + end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count); > + > + /* > + * Walk backwards until we're out of the I/O range. The loop function > + * repeatedly cycles the ILOCK to allocate one transaction per remapped > + * extent. > + * > + * If we're being called by writeback then the the pages will still > + * have PageWriteback set, which prevents races with reflink remapping > + * and truncate. Reflink remapping prevents races with writeback by > + * taking the iolock and mmaplock before flushing the pages and > + * remapping, which means there won't be any further writeback or page > + * cache dirtying until the reflink completes. > + * > + * We should never have two threads issuing writeback for the same file > + * region. There are also have post-eof checks in the writeback > + * preparation code so that we don't bother writing out pages that are > + * about to be truncated. > + * > + * If we're being called as part of directio write completion, the dio > + * count is still elevated, which reflink and truncate will wait for. > + * Reflink remapping takes the iolock and mmaplock and waits for > + * pending dio to finish, which should prevent any directio until the > + * remap completes. Multiple concurrent directio writes to the same > + * region are handled by end_cow processing only occurring for the > + * threads which succeed; the outcome of multiple overlapping direct > + * writes is not well defined anyway. > + * > + * It's possible that a buffered write and a direct write could collide > + * here (the buffered write stumbles in after the dio flushes and > + * invalidates the page cache and immediately queues writeback), but we > + * have never supported this 100%. If either disk write succeeds the > + * blocks will be remapped. > + */ > + while (end_fsb > offset_fsb && !error) > + error = xfs_reflink_end_cow_extent(ip, offset_fsb, &end_fsb); > + > + if (error) > + trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_); > return error; > } >