Re: [PATCH] xfs: fix indlen accounting error on partial delalloc conversion

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 9 May 2017 12:32:13 -0400

On Tue, May 09, 2017 at 09:09:48AM -0700, Darrick J. Wong wrote:
> On Wed, May 03, 2017 at 02:46:26PM -0400, Brian Foster wrote:
> > The delalloc -> real block conversion path uses an incorrect
> > calculation in the case where the middle part of a delalloc extent
> > is being converted. This is documented as a rare situation because
> > XFS generally attempts to maximize contiguity by converting as much
> > of a delalloc extent as possible.
> > 
> > If this situation does occur, the indlen reservation for the two new
> > delalloc extents left behind by the conversion of the middle range
> > is calculated and compared with the original reservation. If more
> > blocks are required, the delta is allocated from the global block
> > pool. This delta value can be characterized as the difference
> > between the new total requirement (temp + temp2) and the currently
> > available reservation minus those blocks that have already been
> > allocated (startblockval(PREV.br_startblock) - allocated).
> > 
> > The problem is that the current code does not account for previously
> > allocated blocks correctly. It subtracts the current allocation
> > count from the (new - old) delta rather than the old indlen
> > reservation. This means that more indlen blocks than have been
> > allocated end up stashed in the remaining extents and free space
> > accounting is broken as a result.
> > 
> > Fix up the calculation to subtract the allocated block count from
> > the original extent indlen and thus correctly allocate the
> > reservation delta based on the difference between the new total
> > requirement and the unused blocks from the original reservation.
> > Also remove a bogus assert that contradicts the fact that the new
> > indlen reservation can be larger than the original indlen
> > reservation.
> > 
> > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > ---
> > 
> > Hi all,
> > 
> > So while I fortunately was able to reproduce this problem and verify the
> > patch, I'm not able to do so in any way that facilitates a generic test
> > case. To reproduce, I had to hack up the allocation path to always do
> > single physical block allocations and otherwise not attempt to convert
> > whole delalloc extents, hack up an ioctl to do a
> > filemap_write_and_wait_range(), and run a test that repeatedly does 1M
> > buffered writes followed by fragmented, page-sized writebacks of that 1M
> > extent. This forces partial (middle) extent conversions until the extent
> > tree grows enough to require a bmbt block allocation. When that occurs,
> > I always end up with the following free space accounting inconsistency
> > reported via xfs_repair:
> > 
> > Phase 1 - find and verify superblock...     
> > Phase 2 - using internal log
> >         - zero log...               
> >         - scan filesystem freespace and inode maps...
> > sb_fdblocks 3929316, counted 3929314             
> > ...
> > 
> > With this patch applied, the same test no longer reproduces the
> > corruption. This otherwise survives my xfstests testing (without the
> > aforementioned hacks). Thoughts, reviews, flames appreciated.
> > 
> > Brian
> > 
> >  fs/xfs/libxfs/xfs_bmap.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index f02eb76..8adb91b 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -2065,8 +2065,10 @@ xfs_bmap_add_extent_delay_real(
> >  		}
> >  		temp = xfs_bmap_worst_indlen(bma->ip, temp);
> >  		temp2 = xfs_bmap_worst_indlen(bma->ip, temp2);
> > -		diff = (int)(temp + temp2 - startblockval(PREV.br_startblock) -
> > -			(bma->cur ? bma->cur->bc_private.b.allocated : 0));
> > +		diff = (int)(temp + temp2 -
> > +			     (startblockval(PREV.br_startblock) -
> > +			      (bma->cur ?
> > +			       bma->cur->bc_private.b.allocated : 0)));
> 
> <thinking aloud again>
> 
> Prior to this operation we made a delalloc reservation in which we
> reserved a number of blocks for bmbt expansion by subtracting that
> quantity from incore fdblocks; that is startblockval(PREV.br_startblock).
> Then we did some bmbt magic which used up cur->bc_private.b.allocated
> blocks from the reservation.  Since those blocks were already subtracted
> from the incore fdblocks, we don't need to do that now.  After the magic
> was over, we now have two smaller delalloc reservations (with a real
> extent in the middle), each with their own reservations for bmbt shape
> changing expansion.
> 
> Now we aim to make the accounting of incore fdblocks square again by
> checking if we are now hanging on to more blocks than we were before and
> subtracting the difference from fdblocks if so.  (We deal separately with
> the possibility of hanging onto /fewer/ blocks than before.)
> 
> Ok, so first we reserved startblockval(PREV.br_startblock).
> 
> After splitting the da extent in the middle we now have two da extents;
> the effective indlen reservation of the whole range we just touched is:
> xfs_bmap_worst_indlen(bma->ip, temp) + xfs_bmap_worst_indlen(bma->ip, temp2);
> 
> Then we (may have) fed bma->cur->bc_private.b.allocated blocks to the bmbt.
> The number of blocks we now know about is:
> 
> xfs_bmap_worst_indlen(bma->ip, temp) +
> xfs_bmap_worst_indlen(bma->ip, temp2) +
> bma->cur->bc_private.b.allocated
> 
> Therefore, the number of blocks we are over (if at all) is:
> 
> xfs_bmap_worst_indlen(bma->ip, temp) +
> xfs_bmap_worst_indlen(bma->ip, temp2) +
> bma->cur->bc_private.b.allocated -
> startblockval(PREV.br_startblock)
> 
> Or, put another way:
> 
> (xfs_bmap_worst_indlen(bma->ip, temp) +
>  xfs_bmap_worst_indlen(bma->ip, temp2)) -
> (startblockval(PREV.br_startblock) -
>  bma->cur->bc_private.b.allocated)
> 
> (Exactly what Brian wrote above.)
> 
> If that quantity is positive, we're now hanging on to more blocks than
> we were before, so we subtract that many from fdblocks.  That part looks
> ok, though I have one question -- why is diff an int?  temp/temp2 are
> both xfs_filblks_t (__uint64_t), so the whole expression is promoted to
> a 64-bit operation, then the results are squashed down into a 32-bit
> int, which is then cast back to int64_t.  With our short MAXEXTLEN I
> think it's impossible ever to have diff > 2^32, but the casting back and
> forth caught my eye.
> 

As to why it is signed, I'd guess that either it was coded that way to
avoid a separate conditional or perhaps diff was previously used
elsewhere in this function..? I'm not really sure why it uses 32-bit vs.
64-bit tbh. (I don't think it matters, but I'm fine if you want to use
an int64_t or something and clean up the casting).

> Aside from that, it looks ok...
> Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> 

Thanks for review.

Brian

> --D
> 
> >  		if (diff > 0) {
> >  			error = xfs_mod_fdblocks(bma->ip->i_mount,
> >  						 -((int64_t)diff), false);
> > @@ -2123,7 +2125,6 @@ xfs_bmap_add_extent_delay_real(
> >  		temp = da_new;
> >  		if (bma->cur)
> >  			temp += bma->cur->bc_private.b.allocated;
> > -		ASSERT(temp <= da_old);
> >  		if (temp < da_old)
> >  			xfs_mod_fdblocks(bma->ip->i_mount,
> >  					(int64_t)(da_old - temp), false);
> > -- 
> > 2.7.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html