Re: [PATCH] xfs: fix livelock in delayed allocation at ENOSPC

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 27 Apr 2023 09:01:35 +1000

On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote:
> On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > On a filesystem with a non-zero stripe unit and a large sequential
> > write, delayed allocation will set a minimum allocation length of
> > the stripe unit. If allocation fails because there are no extents
> > long enough for an aligned minlen allocation, it is supposed to
> > fall back to unaligned allocation which allows single block extents
> > to be allocated.
> > 
> > When the allocator code was rewritten in the 6.3 cycle, this
> > fallback was broken - the old code used args->fsbno as both the
> > allocation target and the allocation result, the new code passes the
> > target as a separate parameter. The conversion didn't handle the
> > aligned->unaligned fallback path correctly - it reset args->fsbno to
> > the target fsbno on failure which broke allocation failure detection
> > in the high level code and so it never fell back to unaligned
> > allocations.
> > 
> > This resulted in a loop in writeback trying to allocate an aligned
> > block, getting a false positive success, trying to insert the result
> > in the BMBT. This did nothing because the extent already was in the
> > BMBT (merge results in an unchanged extent) and so it returned the
> > prior extent to the conversion code as the current iomap.
> > 
> > Because the iomap returned didn't cover the offset we tried to map,
> > xfs_convert_blocks() then retries the allocation, which fails in the
> > same way and now we have a livelock.
> > 
> > Reported-by: Brian Foster <bfoster@xxxxxxxxxx>
> > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> Insofar as this has revealed a whole ton of *more* problems in mkfs,
> Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

Thanks, I've added this to for-next and I'll include it in the pull
req to Linus tomorrow because I don't want to expose everyone using
merge window kernels to this ENOSPC issue even for a short while.
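
For anyone following along without the patch in front of them, the
failure mode described in the commit message can be sketched as a toy
model. This is not the kernel code: struct toy_alloc_arg, toy_alloc(),
toy_btalloc() and the free-space constants are hypothetical names made
up for illustration, and NULLFSBLOCK here is only a stand-in for the
kernel sentinel. The only point is that resetting fsbno to the target
on failure hides the failure from the caller, so the unaligned retry
never runs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's "no block allocated" sentinel (assumption). */
#define NULLFSBLOCK UINT64_MAX

/* Hypothetical, simplified arguments; not the real struct xfs_alloc_arg. */
struct toy_alloc_arg {
	uint64_t fsbno;   /* result: allocated block, NULLFSBLOCK on failure */
	uint32_t minlen;  /* minimum acceptable extent length in blocks */
	uint32_t len;     /* result: allocated length in blocks */
};

/* Toy free space: only a single one-block extent is left (near ENOSPC). */
static const uint32_t toy_longest_free = 1;
static const uint64_t toy_free_start = 100;

/* Try to allocate at least args->minlen blocks near the target block. */
static void toy_alloc(struct toy_alloc_arg *args, uint64_t target, bool buggy)
{
	if (toy_longest_free >= args->minlen) {
		args->fsbno = toy_free_start;
		args->len = args->minlen;
		return;
	}
	/*
	 * Failure. The buggy variant resets fsbno to the target, which
	 * looks like a valid result to a caller that only checks fsbno.
	 */
	args->fsbno = buggy ? target : NULLFSBLOCK;
	args->len = 0;
}

/*
 * Aligned attempt first (minlen = stripe unit), then fall back to an
 * unaligned single-block allocation if the first attempt failed.
 * Returns true only if blocks were really allocated.
 */
static bool toy_btalloc(uint64_t target, uint32_t stripe_unit, bool buggy)
{
	struct toy_alloc_arg args = {
		.fsbno = NULLFSBLOCK,
		.minlen = stripe_unit,
	};

	toy_alloc(&args, target, buggy);
	if (args.fsbno == NULLFSBLOCK) {
		/* Failure detected: retry with single-block extents allowed. */
		args.minlen = 1;
		toy_alloc(&args, target, false);
	}
	return args.fsbno != NULLFSBLOCK && args.len > 0;
}
```

In this model, toy_btalloc(50, 16, false) reaches the fallback and
allocates the one remaining block, while toy_btalloc(50, 16, true)
never retries and returns with nothing allocated. A caller that keeps
retrying on that result makes no progress, which is the livelock
described above.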

> Specifically: if I set su=128k,sw=4, some tests will try to format a
> 512M filesystem.  This results in an 8-AG filesystem with a log that
> fills up almost but not all of an entire AG.  The AG then ends up with
> an empty bnobt and an empty AGFL, and 25 missing blocks...

I used su=64k,sw=2 so I didn't see those specific issues. Mostly I
see failures due to mkfs warnings like this:

    +Warning: AG size is a multiple of stripe width.  This can cause performance
    +problems by aligning all AGs on the same disk.  To avoid this, run mkfs with
    +an AG size that is one stripe unit smaller or larger, for example 129248.

> ...oh and the new test vms that run this config failed to finish for
> some reason.  Sigh.

Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing
the xfs_repair process allows everything to keep going. I suspect
it's a prefetch race/deadlock...

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx