Re: [PATCH] xfs: fix livelock in delayed allocation at ENOSPC

On Thu, Apr 27, 2023 at 10:11:24AM +1000, Dave Chinner wrote:
> On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 27, 2023 at 09:01:35AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote:
> > > > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > > 
> > > > > On a filesystem with a non-zero stripe unit and a large sequential
> > > > > write, delayed allocation will set a minimum allocation length of
> > > > > the stripe unit. If allocation fails because there are no extents
> > > > > long enough for an aligned minlen allocation, it is supposed to
> > > > > fall back to unaligned allocation which allows single block extents
> > > > > to be allocated.
> > > > > 
> > > > > When the allocator code was rewritten in the 6.3 cycle, this
> > > > > fallback was broken - the old code used args->fsbno as both the
> > > > > allocation target and the allocation result, while the new code
> > > > > passes the target as a separate parameter. The conversion didn't
> > > > > handle the aligned->unaligned fallback path correctly - it reset
> > > > > args->fsbno to the target fsbno on failure, which broke allocation
> > > > > failure detection in the high level code, so it never fell back
> > > > > to unaligned allocations.
> > > > > 
> > > > > This resulted in writeback looping: it would try to allocate an
> > > > > aligned block, get a false positive success, and try to insert the
> > > > > result in the BMBT. The insert did nothing because the extent was
> > > > > already in the BMBT (the merge results in an unchanged extent), so
> > > > > it returned the prior extent to the conversion code as the current
> > > > > iomap.
> > > > > 
> > > > > Because the iomap returned didn't cover the offset we tried to map,
> > > > > xfs_convert_blocks() then retries the allocation, which fails in the
> > > > > same way and now we have a livelock.
> > > > > 
> > > > > Reported-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
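
For readers not familiar with this path, the shape of the intended
fallback looks roughly like the sketch below.  The helper names are
hypothetical stand-ins, not the real xfs_bmap_btalloc() internals; only
the args->fsbno/NULLFSBLOCK convention is taken from the description
above.

static int
btalloc_fallback_sketch(struct xfs_alloc_arg *args, xfs_fsblock_t target)
{
	int			error;

	/* First pass: stripe-unit aligned allocation with minlen set. */
	error = try_aligned_alloc(args, target);	/* hypothetical helper */
	if (error)
		return error;
	if (args->fsbno != NULLFSBLOCK)
		return 0;	/* aligned allocation succeeded */

	/*
	 * Aligned allocation failed.  The broken conversion effectively
	 * did "args->fsbno = target" at this point, so the NULLFSBLOCK
	 * check the higher level code uses to detect failure never fired
	 * and this unaligned retry was skipped, giving the writeback
	 * livelock described above.
	 */
	return try_unaligned_alloc(args, target);	/* hypothetical helper */
}
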
> > > > 
> > > > Insofar as this has revealed a whole ton of *more* problems in mkfs,
> > > > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > Thanks, I've added this to for-next and I'll include it in the pull
> > > req to Linus tomorrow because I don't want to expose everyone using
> > > merge window kernels to this ENOSPC issue even for a short while.
> > > 
> > > > Specifically: if I set su=128k,sw=4, some tests will try to format a
> > > > 512M filesystem.  This results in an 8-AG filesystem with a log that
> > > > fills up almost but not all of an entire AG.  The AG then ends up with
> > > > an empty bnobt and an empty AGFL, and 25 missing blocks...
> > > 
> > > I used su=64k,sw=2 so I didn't see those specific issues. Mostly I
> > > see failures due to mkfs warnings like this:
> > > 
> > >     +Warning: AG size is a multiple of stripe width.  This can cause performance
> > >     +problems by aligning all AGs on the same disk.  To avoid this, run mkfs with
> > >     +an AG size that is one stripe unit smaller or larger, for example 129248.
> > 
> > Yeah, I noticed that one, and am testing a patch to quiet down mkfs a
> > little bit.
> > 
> > I also caught a bug in the AG formatting code where the bnobt gets
> > written out with zero records if the log happens to start beyond
> > m_ag_prealloc_size and end at EOAG.
> > 
> > I also noticed that the percpu inodegc workers occasionally run on the
> > wrong CPU, but only on arm.  Tonight I intend to test a fix for that...
> 
> Whacky. :/
> 
> > ...but I also have been tracking a fix for an issue where
> > xfs_inodegc_stop races with either the reclaim inodegc kicker or with an
> > already set-up delayed work timer, with the end result that
> > drain_workqueue sets WQ_DRAINING, someone (not the inodegc worker
> > itself) tries to queue_work the inodegc worker to the draining
> > workqueue, and we get a kernel bug message and the fs livelocks.
> 
> Yes, I've noticed that but haven't had time to fix it - disabling
> inodegc whilst stuff is still in progress, after we've checked that
> inodegc is enabled, is racy...
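
For context, the window being described is the usual drain-vs-queue
hazard.  A schematic sketch only - the names below are illustrative and
this is not the actual xfs_inodegc_stop() code:

#include <linux/workqueue.h>

/*
 * Interleaving that triggers the bug, roughly:
 *
 *   stop path                            reclaim kicker / delayed timer
 *   ---------                            ------------------------------
 *   clear the "inodegc enabled" flag
 *                                        still sees inodegc as enabled
 *   drain_workqueue(gc_wq)
 *     -> wq marked WQ_DRAINING
 *                                        queue_work(gc_wq, ...)
 *                                        -> kernel bug message about
 *                                           queueing on a draining
 *                                           workqueue, fs livelocks
 */
static void inodegc_stop_sketch(struct workqueue_struct *gc_wq)
{
	/* Any queue_work() racing in after this point hits WQ_DRAINING. */
	drain_workqueue(gc_wq);
}
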
> 
> > I've also been trying to fix that problem that Ritesh mentioned months
> > ago where if we manage to mount the fs cleanly but there are unlinked
> > inodes, we'll eventually fall over when the incore unlinked list fails
> > to find those lingering unlinked inodes.
> 
> Right, I haven't had time to get to that either. IIRC it fails
> to clear the bucket because we don't feed the error from the inodegc
> context back to the recovery code.
> 
> > I also added a su=128k,sw=4 config to the fstests fleet and am now
> > trying to fix all the fstests bugs that produce incorrect test failures.
> 
> The other thing I noticed is that a couple of the FIEMAP tests fail
> because they find data blocks where they expect holes, such as:
> 
> generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad)
>     --- tests/generic/225.out   2022-12-21 15:53:25.479044361 +1100
>     +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad      2023-04-26 04:24:31.426016818 +1000
>     @@ -1,3 +1,79 @@
>      QA output created by 225
>      fiemap run without preallocation, with sync
>     +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35
>     +ERROR: found an allocated extent where a hole should be: 35
>     +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH'
>     +logical: [      27..      27] phys:       67..      67 flags: 0x000 tot: 1
>     +logical: [      29..      31] phys:       69..      71 flags: 0x000 tot: 3
>     ...
>     (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad'  to see the entire diff)
> 
> I haven't looked into this yet, but nothing is reporting data
> corruption, so I suspect it's just the stripe-aligned allocation
> leaving unwritten extents in places where the test expects holes to
> exist...

That's the FIEMAP tester program not expecting that areas of the file
it didn't write to can have unwritten extents mapped.  I'm testing
patches to fix all of that tonight too, if I can ever get these
%#@%)#%!!! orchestration scripts to work correctly.
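
For anyone reproducing this by hand, the distinction the tester misses
is visible directly via the FIEMAP ioctl: extents flagged
FIEMAP_EXTENT_UNWRITTEN are allocated space that reads back as zeroes,
not data.  A minimal userspace sketch (the default file name is just an
example):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Room for the header plus a handful of extent records. */
	unsigned int nr = 32;
	struct fiemap *fm = calloc(1, sizeof(*fm) +
				   nr * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;

	fm->fm_start = 0;
	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = nr;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		/*
		 * Unwritten extents are allocated space that reads back as
		 * zeroes - they are not "data where a hole should be".
		 */
		printf("%llu..%llu %s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)(fe->fe_logical + fe->fe_length - 1),
		       (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
				"unwritten" : "written");
	}
	free(fm);
	close(fd);
	return 0;
}
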

> > > > ...oh and the new test vms that run this config failed to finish for
> > > > some reason.  Sigh.
> > > 
> > > Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing
> > > the xfs_repair process allows everything to keep going. I suspect
> > > it's a prefetch race/deadlock...
> > 
> > <nod> I periodically catch xfs_repair deadlocked on an xfs_buf lock
> > where the pthread mutex says the lock is owned by a thread that is no
> > longer running.
> 
> So we leaked a buffer lock somewhere?

Yep.  It also shows up every now and then in the fstests that fuzz an
entire directory block and see if repair will fix it.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx


