Re: [PATCH] xfs: fix livelock in delayed allocation at ENOSPC

On Thu, Apr 27, 2023 at 09:01:35AM +1000, Dave Chinner wrote:
> On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote:
> > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > 
> > > On a filesystem with a non-zero stripe unit and a large sequential
> > > write, delayed allocation will set a minimum allocation length of
> > > the stripe unit. If allocation fails because there are no extents
> > > long enough for an aligned minlen allocation, it is supposed to
> > > fall back to unaligned allocation, which allows single-block
> > > extents to be allocated.
> > > 
> > > When the allocator code was rewritten in the 6.3 cycle, this
> > > fallback was broken - the old code used args->fsbno as both the
> > > allocation target and the allocation result, while the new code
> > > passes the target as a separate parameter. The conversion didn't
> > > handle the aligned->unaligned fallback path correctly - it reset
> > > args->fsbno to the target fsbno on failure, which broke allocation
> > > failure detection in the high level code, so it never fell back to
> > > unaligned allocations.
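> > > 
> > > In rough pseudo-code (a simplified sketch, not the exact kernel
> > > code - the fallback condition shown is illustrative), the fallback
> > > is supposed to work like this:
> > > 
> > >     error = xfs_alloc_vextent_near_bno(args, target);
> > >     if (error)
> > >         return error;
> > >     /* the allocator sets args->fsbno = NULLFSBLOCK on failure */
> > >     if (args->fsbno == NULLFSBLOCK && stripe_align) {
> > >         /* no aligned minlen extent available - retry unaligned */
> > >         args->minlen = 1;
> > >         args->alignment = 1;
> > >         error = xfs_alloc_vextent_near_bno(args, target);
> > >     }
> > > 
> > > Resetting args->fsbno back to the target on failure defeats the
> > > NULLFSBLOCK check, so the unaligned retry never happens.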
> > > 
> > > This resulted in writeback looping: it tried to allocate an
> > > aligned block, got a false positive success, and tried to insert
> > > the result in the BMBT. The insert did nothing because the extent
> > > was already in the BMBT (the merge results in an unchanged extent),
> > > so the prior extent was returned to the conversion code as the
> > > current iomap.
> > > 
> > > Because the returned iomap didn't cover the offset we tried to
> > > map, xfs_convert_blocks() then retried the allocation, which failed
> > > in the same way, and now we have a livelock.
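> > > 
> > > Schematically, the retry loop in the conversion code looks like
> > > this (a simplified sketch of the real loop):
> > > 
> > >     do {
> > >         error = xfs_bmapi_convert_delalloc(ip, whichfork, offset,
> > >                                            iomap, seq);
> > >         if (error)
> > >             return error;
> > >     } while (iomap->offset + iomap->length <= offset);
> > > 
> > > With the allocation failure masked, the iomap never advances past
> > > offset and the loop never terminates.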
> > > 
> > > Reported-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > Insofar as this has revealed a whole ton of *more* problems in mkfs,
> > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> Thanks, I've added this to for-next and I'll include it in the pull
> req to Linus tomorrow because I don't want to expose everyone using
> merge window kernels to this ENOSPC issue even for a short while.
> 
> > Specifically: if I set su=128k,sw=4, some tests will try to format a
> > 512M filesystem.  This results in an 8-AG filesystem with a log that
> > fills almost, but not quite, an entire AG.  That AG then ends up with
> > an empty bnobt, an empty AGFL, and 25 missing blocks...
> 
> I used su=64k,sw=2 so I didn't see those specific issues. Mostly I
> see failures due to mkfs warnings like this:
> 
>     +Warning: AG size is a multiple of stripe width.  This can cause performance
>     +problems by aligning all AGs on the same disk.  To avoid this, run mkfs with
>     +an AG size that is one stripe unit smaller or larger, for example 129248.

Yeah, I noticed that one, and am testing a patch to quiet down mkfs a
little bit.

I also caught a bug in the AG formatting code where the bnobt gets
written out with zero records if the log happens to start beyond
m_ag_prealloc_size and end at EOAG.
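Roughly, the bad case is something like this (illustrative pseudo-code
with made-up names like log_start/log_end, not the actual AG
formatting logic):

    /*
     * The internal log consumes everything after the preallocated
     * AG header blocks, so there is no free space extent left for
     * the formatting code to record.
     */
    if (log_start >= m_ag_prealloc_size && log_end == eoag)
        bnobt_nrecs = 0;    /* bnobt written out with zero records */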

I also noticed that the percpu inodegc workers occasionally run on the
wrong CPU, but only on arm.  Tonight I intend to test a fix for that...

...but I've also been tracking a fix for an issue where
xfs_inodegc_stop races with either the reclaim inodegc kicker or with
an already set-up delayed work timer. The end result is that
drain_workqueue sets WQ_DRAINING, someone (not the inodegc worker
itself) tries to queue_work the inodegc worker on the now-draining
workqueue, and we get a kernel bug message and the fs livelocks.
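
A rough timeline of the race (a hypothetical interleaving, simplified
from the real code):

    xfs_inodegc_stop()                reclaim kicker / delayed timer
    --------------------------------  --------------------------------
    drain_workqueue(m_inodegc_wq)
      sets WQ_DRAINING
                                      queue_work(m_inodegc_wq, ...)
                                        -> kernel bug: queueing work
                                           on a draining workqueue
                                        -> fs livelocks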

I've also been trying to fix that problem that Ritesh mentioned months
ago where if we manage to mount the fs cleanly but there are unlinked
inodes, we'll eventually fall over when the incore unlinked list fails
to find those lingering unlinked inodes.

I also added a su=128k,sw=4 config to the fstests fleet and am now
trying to fix all the fstests bugs that produce incorrect test failures.

> > ...oh and the new test vms that run this config failed to finish for
> > some reason.  Sigh.
> 
> Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing
> the xfs_repair process allows everything to keep going. I suspect
> it's a prefetch race/deadlock...

<nod> I periodically catch xfs_repair deadlocked on an xfs_buf lock
where the pthread mutex says the lock is owned by a thread that is no
longer running.

--D

> -Dave.
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx


