Re: [PATCH] xfs: fix soft lockup via spinning in filestream ag selection loop

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Apr 22, 2022 at 09:00:21AM -0700, Darrick J. Wong wrote:
> On Fri, Apr 22, 2022 at 10:12:26AM -0400, Brian Foster wrote:
> > The filestream AG selection loop uses pagf data to aid in AG
> > selection, which depends on pagf initialization. If the in-core
> > structure is not initialized, the caller invokes the AGF read path
> > to do so and carries on. If another task enters the loop and finds
> > a pagf init already in progress, the AGF read returns -EAGAIN and
> > the task continues the loop. This does not increment the current ag
> > index, however, which means the task spins on the current AGF buffer
> > until unlocked.
> > 
> > If the AGF read I/O submitted by the initial task happens to be
> > delayed for whatever reason, this results in soft lockup warnings
> 
> Is there a specific 'whatever reason' going on here?
> 

Presumably.. given this seems to reproduce reliably or not at all in
certain environments/configs, my suspicion was that either the timing of
the test changes enough such that some other task involved with the test
is able to load the bdev, or otherwise timing changes just enough to
trigger the pagf_init race and the subsequent spinning is what
exacerbates the delay (i.e. burning cpu and subsequent soft lockup BUG
starve out some part(s) of the I/O submission/completion processing).
I've no tangible evidence for either aside from the latter seems fairly
logical when you consider that the test consistently completes in 3-4
seconds with the fix in place, but without it we consistently hit
multiple instances of the soft lockup detector (on ~20s intervals IIRC)
and the system seems to melt down indefinitely. *shrug*

Brian

> > via the spinning task. This is reproduced by xfs/170. To avoid this
> > problem, fix the AGF trylock failure path to properly iterate to the
> > next AG. If a task iterates all AGs without making progress, the
> > trylock behavior is dropped in favor of blocking locks and thus a
> > soft lockup is no longer possible.
> > 
> > Fixes: f48e2df8a877ca1c ("xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers")
> 
> Ooops, this was a major braino on my part.
> Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> --D
> 
> > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > ---
> > 
> > I included the Fixes: tag because this looks like a regression in said
> > commit, but I've not explicitly verified.
> > 
> > Brian
> > 
> >  fs/xfs/xfs_filestream.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
> > index 6a3ce0f6dc9e..be9bcf8a1f99 100644
> > --- a/fs/xfs/xfs_filestream.c
> > +++ b/fs/xfs/xfs_filestream.c
> > @@ -128,11 +128,12 @@ xfs_filestream_pick_ag(
> >  		if (!pag->pagf_init) {
> >  			err = xfs_alloc_pagf_init(mp, NULL, ag, trylock);
> >  			if (err) {
> > -				xfs_perag_put(pag);
> > -				if (err != -EAGAIN)
> > +				if (err != -EAGAIN) {
> > +					xfs_perag_put(pag);
> >  					return err;
> > +				}
> >  				/* Couldn't lock the AGF, skip this AG. */
> > -				continue;
> > +				goto next_ag;
> >  			}
> >  		}
> >  
> > -- 
> > 2.34.1
> > 
> 




[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux