Re: [RFC PATCH] xfs: load uncached unlinked inodes into memory on demand

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 29 Aug 2023 19:04:30 -0700

On Wed, Aug 30, 2023 at 10:13:28AM +1000, Dave Chinner wrote:
> On Tue, Aug 29, 2023 at 04:20:43PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > shrikanth hegde reports that filesystems fail shortly after mount with
> > the following failure:
> > 
> > 	WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]
> > 
> > This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:
> > 
> > 	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> > 	if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }
> > 
> > From diagnostic data collected by the bug reporters, it would appear
> > that we cleanly mounted a filesystem that contained unlinked inodes.
> > Unlinked inodes are only processed as a final step of log recovery,
> > which means that clean mounts do not process the unlinked list at all.
> > 
> > Prior to the introduction of the incore unlinked lists, this wasn't a
> > problem because the unlink code would (very expensively) traverse the
> > entire ondisk metadata iunlink chain to keep things up to date.
> > However, the incore unlinked list code complains when it realizes that
> > it is out of sync with the ondisk metadata and shuts down the fs, which
> > is bad.
> > 
> > Ritesh proposed to solve this problem by unconditionally parsing the
> > unlinked lists at mount time, but this imposes a mount time cost for
> > every filesystem to catch something that should be very infrequent.
> > Instead, let's target the places where we can encounter a next_unlinked
> > pointer that refers to an inode that is not in cache, and load it into
> > cache.
> > 
> > Note: This patch does not address the problem of iget loading an inode
> > from the middle of the iunlink list and needing to set i_prev_unlinked
> > correctly.
> > 
> > Link: https://lore.kernel.org/linux-xfs/e5004868-4a03-93e5-5077-e7ed0e533996@xxxxxxxxxxxxxxxxxx/
> > Reported-by: shrikanth hegde <sshegde@xxxxxxxxxxxxxxxxxx>
> > Triaged-by: Ritesh Harjani <ritesh.list@xxxxxxxxx>
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_inode.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  fs/xfs/xfs_trace.h |   25 +++++++++++++++++++
> >  2 files changed, 92 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 6ee266be45d4..3ab140ec09bb 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1829,12 +1829,17 @@ xfs_iunlink_lookup(
> >  
> >  	rcu_read_lock();
> >  	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> > +	if (!ip) {
> > +		/* Caller can handle inode not being in memory. */
> > +		rcu_read_unlock();
> > +		return NULL;
> > +	}
> >  
> >  	/*
> > -	 * Inode not in memory or in RCU freeing limbo should not happen.
> > -	 * Warn about this and let the caller handle the failure.
> > +	 * Inode in RCU freeing limbo should not happen.  Warn about this and
> > +	 * let the caller handle the failure.
> >  	 */
> > -	if (WARN_ON_ONCE(!ip || !ip->i_ino)) {
> > +	if (WARN_ON_ONCE(!ip->i_ino)) {
> >  		rcu_read_unlock();
> >  		return NULL;
> >  	}
> 
> I think we should still log a message about this situation, as it implies
> that we had an unrecovered unlinked list on the filesystem and that
> should "never happen" in normal conditions.
> 
> i.e. something like:
> 
> XFS(dev): Found unrecovered unlinked inodes in AG X. Runtime recovery initiated.
> 
> which uses a perag state flag to only issue the message once per AG
> per mount. At least this way, if we get weird stuff happening
> because of loading an inode in the middle of an unlinked list (the
> unhandled prev_agino case) we know why weird stuff might be
> happening...

<nod> Ok, I'll make that explicit.

> 
> > @@ -1902,6 +1907,60 @@ xfs_iunlink_update_bucket(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Load the inode @next_agino into the cache and set its prev_unlinked pointer
> > + * to @prev_agino.  Caller must hold the AGI to synchronize with other changes
> > + * to the unlinked list.
> > + */
> > +STATIC int
> > +xfs_iunlink_reload_next(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_buf		*agibp,
> > +	xfs_agino_t		prev_agino,
> > +	xfs_agino_t		next_agino)
> > +{
> > +	struct xfs_perag	*pag = agibp->b_pag;
> > +	struct xfs_mount	*mp = pag->pag_mount;
> > +	struct xfs_inode	*next_ip = NULL;
> > +	xfs_ino_t		ino;
> > +	int			error;
> > +
> > +	ASSERT(next_agino != NULLAGINO);
> > +
> > +#ifdef DEBUG
> > +	rcu_read_lock();
> > +	next_ip = radix_tree_lookup(&pag->pag_ici_root, next_agino);
> > +	ASSERT(next_ip == NULL);
> > +	rcu_read_unlock();
> > +#endif
> > +
> > +	ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, next_agino);
> > +	error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &next_ip);
> > +	if (error)
> > +		return error;
> 
> WHy are we using XFS_IGET_UNTRUSTED here? A comment explaining why
> we don't trust the agino on th eunlinked list we are about to try to
> recover (i.e. trust!) would be good.

	/*
	 * Use an untrusted lookup just to be cautious in case the AGI
	 * has been corrupted and now points at a free inode.  That
	 * shouldn't happen, but we'd rather shut down now since we're
	 * already running in a weird situation.
	 */

> > +	/* If this is not an unlinked inode, something is very wrong. */
> > +	if (VFS_I(next_ip)->i_nlink != 0) {
> > +		error = -EFSCORRUPTED;
> > +		goto rele;
> > +	}
> 
> *nod*
> 
> > +
> > +	next_ip->i_prev_unlinked = prev_agino;
> > +	trace_xfs_iunlink_reload_next(next_ip);
> > +rele:
> > +	/*
> > +	 * We're running in transaction context, so we cannot run any inode
> > +	 * release code.  Clear DONTCACHE on this inode to prevent the VFS from
> > +	 * initiating writeback and to force the irele to push this inode to
> > +	 * the LRU instead of dropping it immediately.
> > +	 */
> > +	spin_lock(&VFS_I(next_ip)->i_lock);
> > +	VFS_I(next_ip)->i_state &= ~I_DONTCACHE;
> > +	spin_unlock(&VFS_I(next_ip)->i_lock);
> > +	xfs_irele(next_ip);
> 
> Huh. We just loaded the next_ip into memory - how is it dirty,
> and what writeback will happen? Also, how would I_DONTCACHE get set
> in the first place here?

Ah, that's a historical accident -- originally when I thought the
possibility of unrecovered unlinked inodes was vanishingly small, I
wrote a whole bunch of code into online repair to deal with reloading
the incore list, etc.

When I first started prototyping it, xchk_irele didn't exist yet, so any
time I had to release an inode within a scrub transaction, I had to
manually clear I_DONTCACHE.  That got copied around everywhere in the
scrub code, and then it got copied over when I started working on the
runtime version.  That's been lurking beyond the depths of djwong-wtf
for quite a long time now, and I never got back to it until the heat
started going up after 6.1.

I think here it's not necessary since (as you point out) nobody can
actually dirty the inode, nor can they set DONTCACHE.

> 
> > +	return error;
> > +}
> > +
> >  static int
> >  xfs_iunlink_insert_inode(
> >  	struct xfs_trans	*tp,
> > @@ -1933,6 +1992,8 @@ xfs_iunlink_insert_inode(
> >  	 * inode.
> >  	 */
> >  	error = xfs_iunlink_update_backref(pag, agino, next_agino);
> > +	if (error == -ENOLINK)
> > +		error = xfs_iunlink_reload_next(tp, agibp, agino, next_agino);
> >  	if (error)
> >  		return error;
> 
> Where does this -ENOLINK error come from?
> xfs_iunlink_update_backref() returns either -EFSCORRUPTED or 0. Is
> the patch missing hunks or is it dependent on some other patch that
> does this?

<sigh> I forgot to copy that when I backported this patch from my dev
tree to TOT.  Welllllp thanks for catching that, now I can go restart
the test fleet.

/* Update the prev pointer of the next agino. */
static int
xfs_iunlink_update_backref(
	struct xfs_perag	*pag,
	xfs_agino_t		prev_agino,
	xfs_agino_t		next_agino)
{
	struct xfs_inode	*ip;

	/* No update necessary if we are at the end of the list. */
	if (next_agino == NULLAGINO)
		return 0;

	ip = xfs_iunlink_lookup(pag, next_agino);
	if (!ip)
		return -ENOLINK;

	ip->i_prev_unlinked = prev_agino;
	return 0;
}

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx