On Wed, Aug 30, 2023 at 10:13:28AM +1000, Dave Chinner wrote:
> On Tue, Aug 29, 2023 at 04:20:43PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > shrikanth hegde reports that filesystems fail shortly after mount with
> > the following failure:
> > 
> > 	WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]
> > 
> > This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:
> > 
> > 	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> > 	if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }
> > 
> > From diagnostic data collected by the bug reporters, it would appear
> > that we cleanly mounted a filesystem that contained unlinked inodes.
> > Unlinked inodes are only processed as a final step of log recovery,
> > which means that clean mounts do not process the unlinked list at all.
> > 
> > Prior to the introduction of the incore unlinked lists, this wasn't a
> > problem because the unlink code would (very expensively) traverse the
> > entire ondisk metadata iunlink chain to keep things up to date.
> > However, the incore unlinked list code complains when it realizes that
> > it is out of sync with the ondisk metadata and shuts down the fs, which
> > is bad.
> > 
> > Ritesh proposed to solve this problem by unconditionally parsing the
> > unlinked lists at mount time, but this imposes a mount time cost for
> > every filesystem to catch something that should be very infrequent.
> > Instead, let's target the places where we can encounter a next_unlinked
> > pointer that refers to an inode that is not in cache, and load it into
> > cache.
> > 
> > Note: This patch does not address the problem of iget loading an inode
> > from the middle of the iunlink list and needing to set i_prev_unlinked
> > correctly.
> > 
> > Link: https://lore.kernel.org/linux-xfs/e5004868-4a03-93e5-5077-e7ed0e533996@xxxxxxxxxxxxxxxxxx/
> > Reported-by: shrikanth hegde <sshegde@xxxxxxxxxxxxxxxxxx>
> > Triaged-by: Ritesh Harjani <ritesh.list@xxxxxxxxx>
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_inode.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  fs/xfs/xfs_trace.h |   25 +++++++++++++++++++
> >  2 files changed, 92 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 6ee266be45d4..3ab140ec09bb 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1829,12 +1829,17 @@ xfs_iunlink_lookup(
> >  
> >  	rcu_read_lock();
> >  	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> > +	if (!ip) {
> > +		/* Caller can handle inode not being in memory. */
> > +		rcu_read_unlock();
> > +		return NULL;
> > +	}
> >  
> >  	/*
> > -	 * Inode not in memory or in RCU freeing limbo should not happen.
> > -	 * Warn about this and let the caller handle the failure.
> > +	 * Inode in RCU freeing limbo should not happen. Warn about this and
> > +	 * let the caller handle the failure.
> >  	 */
> > -	if (WARN_ON_ONCE(!ip || !ip->i_ino)) {
> > +	if (WARN_ON_ONCE(!ip->i_ino)) {
> >  		rcu_read_unlock();
> >  		return NULL;
> >  	}
> 
> I think we should still log a message about this situation, as it implies
> that we had an unrecovered unlinked list on the filesystem and that
> should "never happen" in normal conditions.
> 
> i.e. something like:
> 
> XFS(dev): Found unrecovered unlinked inodes in AG X. Runtime recovery initiated.
> 
> which uses a perag state flag to only issue the message once per AG
> per mount. At least this way, if we get weird stuff happening
> because of loading an inode in the middle of an unlinked list (the
> unhandled prev_agino case) we know why weird stuff might be
> happening...

<nod> Ok, I'll make that explicit.
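
For anyone following along, here is a rough userspace sketch of the "once per AG per mount" message Dave describes. It is only a model under stated assumptions, not the kernel change: struct perag_model, its unlinked_warned field and perag_warn_unrecovered_once() are hypothetical names invented for illustration, and a C11 atomic_flag stands in for whatever perag state bit the real patch ends up using.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct xfs_perag; only what the sketch needs. */
struct perag_model {
	unsigned int	agno;
	atomic_flag	unlinked_warned;	/* hypothetical per-mount state bit */
};

/*
 * Log the message the first time this AG trips over an uncached unlinked
 * inode. Returns true if this call emitted the message, false if an earlier
 * call for the same AG already did.
 */
static bool
perag_warn_unrecovered_once(
	struct perag_model	*pag,
	const char		*dev)
{
	assert(pag != NULL);

	/* test-and-set returns the previous value, so only one caller wins. */
	if (atomic_flag_test_and_set(&pag->unlinked_warned))
		return false;

	fprintf(stderr,
"XFS (%s): Found unrecovered unlinked inodes in AG %u. Runtime recovery initiated.\n",
		dev, pag->agno);
	return true;
}
```

The flag is kept per AG rather than per mount so that a second AG with the same problem still gets its own log line, which matches the wording of the suggestion above.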
> 
> > @@ -1902,6 +1907,60 @@ xfs_iunlink_update_bucket(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Load the inode @next_agino into the cache and set its prev_unlinked pointer
> > + * to @prev_agino. Caller must hold the AGI to synchronize with other changes
> > + * to the unlinked list.
> > + */
> > +STATIC int
> > +xfs_iunlink_reload_next(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_buf		*agibp,
> > +	xfs_agino_t		prev_agino,
> > +	xfs_agino_t		next_agino)
> > +{
> > +	struct xfs_perag	*pag = agibp->b_pag;
> > +	struct xfs_mount	*mp = pag->pag_mount;
> > +	struct xfs_inode	*next_ip = NULL;
> > +	xfs_ino_t		ino;
> > +	int			error;
> > +
> > +	ASSERT(next_agino != NULLAGINO);
> > +
> > +#ifdef DEBUG
> > +	rcu_read_lock();
> > +	next_ip = radix_tree_lookup(&pag->pag_ici_root, next_agino);
> > +	ASSERT(next_ip == NULL);
> > +	rcu_read_unlock();
> > +#endif
> > +
> > +	ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, next_agino);
> > +	error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &next_ip);
> > +	if (error)
> > +		return error;
> 
> Why are we using XFS_IGET_UNTRUSTED here? A comment explaining why
> we don't trust the agino on the unlinked list we are about to try to
> recover (i.e. trust!) would be good.

	/*
	 * Use an untrusted lookup just to be cautious in case the AGI
	 * has been corrupted and now points at a free inode. That
	 * shouldn't happen, but we'd rather shut down now since we're
	 * already running in a weird situation.
	 */

> > +	/* If this is not an unlinked inode, something is very wrong. */
> > +	if (VFS_I(next_ip)->i_nlink != 0) {
> > +		error = -EFSCORRUPTED;
> > +		goto rele;
> > +	}
> 
> *nod*
> 
> > +
> > +	next_ip->i_prev_unlinked = prev_agino;
> > +	trace_xfs_iunlink_reload_next(next_ip);
> > +rele:
> > +	/*
> > +	 * We're running in transaction context, so we cannot run any inode
> > +	 * release code. Clear DONTCACHE on this inode to prevent the VFS from
> > +	 * initiating writeback and to force the irele to push this inode to
> > +	 * the LRU instead of dropping it immediately.
> > +	 */
> > +	spin_lock(&VFS_I(next_ip)->i_lock);
> > +	VFS_I(next_ip)->i_state &= ~I_DONTCACHE;
> > +	spin_unlock(&VFS_I(next_ip)->i_lock);
> > +	xfs_irele(next_ip);
> 
> Huh. We just loaded the next_ip into memory - how is it dirty,
> and what writeback will happen? Also, how would I_DONTCACHE get set
> in the first place here?

Ah, that's a historical accident -- originally when I thought the
possibility of unrecovered unlinked inodes was vanishingly small, I
wrote a whole bunch of code into online repair to deal with reloading
the incore list, etc. When I first started prototyping it, xchk_irele
didn't exist yet, so any time I had to release an inode within a scrub
transaction, I had to manually clear I_DONTCACHE. That got copied
around everywhere in the scrub code, and then it got copied over when I
started working on the runtime version. That's been lurking beyond the
depths of djwong-wtf for quite a long time now, and I never got back to
it until the heat started going up after 6.1.

I think here it's not necessary since (as you point out) nobody can
actually dirty the inode, nor can they set DONTCACHE.

> 
> > +	return error;
> > +}
> > +
> >  static int
> >  xfs_iunlink_insert_inode(
> >  	struct xfs_trans	*tp,
> > @@ -1933,6 +1992,8 @@ xfs_iunlink_insert_inode(
> >  	 * inode.
> >  	 */
> >  	error = xfs_iunlink_update_backref(pag, agino, next_agino);
> > +	if (error == -ENOLINK)
> > +		error = xfs_iunlink_reload_next(tp, agibp, agino, next_agino);
> >  	if (error)
> >  		return error;
> 
> Where does this -ENOLINK error come from?
> xfs_iunlink_update_backref() returns either -EFSCORRUPTED or 0. Is
> the patch missing hunks or is it dependent on some other patch that
> does this?

<sigh> I forgot to copy that when I backported this patch from my dev
tree to TOT.
Welllllp thanks for catching that, now I can go restart the test fleet.

/* Update the prev pointer of the next agino. */
static int
xfs_iunlink_update_backref(
	struct xfs_perag	*pag,
	xfs_agino_t		prev_agino,
	xfs_agino_t		next_agino)
{
	struct xfs_inode	*ip;

	/* No update necessary if we are at the end of the list. */
	if (next_agino == NULLAGINO)
		return 0;

	ip = xfs_iunlink_lookup(pag, next_agino);
	if (!ip)
		return -ENOLINK;

	ip->i_prev_unlinked = prev_agino;
	return 0;
}

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
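
To make the -ENOLINK flow discussed in this thread easier to follow, here is a small userspace model of it. This is a sketch under loud assumptions and not the kernel code: struct model_inode, struct model_ag and the model_*() functions are all hypothetical, a plain array stands in for the incore radix tree, a second array stands in for the ondisk inodes that xfs_iget() would read, and EFSCORRUPTED is defined locally. Unlike the patch, the model checks the link count before caching the inode, purely to keep the toy short.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NULLAGINO	((uint32_t)-1)
#define EFSCORRUPTED	117	/* assumption: same value as Linux EUCLEAN */
#define MODEL_NR_INODES	8	/* hypothetical size of the toy AG */

/* Toy inode: only the fields this sketch touches. */
struct model_inode {
	uint32_t		nlink;
	uint32_t		prev_unlinked;
};

/* Toy AG: "disk" holds every inode, "cache" may be missing some of them. */
struct model_ag {
	struct model_inode	disk[MODEL_NR_INODES];
	struct model_inode	*cache[MODEL_NR_INODES];
};

/* Models xfs_iunlink_update_backref(): -ENOLINK means "not in cache". */
static int
model_update_backref(struct model_ag *ag, uint32_t prev_agino,
		uint32_t next_agino)
{
	/* No update necessary at the end of the list. */
	if (next_agino == NULLAGINO)
		return 0;

	assert(next_agino < MODEL_NR_INODES);
	if (!ag->cache[next_agino])
		return -ENOLINK;

	ag->cache[next_agino]->prev_unlinked = prev_agino;
	return 0;
}

/* Models xfs_iunlink_reload_next(): load, sanity check, set the backref. */
static int
model_reload_next(struct model_ag *ag, uint32_t prev_agino,
		uint32_t next_agino)
{
	struct model_inode	*ip;

	assert(next_agino != NULLAGINO && next_agino < MODEL_NR_INODES);

	ip = &ag->disk[next_agino];
	/* On the unlinked list but still linked: something is very wrong. */
	if (ip->nlink != 0)
		return -EFSCORRUPTED;

	ag->cache[next_agino] = ip;
	ip->prev_unlinked = prev_agino;
	return 0;
}

/* Caller pattern from the patch: reload only when the lookup said -ENOLINK. */
static int
model_insert_backref(struct model_ag *ag, uint32_t agino, uint32_t next_agino)
{
	int	error = model_update_backref(ag, agino, next_agino);

	if (error == -ENOLINK)
		error = model_reload_next(ag, agino, next_agino);
	return error;
}
```

The point of the distinct error code is that "inode not in cache" becomes a recoverable condition the caller can act on, while any other failure still propagates unchanged.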