On Tue, Mar 15, 2022 at 01:03:21PM -0700, Darrick J. Wong wrote: > On Tue, Mar 15, 2022 at 05:42:41PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > I've been chasing a recent resurgence in generic/388 recovery > > failure and/or corruption events. The events have largely been > > recoveryloop, the gift that keeps on giving... *nod* > > The fundamental problem here is that we are using the wrong shutdown > > checks for log items. We've long conflated mount shutdown with log > > shutdown state, and I started separating that recently with the > > atomic shutdown state changes in commit b36d4651e165 ("xfs: make > > forced shutdown processing atomic"). The changes in that commit > > series are directly responsible for being able to diagnose this > > issue because it clearly separated mount shutdown from log shutdown. > > > > Essentially, once we start cancelling writeback of log items and > > removing them from the AIL because the filesystem is shut down, we > > *cannot* update the journal because we may have cancelled the items > > that pin the tail of the log. That moves the tail of the log > > forwards without having written the metadata back, hence we have > > corrupt in memory state and writing to the journal propagates that > > to the on-disk state. > > > > What commit b36d4651e165 makes clear is that log item state needs to > > change relative to log shutdown, not mount shutdown. IOWs, anything > > that aborts metadata writeback needs to check log shutdown state > > because log items directly affect log consistency. Having them check > > mount shutdown state introduces the above race condition where we > > cancel metadata writeback before the log shuts down. > > > > To fix this, this patch works through all log items and converts > > shutdown checks to use xlog_is_shutdown() rather than > > xfs_is_shutdown(), so that we don't start aborting metadata > > writeback before we shut off journal writes. > > Once the log has shut down, is there any reason we shouldn't consider > the filesystem shut down too? > > IOWs, should xfs_is_shutdown be doing something like this: > > bool > xfs_is_shutdown(struct xfs_mount *mp) > { > return test_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate) || > xlog_is_shutdown(mp->m_log); > } Not necessary - the way the shutdown code runs now we guarantee that XFS_OPSTATE_SHUTDOWN is set *before* we set XLOG_IO_ERROR. Hence we'll never see XLOG_IO_ERROR without XFS_OPSTATE_SHUTDOWN. > I could very easily envision myself reintroducing bugs w.r.t. > {xfs,xlog}_is_shutdown because it's not immediately obvious to me > (particularly in xfs_buf.c) which one I ought to use. Yeah, I can't give you a bright line answer to that right now. We've smeared the abstraction between log and mount for a long while now, and the result is that it's not clear what is log and what is mount functionality. > Another way to put this is: what needs to succeed between the point > where we set OPSTATE_SHUTDOWN and XLOG_IO_ERROR? Is the answer to that > "any log IO that was initiated right up until we actually set > XLOG_IO_ERROR"? That's one half - the other half is.... > Which means random parts of the buffer cache, and the > inode/dquot flush code? > > IOWs the log flush and any AIL writeback that was in progress? ... yeah, this. Like the CIL, the AIL belongs to the log, not to the mount. Similarly, log items belong to the log, not the transaction subsystem. The transaction subsystem is the interface layer between the mount and the log - code from above that interacts with transaction knows only about mounts and so they all use xfs_is_shutdown(). The transactions interface with the log via log tickets and log items, which are provided by the log, not the transaction subsystem. Anything that operates on or manages the log, log tickets or log items should typically use xlog_is_shutdown(). This means subsystems that are used both from the mount and log log levels (e.g. xfs_buf.c) has a difficult line to straddle. However, it's worth noting that high level transaction buffer read side does mount shutdown checks (e.g. in xfs_trans_read_buf_map()) and so that largely allows the low level buffer code to only have to care about log level shutdowns. Hence the check in __xfs_buf_submit() is converted to a log level check so that it doesn't abort buffer log item writeback before the log is shut down. Hence I think working out what the right thing to do is short term pain while we work through re-establishing the log vs mount abstractions correctly. I've got various patchsets I've been working on over the past year that clean a fair bit of this this up. However, they are kindai intertwined through the patchsets that provide CIL scalability, intent whiteouts, in-memory iunlink intents, log ticket cleanups, log ticket/grant head scalability (byte tracking, not LSNs), moving AIL push targeting into the AIL instead of setting targets from transaction reservation, moving iclogs behind the CIL and removing log force shenanigans, etc. because I've done cleanups as I've touched various bits of the code... > > @@ -3659,7 +3660,7 @@ xfs_iflush_cluster( > > * AIL, leaving a dirty/unpinned inode attached to the buffer > > * that otherwise looks like it should be flushed. > > */ > > - if (xfs_is_shutdown(mp)) { > > + if (xlog_is_shutdown(mp->m_log)) { > > xfs_iunpin_wait(ip); > > xfs_iflush_abort(ip); > > xfs_iunlock(ip, XFS_ILOCK_SHARED); > > @@ -3685,9 +3686,19 @@ xfs_iflush_cluster( > > } > > > > if (error) { > > + /* > > + * Shutdown first so we kill the log before we release this > > + * buffer. If it is an INODE_ALLOC buffer and pins the tail > > Does inode flush failure leading to immediate shutdown need to happen > with the dquot code too? I /think/ we don't? Because all we do is > remove the dirty flag on the dquot and kill the log? The dquot flush code already does an immediate shutdown on flush failure, too. see xfs_qm_dqflush(): out_abort: dqp->q_flags &= ~XFS_DQFLAG_DIRTY; xfs_trans_ail_delete(lip, 0); xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx