Re: [PATCH] xfs: ensure submit buffers on LSN boundaries in error handlers

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 10 Jan 2024 08:43:31 +1100

On Tue, Jan 09, 2024 at 09:52:37AM -0500, Brian Foster wrote:
> > 
> > The problem we need to solve is how we preserve the necessary
> > anti-recovery behaviour when we have multiple checkpoints that can
> > have the same LSN and objects are updated immediately on recovery?
> > 
> > I suspect that we need to track that the checkpoint being recovered
> > has a duplicate start LSN (i.e. in the struct xlog_recover) and
> > modify the anti-recovery LSN check to take this into account. i.e.
> > we can really only skip recovery of the first checkpoint at any
> > given LSN because we cannot disambiguate an LSN updated by the first
> > checkpoint at that LSN and the metadata already being up to date on
> > disk in the second and subsequent checkpoints at the same start
> > LSN.
> > 
> > There are likely to be other solutions - anyone have a different
> > idea on how we might address this?
> > 
> 
> It's been a while since I've looked at any of this and I haven't waded
> through all of the details, so I could easily be missing something, but
> what exactly is wrong with the approach of the patch as posted?

That it fails to address the fact that the code as implemented
violates the "only submit buffers on LSN change" invariant. Hence we
have silent failure to recover of the second set of changes
to a log item  recorded in the multiple checkpoints that have the
same start LSN.

The original problem described in the commit - a shutdown due to a
freespace btree record corruption - has been something we've seen
semi-regularly for a few years now. We've never got to the
bottom of the problem because we've lacked a reliable reproducer for
the issue.

The analysis and debug information provided by out by Long indicates
that when multiple checkpoints start at the same LSN, the objects in
the later checkpoints (based on commit record ordering) won't get
replayed because the LSN in the object has already been updated by
the first checkpoint. Hence they skip recovery in the second (and
subsequent) checkpoints at the same start LSN.

In a lot of these cases, the object will be logged again later in
the recovery process, thereby overwriting the corruption caused by
skipping a checkpointed update. Hence this will only be exposed in
normal situations if the silent recovery failure occurs on the last
modification of the object in the journal.

This is why it's a rare failure to be seen in production systems,
but it is something that hindsight tells us has been occurring given
the repeated reports of unexplainable single record free space btree
corruption we've had over the past few years.

> Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
> boundaries") basically created a new invariant for log recovery where
> buffers are allowed to be written only once per LSN. The risk otherwise
> is that a subsequent update with a matching LSN would not be correctly
> applied due to the v5 LSN ordering rules. Since log recovery processes
> transactions (using terminology/granularity as defined by the
> implementation of xlog_recover_commit_trans()), this required changes to
> accommodate any of the various possible runtime logging scenarios that
> could cause a buffer to have multiple entries in the log associated with
> a single LSN, the details of which were orthogonal to the fix.
> 
> The functional change therefore was that rather than to process and
> submit "transactions" in sequence during recovery, the pending buffer
> list was lifted to a higher level in the code, a tracking field was
> added for the "current LSN" of log recovery, and only once we cross a
> current LSN boundary are we allowed to submit the set of buffers
> processed for the prior LSN. The reason for this logic is that seeing
> the next LSN was really the only way we know we're done processing items
> for a particular LSN.

Yes, and therein lies one of the problems with the current
implementation - this "lsn has changed" logic is incorrect.

> If I understand the problem description correctly, the issue here is
> that if an error is encountered in the middle of processing items for
> some LSN A, we bail out of recovery and submit the pending buffers on
> the way out.  If we haven't completed processing all items for LSN A
> before failing, however, then we've just possibly violated the "write
> once per LSN" invariant that protects from corrupting the fs.

The error handling and/or repeated runs of log recovery simply
exposes the problem - these symptoms are not the problem that needs
to be fixed.

The issue is that the code as it stands doesn't handle object
recovery from multiple checkpoints with the same start lsn. The
easiest way to understand this is to look at the buffer submit logic
on completion of a checkpoint:

	if (log->l_recovery_lsn != trans->r_lsn &&
            ohead->oh_flags & XLOG_COMMIT_TRANS) {
                error = xfs_buf_delwri_submit(buffer_list);
                if (error)
                        return error;
                log->l_recovery_lsn = trans->r_lsn;
        }

This submits the buffer list on the first checkpoint that completes
with a new start LSN, not when all the checkpoints with the same
start LSN complete. i.e.:

checkpoint  start LSN	commit lsn	submission on commit record
A		32	  63		buffer list for A
B		64	  68		buffer list for B
C		64	  92		nothing, start lsn unchanged
D		64	 127		nothing, start lsn unchanged
E		128	 192		buffer list for C, D and E

IOWs, the invariant "don't submit buffers until LSN changes" is not
actually implemented correctly by this code. This is the obvious
aspect of the problem, but addressing buffer submission doesn't
actually fix the problem.

That is, changing buffer submission to be correct doesn't address
the fact that we've already done things like updated the LSN in
inodes and dquots during recovery of those objects. Hence,
regardless of whether we submit the buffers or not, changes to
non-buffer objects in checkpoints C and D will never get recovered
directly if they were originally modified in checkpoint B.

This is the problem we need to address: if we have multiple
checkpoints at the same start LSN, we need to ensure that all the
changes to any object in any of the checkpoints at that start LSN
are recovered. This is what we are not doing, and this is the root
cause of the problem....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx