Re: [PATCH] xfs: ensure submit buffers on LSN boundaries in error handlers

Long Li <leo.lilong@xxxxxxxxxx> · Wed, 10 Jan 2024 15:03:24 +0800

On Wed, Jan 10, 2024 at 08:43:31AM +1100, Dave Chinner wrote:
> On Tue, Jan 09, 2024 at 09:52:37AM -0500, Brian Foster wrote:
> > > 
> > > The problem we need to solve is how we preserve the necessary
> > > anti-recovery behaviour when we have multiple checkpoints that can
> > > have the same LSN and objects are updated immediately on recovery?
> > > 
> > > I suspect that we need to track that the checkpoint being recovered
> > > has a duplicate start LSN (i.e. in the struct xlog_recover) and
> > > modify the anti-recovery LSN check to take this into account. i.e.
> > > we can really only skip recovery of the first checkpoint at any
> > > given LSN because we cannot disambiguate an LSN updated by the first
> > > checkpoint at that LSN and the metadata already being up to date on
> > > disk in the second and subsequent checkpoints at the same start
> > > LSN.
> > > 
> > > There are likely to be other solutions - anyone have a different
> > > idea on how we might address this?
> > > 
> > 
> > It's been a while since I've looked at any of this and I haven't waded
> > through all of the details, so I could easily be missing something, but
> > what exactly is wrong with the approach of the patch as posted?
> 
> That it fails to address the fact that the code as implemented
> violates the "only submit buffers on LSN change" invariant. Hence we
> have silent failure to recover of the second set of changes
> to a log item  recorded in the multiple checkpoints that have the
> same start LSN.
> 
> The original problem described in the commit - a shutdown due to a
> freespace btree record corruption - has been something we've seen
> semi-regularly for a few years now. We've never got to the
> bottom of the problem because we've lacked a reliable reproducer for
> the issue.
> 
> The analysis and debug information provided by out by Long indicates
> that when multiple checkpoints start at the same LSN, the objects in
> the later checkpoints (based on commit record ordering) won't get
> replayed because the LSN in the object has already been updated by
> the first checkpoint. Hence they skip recovery in the second (and
> subsequent) checkpoints at the same start LSN.
> 
> In a lot of these cases, the object will be logged again later in
> the recovery process, thereby overwriting the corruption caused by
> skipping a checkpointed update. Hence this will only be exposed in
> normal situations if the silent recovery failure occurs on the last
> modification of the object in the journal.
> 
> This is why it's a rare failure to be seen in production systems,
> but it is something that hindsight tells us has been occurring given
> the repeated reports of unexplainable single record free space btree
> corruption we've had over the past few years.
> 
> > Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
> > boundaries") basically created a new invariant for log recovery where
> > buffers are allowed to be written only once per LSN. The risk otherwise
> > is that a subsequent update with a matching LSN would not be correctly
> > applied due to the v5 LSN ordering rules. Since log recovery processes
> > transactions (using terminology/granularity as defined by the
> > implementation of xlog_recover_commit_trans()), this required changes to
> > accommodate any of the various possible runtime logging scenarios that
> > could cause a buffer to have multiple entries in the log associated with
> > a single LSN, the details of which were orthogonal to the fix.
> > 
> > The functional change therefore was that rather than to process and
> > submit "transactions" in sequence during recovery, the pending buffer
> > list was lifted to a higher level in the code, a tracking field was
> > added for the "current LSN" of log recovery, and only once we cross a
> > current LSN boundary are we allowed to submit the set of buffers
> > processed for the prior LSN. The reason for this logic is that seeing
> > the next LSN was really the only way we know we're done processing items
> > for a particular LSN.
> 
> Yes, and therein lies one of the problems with the current
> implementation - this "lsn has changed" logic is incorrect.
> 
> > If I understand the problem description correctly, the issue here is
> > that if an error is encountered in the middle of processing items for
> > some LSN A, we bail out of recovery and submit the pending buffers on
> > the way out.  If we haven't completed processing all items for LSN A
> > before failing, however, then we've just possibly violated the "write
> > once per LSN" invariant that protects from corrupting the fs.
> 
> The error handling and/or repeated runs of log recovery simply
> exposes the problem - these symptoms are not the problem that needs
> to be fixed.
> 
> The issue is that the code as it stands doesn't handle object
> recovery from multiple checkpoints with the same start lsn. The
> easiest way to understand this is to look at the buffer submit logic
> on completion of a checkpoint:
> 
> 	if (log->l_recovery_lsn != trans->r_lsn &&
>             ohead->oh_flags & XLOG_COMMIT_TRANS) {
>                 error = xfs_buf_delwri_submit(buffer_list);
>                 if (error)
>                         return error;
>                 log->l_recovery_lsn = trans->r_lsn;
>         }
> 
> This submits the buffer list on the first checkpoint that completes
> with a new start LSN, not when all the checkpoints with the same
> start LSN complete. i.e.:
> 
> checkpoint  start LSN	commit lsn	submission on commit record
> A		32	  63		buffer list for A
> B		64	  68		buffer list for B
> C		64	  92		nothing, start lsn unchanged
> D		64	 127		nothing, start lsn unchanged
> E		128	 192		buffer list for C, D and E
> 

I have different understanding about this code. In the first checkpoint's
handle on commit record, buffer_list is empty and l_recovery_lsn update to
the first checkpoint's lsn, the result is that each checkpoint's submit
logic try to submit the buffers which was added to buffer list in checkpoint
recovery of previous LSN.

  xlog_do_recovery_pass
    LIST_HEAD (buffer_list);
    xlog_recover_process
      xlog_recover_process_data
        xlog_recover_process_ophdr
          xlog_recovery_process_trans
            if (log->l_recovery_lsn != trans->r_lsn &&
                ohead->oh_flags & XLOG_COMMIT_TRANS) { 
              xfs_buf_delwri_submit(buffer_list); //submit buffer list
              log->l_recovery_lsn = trans->r_lsn;
            }
            xlog_recovery_process_trans
              xlog_recover_commit_trans
                xlog_recover_items_pass2
                  item->ri_ops->commit_pass2
                    xlog_recover_buf_commit_pass2
                      xfs_buf_delwri_queue(bp, buffer_list) //add bp to buffer list
    if (!list_empty(&buffer_list)) 
      /* submit buffers that was added in checkpoint recovery of last LSN */
      xfs_buf_delwri_submit(&buffer_list)

So, I think it should be:

checkpoint  start LSN	commit lsn	submission on commit record
A		32	  63		nothing, buffer list is empty
B		64	  68		buffer list for A
C		64	  92		nothing, start lsn unchanged
D		64	 127		nothing, start lsn unchanged
E		128	 192		buffer list for B, C and D

Thanks,
Long Li

> IOWs, the invariant "don't submit buffers until LSN changes" is not
> actually implemented correctly by this code. This is the obvious
> aspect of the problem, but addressing buffer submission doesn't
> actually fix the problem.
> 
> That is, changing buffer submission to be correct doesn't address
> the fact that we've already done things like updated the LSN in
> inodes and dquots during recovery of those objects. Hence,
> regardless of whether we submit the buffers or not, changes to
> non-buffer objects in checkpoints C and D will never get recovered
> directly if they were originally modified in checkpoint B.
> 
> This is the problem we need to address: if we have multiple
> checkpoints at the same start LSN, we need to ensure that all the
> changes to any object in any of the checkpoints at that start LSN
> are recovered. This is what we are not doing, and this is the root
> cause of the problem....
> 
> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx