On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote: > While performing the IO fault injection test, I caught the following data > corruption report: > > XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_ag_extent+0x79c/0x1130 > CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 > Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker > Call Trace: > <TASK> > dump_stack_lvl+0x50/0x70 > xfs_corruption_error+0x134/0x150 > xfs_free_ag_extent+0x7d3/0x1130 > __xfs_free_extent+0x201/0x3c0 > xfs_trans_free_extent+0x29b/0xa10 > xfs_extent_free_finish_item+0x2a/0xb0 > xfs_defer_finish_noroll+0x8d1/0x1b40 > xfs_defer_finish+0x21/0x200 > xfs_itruncate_extents_flags+0x1cb/0x650 > xfs_free_eofblocks+0x18f/0x250 > xfs_inactive+0x485/0x570 > xfs_inodegc_worker+0x207/0x530 > process_scheduled_works+0x24a/0xe10 > worker_thread+0x5ac/0xc60 > kthread+0x2cd/0x3c0 > ret_from_fork+0x4a/0x80 > ret_from_fork_asm+0x11/0x20 > </TASK> > XFS (dm-0): Corruption detected. Unmount and run xfs_repair > > After analyzing the disk image, it was found that the corruption was > triggered by the fact that extent was recorded in both the inode and AGF > btrees. After a long time of reproduction and analysis, we found that the > root cause of the problem was that the AGF btree block was not recovered. Why was it not recovered? Because of an injected IO error during recovery? > Consider the following situation, Transaction A and Transaction B are in > the same record, so Transaction A and Transaction B share the same LSN1. > If the buf item in Transaction A has been recovered, then the buf item in > Transaction B cannot be recovered, because log recovery skips items with a > metadata LSN >= the current LSN of the recovery item. This makes no sense to me. Transactions don't exist in the journal; they are purely in-memory constructs that are aggregated in memory (in the CIL) before being written to disk as an atomic checkpoint. Hence a log item can only appear once in a checkpoint regardless of how many transactions it is modified in memory between CIL checkpoints. > If there is still an > inode item in transaction B that records the Extent X, the Extent X will > be recorded in both the inode and the AGF btree block after transaction B > is recovered. That transaction should record both the addition to the inode BMBT and the removal from the AGF. Hence if transaction B is recovered in full with no errors, this should not occur. > > |------------Record (LSN1)------------------|---Record (LSN2)---| > |----------Trans A------------|-------------Trans B-------------| > | Buf Item(Extent X) | Buf Item / Inode item(Extent X) | > | Extent X is freed | Extent X is allocated | This looks wrong. A transaction can only exist in a single CIL checkpoint and everything in a checkpoint has the same LSN. Hence we cannot have the situation where trans B spans two different checkpoints and hence span LSNs. These are valid representations: |------------Record (LSN1)----|-----------------Record (LSN2)---| |----------Trans A------------|-------------Trans B-------------| |------------Record (LSN1)--------------------------------------| |----------Trans A------------|-------------Trans B-------------| |-----------------------------------------------Record (LSN2)---| |----------Trans A------------|-------------Trans B-------------| Only in the first case are there two instances of the AGF buf item object in the journal (one in each checkpoint). In the latter two cases, there is only one copy of the AGF buf log item that contains extent X. Indeed, it will not contain extent X, because the CIL aggregation results in the addition in trans A being elided by the removal in trans B, essentially resulting in the buffer being unchanged except for the LSN after recovery. As such, I'm really not sure what you are trying to describe here - if recovery of the checkpoint at LSN1 fails in any way, we should never attempt to recovery the checkpoint at LSN2. If LSN1 recoveres entirely successfully, then LSN2 should see the correct state and recover appropriately, too. Hence I don't see how the situation you are describing arises. > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on > LSN boundaries") was introduced, we submit buffers on lsn boundaries during > log recovery. Correct - we submit all the changes in a checkpoint for submission before we start recovering the next checkpoint. That's because checkpoints are supposed to be atomic units of change moving the on-disk state from one change set to the next. If any error during processing of a checkpoint occurs, we are supposed to abort recovery at that checkpoint so we don't create a situation where future recovery attempts skip checkpoints that need to be recovered. It does not matter if we write back the modified buffers from partially completed checkpoints - they were successfully recovered in their entirity, and so it is safe to write them back knowing that the next attempt to recover the failed checkpoint will see a matching LSN and skip that buffer item. If writeback fails, then it just doesn't matter as the next recovery attempt will see the old LSN and recover that buf item again and write it back.... AFAICT, you're describing things working as they are supposed to, and I don't see where the problem you are attempting to fix is yet. > The above problem can be avoided under normal paths, but it's > not guaranteed under abnormal paths. Consider the following process, if an > error was encountered after recover buf item in transaction A and before > recover buf item in transaction B, buffers that have been added to > buffer_list will still be submitted, this violates the submits rule on lsn > boundaries. So buf item in Transaction B cannot be recovered on the next > mount due to current lsn of transaction equal to metadata lsn on disk. > > xlog_do_recovery_pass > error = xlog_recover_process > xlog_recover_process_data > ... > xlog_recover_buf_commit_pass2 > xlog_recover_do_reg_buffer //recover buf item in Trans A > xfs_buf_delwri_queue(bp, buffer_list) > ... > ====> Encountered error and returned What error is this, and why isn't it a fatal error causing the checkpoint recovery to be aborted and the delwri list to be canceled? > ... > xlog_recover_buf_commit_pass2 > xlog_recover_do_reg_buffer //recover buf item in Trans B > xfs_buf_delwri_queue(bp, buffer_list) This should never occur as we should be aborting log recovery on the first failure, not continuing to process the checkpoint or starting to process other checkpoints. Where are we failing to handle an error? > if (!list_empty(&buffer_list)) > xfs_buf_delwri_submit(&buffer_list); //submit regardless of error Yes, that's fine (as per above). Indeed, this is how we handle releasing the buffer log item on failure - this goes through IO completion and that release the buf log item we added to the buffer during recovery for LSN stamping. > In order to make sure that submits buffers on lsn boundaries in the > abnormal paths, we need to check error status before submit buffers that > have been added from the last record processed. If error status exist, > buffers in the bufffer_list should be canceled. No, it does not need to be cancelled, it just needs to be processed. Anything we've fully recovered is safe to write - it's no different from having the system crash during AIL writeback having written back these buffers and having to recover from part way through this checkpoint. > Canceling the buffers in the buffer_list directly isn't correct, unlike > any other place where write list was canceled, these buffers has been > initialized by xfs_buf_item_init() during recovery and held by buf > item, buf items will not be released in xfs_buf_delwri_cancel(). If > these buffers are submitted successfully, buf items assocated with > the buffer will be released in io end process. So releasing buf item > in write list cacneling process is needed. Yes, that's why we use xfs_buf_delwri_submit() even on error - it handles releasing the buffer log item in IO completion handling (even on error). > Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery") > Signed-off-by: Long Li <leo.lilong@xxxxxxxxxx> > --- > fs/xfs/xfs_buf.c | 2 ++ > fs/xfs/xfs_log_recover.c | 22 +++++++++++++--------- > 2 files changed, 15 insertions(+), 9 deletions(-) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index 8e5bd50d29fe..6a1b26aaf97e 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel( > xfs_buf_lock(bp); > bp->b_flags &= ~_XBF_DELWRI_Q; > xfs_buf_list_del(bp); > + if (bp->b_log_item) > + xfs_buf_item_relse(bp); > xfs_buf_relse(bp); > } > } I don't think this is a good idea - the delwri does not own the buf log item reference, and so this will cause problems with anything that already handles buf log item references correctly. > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 1251c81e55f9..2cda6c90890d 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -2964,7 +2964,6 @@ xlog_do_recovery_pass( > char *offset; > char *hbp, *dbp; > int error = 0, h_size, h_len; > - int error2 = 0; > int bblks, split_bblks; > int hblks, split_hblks, wrapped_hblks; > int i; > @@ -3203,16 +3202,21 @@ xlog_do_recovery_pass( > bread_err1: > kmem_free(hbp); > > - /* > - * Submit buffers that have been added from the last record processed, > - * regardless of error status. > - */ > - if (!list_empty(&buffer_list)) > - error2 = xfs_buf_delwri_submit(&buffer_list); > - > if (error && first_bad) > *first_bad = rhead_blk; > > + /* > + * If there are no error, submit buffers that have been added from the > + * last record processed, othrewise cancel the write list, to ensure > + * submit buffers on LSN boundaries. > + */ > + if (!list_empty(&buffer_list)) { > + if (error) > + xfs_buf_delwri_cancel(&buffer_list); > + else > + error = xfs_buf_delwri_submit(&buffer_list); > + } > + > /* > * Transactions are freed at commit time but transactions without commit > * records on disk are never committed. Free any that may be left in the > @@ -3226,7 +3230,7 @@ xlog_do_recovery_pass( > xlog_recover_free_trans(trans); > } > > - return error ? error : error2; > + return error; > } > > /* > -- > 2.31.1 > > > -- Dave Chinner david@xxxxxxxxxxxxx