On Tue, Jul 27, 2021 at 05:10:05PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > There is a race between the new CIL async data device metadata IO > completion cache flush and the log tail in the iclog the flush > covers being updated. This can be seen by repeating generic/482 in a > loop and eventually log recovery fails with a failures such as this: Looks good, can't remember why I didn't do this last night: Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> --D > XFS (dm-3): Starting recovery (logdev: internal) > XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0) > XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify > XFS (dm-3): Unmount and run xfs_repair > XFS (dm-3): First 128 bytes of corrupted metadata buffer: > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117 > > Analysis of the logwrite replay shows that there were no writes to > the data device between the FUA @ write 124 and the FUA at write @ > 125, but log recovery @ 125 failed. The difference was the one log > write @ 125 moved the tail of the log forwards from (1,8) to (1,32) > and so the inode create intent in (1,8) was not replayed and so the > inode cluster was zero on disk when replay of the first inode item > in (1,32) was attempted. > > What this meant was that the journal write that occurred at @ 125 > did not ensure that metadata completed before the iclog was written > was correctly on stable storage. The tail of the log moved forward, > so IO must have been completed between the two iclog writes. This > means that there is a race condition between the unconditional async > cache flush in the CIL push work and the tail LSN that is written to > the iclog. This happens like so: > > CIL push work AIL push work > ------------- ------------- > Add to committing list > start async data dev cache flush > ..... > <flush completes> > <all writes to old tail lsn are stable> > xlog_write > .... push inode create buffer > <start IO> > ..... > xlog_write(commit record) > .... <IO completes> > log tail moves > xlog_assign_tail_lsn() > start_lsn == commit_lsn > <no iclog preflush!> > xlog_state_release_iclog > __xlog_state_release_iclog() > <writes *new* tail_lsn into iclog> > xlog_sync() > .... > submit_bio() > <tail in log moves forward without flushing written metadata> > > Essentially, this can only occur if the commit iclog is issued > without a cache flush. If the iclog bio is submitted with > REQ_PREFLUSH, then it will guarantee that all the completed IO is > one stable storage before the iclog bio with the new tail LSN in it > is written to the log. > > IOWs, the tail lsn that is written to the iclog needs to be sampled > *before* we issue the cache flush that guarantees all IO up to that > LSN has been completed. > > To fix this without giving up the performance advantage of the > flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1 > compared to 5.13), we need to ensure that we always issue a cache > flush if the tail LSN changes between the initial async flush and > the commit record being written. THis requires sampling the tail_lsn > before we start the flush, and then passing the sampled tail LSN to > xlog_state_release_iclog() so it can determine if the the tail LSN > has changed while writing the checkpoint. If the tail LSN has > changed, then it needs to set the NEED_FLUSH flag on the iclog and > we'll issue another cache flush before writing the iclog. > > Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions") > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > Reviewed-by: Christoph Hellwig <hch@xxxxxx> > --- > fs/xfs/xfs_log.c | 36 ++++++++++++++++++++++++++---------- > fs/xfs/xfs_log_cil.c | 13 +++++++++++-- > fs/xfs/xfs_log_priv.h | 3 ++- > 3 files changed, 39 insertions(+), 13 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index 82f5996d3889..e8c6c96d4f7c 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -489,12 +489,17 @@ xfs_log_reserve( > > /* > * Flush iclog to disk if this is the last reference to the given iclog and the > - * it is in the WANT_SYNC state. > + * it is in the WANT_SYNC state. If the caller passes in a non-zero > + * @old_tail_lsn and the current log tail does not match, there may be metadata > + * on disk that must be persisted before this iclog is written. To satisfy that > + * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this > + * iclog with the new log tail value. > */ > int > xlog_state_release_iclog( > struct xlog *log, > - struct xlog_in_core *iclog) > + struct xlog_in_core *iclog, > + xfs_lsn_t old_tail_lsn) > { > xfs_lsn_t tail_lsn; > lockdep_assert_held(&log->l_icloglock); > @@ -503,6 +508,19 @@ xlog_state_release_iclog( > if (iclog->ic_state == XLOG_STATE_IOERROR) > return -EIO; > > + /* > + * Grabbing the current log tail needs to be atomic w.r.t. the writing > + * of the tail LSN into the iclog so we guarantee that the log tail does > + * not move between deciding if a cache flush is required and writing > + * the LSN into the iclog below. > + */ > + if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) { > + tail_lsn = xlog_assign_tail_lsn(log->l_mp); > + > + if (old_tail_lsn && tail_lsn != old_tail_lsn) > + iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; > + } > + > if (!atomic_dec_and_test(&iclog->ic_refcnt)) > return 0; > > @@ -511,8 +529,6 @@ xlog_state_release_iclog( > return 0; > } > > - /* update tail before writing to iclog */ > - tail_lsn = xlog_assign_tail_lsn(log->l_mp); > iclog->ic_state = XLOG_STATE_SYNCING; > iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); > xlog_verify_tail_lsn(log, iclog, tail_lsn); > @@ -858,7 +874,7 @@ xlog_unmount_write( > * iclog containing the unmount record is written. > */ > iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); > - error = xlog_state_release_iclog(log, iclog); > + error = xlog_state_release_iclog(log, iclog, 0); > xlog_wait_on_iclog(iclog); > > if (tic) { > @@ -2302,7 +2318,7 @@ xlog_write_copy_finish( > return 0; > > release_iclog: > - error = xlog_state_release_iclog(log, iclog); > + error = xlog_state_release_iclog(log, iclog, 0); > spin_unlock(&log->l_icloglock); > return error; > } > @@ -2521,7 +2537,7 @@ xlog_write( > ASSERT(optype & XLOG_COMMIT_TRANS); > *commit_iclog = iclog; > } else { > - error = xlog_state_release_iclog(log, iclog); > + error = xlog_state_release_iclog(log, iclog, 0); > } > spin_unlock(&log->l_icloglock); > > @@ -2959,7 +2975,7 @@ xlog_state_get_iclog_space( > * reference to the iclog. > */ > if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1)) > - error = xlog_state_release_iclog(log, iclog); > + error = xlog_state_release_iclog(log, iclog, 0); > spin_unlock(&log->l_icloglock); > if (error) > return error; > @@ -3195,7 +3211,7 @@ xfs_log_force( > atomic_inc(&iclog->ic_refcnt); > lsn = be64_to_cpu(iclog->ic_header.h_lsn); > xlog_state_switch_iclogs(log, iclog, 0); > - if (xlog_state_release_iclog(log, iclog)) > + if (xlog_state_release_iclog(log, iclog, 0)) > goto out_error; > > if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) > @@ -3275,7 +3291,7 @@ xlog_force_lsn( > } > atomic_inc(&iclog->ic_refcnt); > xlog_state_switch_iclogs(log, iclog, 0); > - if (xlog_state_release_iclog(log, iclog)) > + if (xlog_state_release_iclog(log, iclog, 0)) > goto out_error; > if (log_flushed) > *log_flushed = 1; > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index b128aaa9b870..4c44bc3786c0 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -654,8 +654,9 @@ xlog_cil_push_work( > struct xfs_trans_header thdr; > struct xfs_log_iovec lhdr; > struct xfs_log_vec lvhdr = { NULL }; > + xfs_lsn_t preflush_tail_lsn; > xfs_lsn_t commit_lsn; > - xfs_lsn_t push_seq; > + xfs_csn_t push_seq; > struct bio bio; > DECLARE_COMPLETION_ONSTACK(bdev_flush); > > @@ -730,7 +731,15 @@ xlog_cil_push_work( > * because we hold the flush lock exclusively. Hence we can now issue > * a cache flush to ensure all the completed metadata in the journal we > * are about to overwrite is on stable storage. > + * > + * Because we are issuing this cache flush before we've written the > + * tail lsn to the iclog, we can have metadata IO completions move the > + * tail forwards between the completion of this flush and the iclog > + * being written. In this case, we need to re-issue the cache flush > + * before the iclog write. To detect whether the log tail moves, sample > + * the tail LSN *before* we issue the flush. > */ > + preflush_tail_lsn = atomic64_read(&log->l_tail_lsn); > xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev, > &bdev_flush); > > @@ -941,7 +950,7 @@ xlog_cil_push_work( > * storage. > */ > commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA; > - xlog_state_release_iclog(log, commit_iclog); > + xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn); > spin_unlock(&log->l_icloglock); > return; > > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index 4c41bbfa33b0..7cbde0b4f990 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -497,7 +497,8 @@ int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, > void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); > void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket); > > -int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog); > +int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog, > + xfs_lsn_t log_tail_lsn); > > /* > * When we crack an atomic LSN, we sample it first so that the value will not > -- > 2.31.1 >