On Thu, Jul 22, 2021 at 08:06:47AM +0100, Christoph Hellwig wrote: > Due the combination of the code move and the actual change I had > a really hard time reviewing this. > > > - trace_xlog_iclog_syncing(iclog, _RET_IP_); > > This tracepoint got lost. > > Otherwise this looks good. > > As I had to split the move from the technical changes to follow > the change I might as well share my split, attached below as two > patches: > > From 6f8ff7f10b25986518ff76e9a6ef61edb629a97f Mon Sep 17 00:00:00 2001 > From: Dave Chinner <dchinner@xxxxxxxxxx> > Date: Thu, 22 Jul 2021 08:58:49 +0200 > Subject: xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog > > Fold __xlog_state_release_iclog intos it's only caller to prepare > make an upcomding fix easier. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > [hch: split from a larger patch] > Signed-off-by: Christoph Hellwig <hch@xxxxxx> I like this split out as a refactoring patch, it /does/ make things easier to read. For this patch, Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> > --- > fs/xfs/xfs_log.c | 46 ++++++++++++++++++---------------------------- > 1 file changed, 18 insertions(+), 28 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index a3c4d48195d98a..d7ab9d0814c2c6 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -487,29 +487,6 @@ xfs_log_reserve( > return error; > } > > -static bool > -__xlog_state_release_iclog( > - struct xlog *log, > - struct xlog_in_core *iclog) > -{ > - lockdep_assert_held(&log->l_icloglock); > - > - if (iclog->ic_state == XLOG_STATE_WANT_SYNC) { > - /* update tail before writing to iclog */ > - xfs_lsn_t tail_lsn = xlog_assign_tail_lsn(log->l_mp); > - > - iclog->ic_state = XLOG_STATE_SYNCING; > - iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); > - xlog_verify_tail_lsn(log, iclog, tail_lsn); > - /* cycle incremented when incrementing curr_block */ > - trace_xlog_iclog_syncing(iclog, _RET_IP_); > - return true; > - } > - > - ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE); > - return false; > -} > - > /* > * Flush iclog to disk if this is the last reference to the given iclog and the > * it is in the WANT_SYNC state. > @@ -519,19 +496,32 @@ xlog_state_release_iclog( > struct xlog *log, > struct xlog_in_core *iclog) > { > + xfs_lsn_t tail_lsn; > + > lockdep_assert_held(&log->l_icloglock); > > trace_xlog_iclog_release(iclog, _RET_IP_); > if (iclog->ic_state == XLOG_STATE_IOERROR) > return -EIO; > > - if (atomic_dec_and_test(&iclog->ic_refcnt) && > - __xlog_state_release_iclog(log, iclog)) { > - spin_unlock(&log->l_icloglock); > - xlog_sync(log, iclog); > - spin_lock(&log->l_icloglock); > + if (!atomic_dec_and_test(&iclog->ic_refcnt)) > + return 0; > + > + if (iclog->ic_state != XLOG_STATE_WANT_SYNC) { > + ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE); > + return 0; > } > > + /* update tail before writing to iclog */ > + tail_lsn = xlog_assign_tail_lsn(log->l_mp); > + iclog->ic_state = XLOG_STATE_SYNCING; > + iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); > + xlog_verify_tail_lsn(log, iclog, tail_lsn); > + trace_xlog_iclog_syncing(iclog, _RET_IP_); > + > + spin_unlock(&log->l_icloglock); > + xlog_sync(log, iclog); > + spin_lock(&log->l_icloglock); > return 0; > } > > -- > 2.30.2 > > From 7a592dfeef83ca61ae001d612b8932027e0080cf Mon Sep 17 00:00:00 2001 > From: Dave Chinner <dchinner@xxxxxxxxxx> > Date: Thu, 22 Jul 2021 09:00:08 +0200 > Subject: xfs: fix ordering violation between cache flushes and tail updates > > There is a race between the new CIL async data device metadata IO > completion cache flush and the log tail in the iclog the flush > covers being updated. This can be seen by repeating generic/482 in a > loop and eventually log recovery fails with a failures such as this: > > XFS (dm-3): Starting recovery (logdev: internal) > XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0) > XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify > XFS (dm-3): Unmount and run xfs_repair > XFS (dm-3): First 128 bytes of corrupted metadata buffer: > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117 > > Analysis of the logwrite replay shows that there were no writes to > the data device between the FUA @ write 124 and the FUA at write @ > 125, but log recovery @ 125 failed. The difference was the one log > write @ 125 moved the tail of the log forwards from (1,8) to (1,32) > and so the inode create intent in (1,8) was not replayed and so the > inode cluster was zero on disk when replay of the first inode item > in (1,32) was attempted. > > What this meant was that the journal write that occurred at @ 125 > did not ensure that metadata completed before the iclog was written > was correctly on stable storage. The tail of the log moved forward, > so IO must have been completed between the two iclog writes. This > means that there is a race condition between the unconditional async > cache flush in the CIL push work and the tail LSN that is written to > the iclog. This happens like so: > > CIL push work AIL push work > ------------- ------------- > Add to committing list > start async data dev cache flush > ..... > <flush completes> > <all writes to old tail lsn are stable> > xlog_write > .... push inode create buffer > <start IO> > ..... > xlog_write(commit record) > .... <IO completes> > log tail moves > xlog_assign_tail_lsn() > start_lsn == commit_lsn > <no iclog preflush!> > xlog_state_release_iclog > __xlog_state_release_iclog() > <writes *new* tail_lsn into iclog> > xlog_sync() > .... > submit_bio() > <tail in log moves forward without flushing written metadata> > > Essentially, this can only occur if the commit iclog is issued > without a cache flush. If the iclog bio is submitted with > REQ_PREFLUSH, then it will guarantee that all the completed IO is > one stable storage before the iclog bio with the new tail LSN in it > is written to the log. > > IOWs, the tail lsn that is written to the iclog needs to be sampled > *before* we issue the cache flush that guarantees all IO up to that > LSN has been completed. > > To fix this without giving up the performance advantage of the > flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1 > compared to 5.13), we need to ensure that we always issue a cache > flush if the tail LSN changes between the initial async flush and > the commit record being written. This requires sampling the tail_lsn > before we start the flush, and then passing the sampled tail LSN to > xlog_state_release_iclog() so it can determine if the the tail LSN > has changed while writing the checkpoint. If the tail LSN has > changed, then it needs to set the NEED_FLUSH flag on the iclog and > we'll issue another cache flush before writing the iclog. > > Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions") > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/xfs_log.c | 38 +++++++++++++++++++++++++++++--------- > fs/xfs/xfs_log_cil.c | 13 +++++++++++-- > fs/xfs/xfs_log_priv.h | 3 ++- > 3 files changed, 42 insertions(+), 12 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index d7ab9d0814c2c6..5767fd3ab90555 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -489,12 +489,20 @@ xfs_log_reserve( > > /* > * Flush iclog to disk if this is the last reference to the given iclog and the > - * it is in the WANT_SYNC state. > + * it is in the WANT_SYNC state. If the caller passes in a non-zero > + * @old_tail_lsn, then we need to check if the log tail is different to the Nit: NULLCOMMITLSN instead of zero? > + * caller's value. If it is different, this indicates that the log tail has > + * moved since the caller sampled the log tail and issued a cache flush and so "...sampled the log tail as part of issuing a data device cache flush, so..." > + * there may be metadata on disk taht we need to flush before this iclog is s/taht/that/ > + * written. In that case, set the XLOG_ICL_NEED_FLUSH flag so taht when the s/taht/that/ > + * iclog is synced we correctly issue a cache flush before the iclog is "...issue a cache flush to the log and data devices before the iclog..." With those issues addressed I think this look decent, modulo my question in the previous patch. --D > + * submitted. > */ > int > xlog_state_release_iclog( > struct xlog *log, > - struct xlog_in_core *iclog) > + struct xlog_in_core *iclog, > + xfs_lsn_t old_tail_lsn) > { > xfs_lsn_t tail_lsn; > > @@ -504,6 +512,19 @@ xlog_state_release_iclog( > if (iclog->ic_state == XLOG_STATE_IOERROR) > return -EIO; > > + /* > + * Grabbing the current log tail needs to be atomic w.r.t. the writing > + * of the tail LSN into the iclog so we guarantee that the log tail does > + * not move between deciding if a cache flush is required and writing > + * the LSN into the iclog below. > + */ > + if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) { > + tail_lsn = xlog_assign_tail_lsn(log->l_mp); > + > + if (old_tail_lsn && tail_lsn != old_tail_lsn) > + iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; > + } > + > if (!atomic_dec_and_test(&iclog->ic_refcnt)) > return 0; > > @@ -513,7 +534,6 @@ xlog_state_release_iclog( > } > > /* update tail before writing to iclog */ > - tail_lsn = xlog_assign_tail_lsn(log->l_mp); > iclog->ic_state = XLOG_STATE_SYNCING; > iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); > xlog_verify_tail_lsn(log, iclog, tail_lsn); > @@ -859,7 +879,7 @@ xlog_unmount_write( > * iclog containing the unmount record is written. > */ > iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); > - error = xlog_state_release_iclog(log, iclog); > + error = xlog_state_release_iclog(log, iclog, 0); > xlog_wait_on_iclog(iclog); > > if (tic) { > @@ -2303,7 +2323,7 @@ xlog_write_copy_finish( > return 0; > > release_iclog: > - error = xlog_state_release_iclog(log, iclog); > + error = xlog_state_release_iclog(log, iclog, 0); > spin_unlock(&log->l_icloglock); > return error; > } > @@ -2522,7 +2542,7 @@ xlog_write( > ASSERT(optype & XLOG_COMMIT_TRANS); > *commit_iclog = iclog; > } else { > - error = xlog_state_release_iclog(log, iclog); > + error = xlog_state_release_iclog(log, iclog, 0); > } > spin_unlock(&log->l_icloglock); > > @@ -2960,7 +2980,7 @@ xlog_state_get_iclog_space( > * reference to the iclog. > */ > if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1)) > - error = xlog_state_release_iclog(log, iclog); > + error = xlog_state_release_iclog(log, iclog, 0); > spin_unlock(&log->l_icloglock); > if (error) > return error; > @@ -3196,7 +3216,7 @@ xfs_log_force( > atomic_inc(&iclog->ic_refcnt); > lsn = be64_to_cpu(iclog->ic_header.h_lsn); > xlog_state_switch_iclogs(log, iclog, 0); > - if (xlog_state_release_iclog(log, iclog)) > + if (xlog_state_release_iclog(log, iclog, 0)) > goto out_error; > > if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) > @@ -3276,7 +3296,7 @@ xlog_force_lsn( > } > atomic_inc(&iclog->ic_refcnt); > xlog_state_switch_iclogs(log, iclog, 0); > - if (xlog_state_release_iclog(log, iclog)) > + if (xlog_state_release_iclog(log, iclog, 0)) > goto out_error; > if (log_flushed) > *log_flushed = 1; > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index b128aaa9b870d5..4c44bc3786c0f0 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -654,8 +654,9 @@ xlog_cil_push_work( > struct xfs_trans_header thdr; > struct xfs_log_iovec lhdr; > struct xfs_log_vec lvhdr = { NULL }; > + xfs_lsn_t preflush_tail_lsn; > xfs_lsn_t commit_lsn; > - xfs_lsn_t push_seq; > + xfs_csn_t push_seq; > struct bio bio; > DECLARE_COMPLETION_ONSTACK(bdev_flush); > > @@ -730,7 +731,15 @@ xlog_cil_push_work( > * because we hold the flush lock exclusively. Hence we can now issue > * a cache flush to ensure all the completed metadata in the journal we > * are about to overwrite is on stable storage. > + * > + * Because we are issuing this cache flush before we've written the > + * tail lsn to the iclog, we can have metadata IO completions move the > + * tail forwards between the completion of this flush and the iclog > + * being written. In this case, we need to re-issue the cache flush > + * before the iclog write. To detect whether the log tail moves, sample > + * the tail LSN *before* we issue the flush. > */ > + preflush_tail_lsn = atomic64_read(&log->l_tail_lsn); > xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev, > &bdev_flush); > > @@ -941,7 +950,7 @@ xlog_cil_push_work( > * storage. > */ > commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA; > - xlog_state_release_iclog(log, commit_iclog); > + xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn); > spin_unlock(&log->l_icloglock); > return; > > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index 4c41bbfa33b0df..7cbde0b4f9901d 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -497,7 +497,8 @@ int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, > void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); > void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket); > > -int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog); > +int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog, > + xfs_lsn_t log_tail_lsn); > > /* > * When we crack an atomic LSN, we sample it first so that the value will not > -- > 2.30.2 >