On Wed, May 19, 2021 at 10:12:48PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > The AIL pushing is stalling on log forces when it comes across > pinned items. This is happening on removal workloads where the AIL > is dominated by stale items that are removed from AIL when the > checkpoint that marks the items stale is committed to the journal. > This results is relatively few items in the AIL, but those that are > are often pinned as directories items are being removed from are > still being logged. > > As a result, many push cycles through the CIL will first issue a > blocking log force to unpin the items. This can take some time to > complete, with tracing regularly showing push delays of half a > second and sometimes up into the range of several seconds. Sequences > like this aren't uncommon: > > .... > 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 > <wanted 20ms, got 270ms delay> > 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 > 400.099623: xfsaild: first lsn 0x11002f3600 > 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 > <wanted 50ms, got 500ms delay> > 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 > 400.589349: xfsaild: first lsn 0x1100305000 > 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 > <wanted 50ms, got 460ms delay> > 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 > 400.950343: xfsaild: first lsn 0x1100317c00 > 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 > <wanted 20ms, got 200ms delay> > 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 > 401.142334: xfsaild: first lsn 0x110032e600 > 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 > <wanted 10ms, got 10ms delay> > 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 > 401.154328: xfsaild: first lsn 0x1100353000 > 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 > <wanted 20ms, got 300ms delay> > 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 > 401.451526: xfsaild: first lsn 0x1100353000 > 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 > <wanted 50ms, got 500ms delay> > 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 > .... > > In each of these cases, every AIL pass saw 101 log items stuck on > the AIL (pinned) with very few other items being found. Each pass, a > log force was issued, and delay between last/first is the sleep time > + the sync log force time. > > Some of these 101 items pinned the tail of the log. The tail of the > log does slowly creep forward (first lsn), but the problem is that > the log is actually out of reservation space because it's been > running so many transactions that stale items that never reach the > AIL but consume log space. Hence we have a largely empty AIL, with > long term pins on items that pin the tail of the log that don't get > pushed frequently enough to keep log space available. > > The problem is the hundreds of milliseconds that we block in the log > force pushing the CIL out to disk. The AIL should not be stalled > like this - it needs to run and flush items that are at the tail of > the log with minimal latency. What we really need to do is trigger a > log flush, but then not wait for it at all - we've already done our > waiting for stuff to complete when we backed off prior to the log > force being issued. > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we > still do a blocking flush of the CIL and that is what is causing the > issue. Hence we need a new interface for the CIL to trigger an > immediate background push of the CIL to get it moving faster but not > to wait on that to occur. While the CIL is pushing, the AIL can also > be pushing. > > We already have an internal interface to do this - > xlog_cil_push_now() - but we need a wrapper for it to be used > externally. xlog_cil_force_seq() can easily be extended to do what > we need as it already implements the synchronous CIL push via > xlog_cil_push_now(). Add the necessary flags and "push current > sequence" semantics to xlog_cil_force_seq() and convert the AIL > pushing to use it. > > One of the complexities here is that the CIL push does not guarantee > that the commit record for the CIL checkpoint is written to disk. > The current log force ensures this by submitting the current ACTIVE > iclog that the commit record was written to. We need the CIL to > actually write this commit record to disk for an async push to > ensure that the checkpoint actually makes it to disk and unpins the > pinned items in the checkpoint on completion. Hence we need to pass > down to the CIL push that we are doing an async flush so that it can > switch out the commit_iclog if necessary to get written to disk when > the commit iclog is finally released. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Not sure if I like this new push_commit_stable name, but at least it doesn't have foo_sync = bar_async like last time. Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> --D > --- > fs/xfs/xfs_log.c | 38 ++++++++++++++------------ > fs/xfs/xfs_log.h | 1 + > fs/xfs/xfs_log_cil.c | 61 ++++++++++++++++++++++++++++++++++++------ > fs/xfs/xfs_log_priv.h | 5 ++++ > fs/xfs/xfs_sysfs.c | 1 + > fs/xfs/xfs_trace.c | 1 + > fs/xfs/xfs_trans.c | 2 +- > fs/xfs/xfs_trans_ail.c | 11 +++++--- > 8 files changed, 91 insertions(+), 29 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index aa37f4319052..c53644d19dd3 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -50,11 +50,6 @@ xlog_state_get_iclog_space( > int *continued_write, > int *logoffsetp); > STATIC void > -xlog_state_switch_iclogs( > - struct xlog *log, > - struct xlog_in_core *iclog, > - int eventual_size); > -STATIC void > xlog_grant_push_ail( > struct xlog *log, > int need_bytes); > @@ -3104,7 +3099,7 @@ xfs_log_ticket_ungrant( > * This routine will mark the current iclog in the ring as WANT_SYNC and move > * the current iclog pointer to the next iclog in the ring. > */ > -STATIC void > +void > xlog_state_switch_iclogs( > struct xlog *log, > struct xlog_in_core *iclog, > @@ -3251,6 +3246,20 @@ xfs_log_force( > return -EIO; > } > > +/* > + * Force the log to a specific LSN. > + * > + * If an iclog with that lsn can be found: > + * If it is in the DIRTY state, just return. > + * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC > + * state and go to sleep or return. > + * If it is in any other state, go to sleep or return. > + * > + * Synchronous forces are implemented with a wait queue. All callers trying > + * to force a given lsn to disk must wait on the queue attached to the > + * specific in-core log. When given in-core log finally completes its write > + * to disk, that thread will wake up all threads waiting on the queue. > + */ > static int > xlog_force_lsn( > struct xlog *log, > @@ -3314,18 +3323,13 @@ xlog_force_lsn( > } > > /* > - * Force the in-core log to disk for a specific LSN. > + * Force the log to a specific checkpoint sequence. > * > - * Find in-core log with lsn. > - * If it is in the DIRTY state, just return. > - * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC > - * state and go to sleep or return. > - * If it is in any other state, go to sleep or return. > - * > - * Synchronous forces are implemented with a wait queue. All callers trying > - * to force a given lsn to disk must wait on the queue attached to the > - * specific in-core log. When given in-core log finally completes its write > - * to disk, that thread will wake up all threads waiting on the queue. > + * First force the CIL so that all the required changes have been flushed to the > + * iclogs. If the CIL force completed it will return a commit LSN that indicates > + * the iclog that needs to be flushed to stable storage. If the caller needs > + * a synchronous log force, we will wait on the iclog with the LSN returned by > + * xlog_cil_force_seq() to be completed. > */ > int > xfs_log_force_seq( > diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h > index 813b972e9788..1bd080ce3a95 100644 > --- a/fs/xfs/xfs_log.h > +++ b/fs/xfs/xfs_log.h > @@ -104,6 +104,7 @@ struct xlog_ticket; > struct xfs_log_item; > struct xfs_item_ops; > struct xfs_trans; > +struct xlog; > > int xfs_log_force(struct xfs_mount *mp, uint flags); > int xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags, > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 3c2b1205944d..cb849e67b1c4 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -658,6 +658,7 @@ xlog_cil_push_work( > xfs_lsn_t push_seq; > struct bio bio; > DECLARE_COMPLETION_ONSTACK(bdev_flush); > + bool push_commit_stable; > > new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); > new_ctx->ticket = xlog_cil_ticket_alloc(log); > @@ -668,6 +669,8 @@ xlog_cil_push_work( > spin_lock(&cil->xc_push_lock); > push_seq = cil->xc_push_seq; > ASSERT(push_seq <= ctx->sequence); > + push_commit_stable = cil->xc_push_commit_stable; > + cil->xc_push_commit_stable = false; > > /* > * As we are about to switch to a new, empty CIL context, we no longer > @@ -910,8 +913,15 @@ xlog_cil_push_work( > * The commit iclog must be written to stable storage to guarantee > * journal IO vs metadata writeback IO is correctly ordered on stable > * storage. > + * > + * If the push caller needs the commit to be immediately stable and the > + * commit_iclog is not yet marked as XLOG_STATE_WANT_SYNC to indicate it > + * will be written when released, switch it's state to WANT_SYNC right > + * now. > */ > commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA; > + if (push_commit_stable && commit_iclog->ic_state == XLOG_STATE_ACTIVE) > + xlog_state_switch_iclogs(log, commit_iclog, 0); > xlog_state_release_iclog(log, commit_iclog); > spin_unlock(&log->l_icloglock); > return; > @@ -996,13 +1006,26 @@ xlog_cil_push_background( > /* > * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence > * number that is passed. When it returns, the work will be queued for > - * @push_seq, but it won't be completed. The caller is expected to do any > - * waiting for push_seq to complete if it is required. > + * @push_seq, but it won't be completed. > + * > + * If the caller is performing a synchronous force, we will flush the workqueue > + * to get previously queued work moving to minimise the wait time they will > + * undergo waiting for all outstanding pushes to complete. The caller is > + * expected to do the required waiting for push_seq to complete. > + * > + * If the caller is performing an async push, we need to ensure that the > + * checkpoint is fully flushed out of the iclogs when we finish the push. If we > + * don't do this, then the commit record may remain sitting in memory in an > + * ACTIVE iclog. This then requires another full log force to push to disk, > + * which defeats the purpose of having an async, non-blocking CIL force > + * mechanism. Hence in this case we need to pass a flag to the push work to > + * indicate it needs to flush the commit record itself. > */ > static void > xlog_cil_push_now( > struct xlog *log, > - xfs_lsn_t push_seq) > + xfs_lsn_t push_seq, > + bool async) > { > struct xfs_cil *cil = log->l_cilp; > > @@ -1012,7 +1035,8 @@ xlog_cil_push_now( > ASSERT(push_seq && push_seq <= cil->xc_current_sequence); > > /* start on any pending background push to minimise wait time on it */ > - flush_work(&cil->xc_push_work); > + if (!async) > + flush_work(&cil->xc_push_work); > > /* > * If the CIL is empty or we've already pushed the sequence then > @@ -1025,6 +1049,7 @@ xlog_cil_push_now( > } > > cil->xc_push_seq = push_seq; > + cil->xc_push_commit_stable = async; > queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); > spin_unlock(&cil->xc_push_lock); > } > @@ -1109,12 +1134,27 @@ xlog_cil_commit( > xlog_cil_push_background(log); > } > > +/* > + * Flush the CIL to stable storage but don't wait for it to complete. This > + * requires the CIL push to ensure the commit record for the push hits the disk, > + * but otherwise is no different to a push done from a log force. > + */ > +void > +xlog_cil_flush( > + struct xlog *log) > +{ > + xfs_csn_t seq = log->l_cilp->xc_current_sequence; > + > + trace_xfs_log_force(log->l_mp, seq, _RET_IP_); > + xlog_cil_push_now(log, seq, true); > +} > + > /* > * Conditionally push the CIL based on the sequence passed in. > * > - * We only need to push if we haven't already pushed the sequence > - * number given. Hence the only time we will trigger a push here is > - * if the push sequence is the same as the current context. > + * We only need to push if we haven't already pushed the sequence number given. > + * Hence the only time we will trigger a push here is if the push sequence is > + * the same as the current context. > * > * We return the current commit lsn to allow the callers to determine if a > * iclog flush is necessary following this call. > @@ -1130,13 +1170,17 @@ xlog_cil_force_seq( > > ASSERT(sequence <= cil->xc_current_sequence); > > + if (!sequence) > + sequence = cil->xc_current_sequence; > + trace_xfs_log_force(log->l_mp, sequence, _RET_IP_); > + > /* > * check to see if we need to force out the current context. > * xlog_cil_push() handles racing pushes for the same sequence, > * so no need to deal with it here. > */ > restart: > - xlog_cil_push_now(log, sequence); > + xlog_cil_push_now(log, sequence, false); > > /* > * See if we can find a previous sequence still committing. > @@ -1160,6 +1204,7 @@ xlog_cil_force_seq( > * It is still being pushed! Wait for the push to > * complete, then start again from the beginning. > */ > + XFS_STATS_INC(log->l_mp, xs_log_force_sleep); > xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock); > goto restart; > } > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index 2d7e7cbee8b7..a863ccb5ece6 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -273,6 +273,7 @@ struct xfs_cil { > > spinlock_t xc_push_lock ____cacheline_aligned_in_smp; > xfs_csn_t xc_push_seq; > + bool xc_push_commit_stable; > struct list_head xc_committing; > wait_queue_head_t xc_commit_wait; > xfs_csn_t xc_current_sequence; > @@ -487,9 +488,12 @@ int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, > struct xlog_in_core **commit_iclog, uint optype); > int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, > struct xlog_in_core **iclog, xfs_lsn_t *lsn); > + > void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); > void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket); > > +void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog, > + int eventual_size); > int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog); > > /* > @@ -560,6 +564,7 @@ void xlog_cil_commit(struct xlog *log, struct xfs_trans *tp, > /* > * CIL force routines > */ > +void xlog_cil_flush(struct xlog *log); > xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence); > > static inline void > diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c > index f1bc88f4367c..18dc5eca6c04 100644 > --- a/fs/xfs/xfs_sysfs.c > +++ b/fs/xfs/xfs_sysfs.c > @@ -10,6 +10,7 @@ > #include "xfs_log_format.h" > #include "xfs_trans_resv.h" > #include "xfs_sysfs.h" > +#include "xfs_log.h" > #include "xfs_log_priv.h" > #include "xfs_mount.h" > > diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c > index 7e01e00550ac..4c86afad1617 100644 > --- a/fs/xfs/xfs_trace.c > +++ b/fs/xfs/xfs_trace.c > @@ -20,6 +20,7 @@ > #include "xfs_bmap.h" > #include "xfs_attr.h" > #include "xfs_trans.h" > +#include "xfs_log.h" > #include "xfs_log_priv.h" > #include "xfs_buf_item.h" > #include "xfs_quota.h" > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 87bffd12c20c..c214a69b573d 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -9,7 +9,6 @@ > #include "xfs_shared.h" > #include "xfs_format.h" > #include "xfs_log_format.h" > -#include "xfs_log_priv.h" > #include "xfs_trans_resv.h" > #include "xfs_mount.h" > #include "xfs_extent_busy.h" > @@ -17,6 +16,7 @@ > #include "xfs_trans.h" > #include "xfs_trans_priv.h" > #include "xfs_log.h" > +#include "xfs_log_priv.h" > #include "xfs_trace.h" > #include "xfs_error.h" > #include "xfs_defer.h" > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index dbb69b4bf3ed..69aac416e2ce 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -17,6 +17,7 @@ > #include "xfs_errortag.h" > #include "xfs_error.h" > #include "xfs_log.h" > +#include "xfs_log_priv.h" > > #ifdef DEBUG > /* > @@ -429,8 +430,12 @@ xfsaild_push( > > /* > * If we encountered pinned items or did not finish writing out all > - * buffers the last time we ran, force the log first and wait for it > - * before pushing again. > + * buffers the last time we ran, force a background CIL push to get the > + * items unpinned in the near future. We do not wait on the CIL push as > + * that could stall us for seconds if there is enough background IO > + * load. Stalling for that long when the tail of the log is pinned and > + * needs flushing will hard stop the transaction subsystem when log > + * space runs out. > */ > if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && > (!list_empty_careful(&ailp->ail_buf_list) || > @@ -438,7 +443,7 @@ xfsaild_push( > ailp->ail_log_flush = 0; > > XFS_STATS_INC(mp, xs_push_ail_flush); > - xfs_log_force(mp, XFS_LOG_SYNC); > + xlog_cil_flush(mp->m_log); > } > > spin_lock(&ailp->ail_lock); > -- > 2.31.1 >