On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote: > On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > There's a subtle design flaw in the deferred log item code that can lead > > to pinning the log tail. Taking up the defer ops chain examples from > > the previous commit, we can get trapped in sequences like this: > > > > Caller hands us a transaction t0 with D0-D3 attached. The defer ops > > chain will look like the following if the transaction rolls succeed: > > > > t1: D0(t0), D1(t0), D2(t0), D3(t0) > > t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) > > t3: d5(t1), D1(t0), D2(t0), D3(t0) > > ... > > t9: d9(t7), D3(t0) > > t10: D3(t0) > > t11: d10(t10), d11(t10) > > t12: d11(t10) > > > > In transaction 9, we finish d9 and try to roll to t10 while holding onto > > an intent item for D3 that we logged in t0. > > > > The previous commit changed the order in which we place new defer ops in > > the defer ops processing chain to reduce the maximum chain length. Now > > make xfs_defer_finish_noroll capable of relogging the entire chain > > periodically so that we can always move the log tail forward. Most > > chains will never get relogged, except for operations that generate very > > long chains (large extents containing many blocks with different sharing > > levels) or are on filesystems with small logs and a lot of ongoing > > metadata updates. > > > > Callers are now required to ensure that the transaction reservation is > > large enough to handle logging done items and new intent items for the > > maximum possible chain length. Most callers are careful to keep the > > chain lengths low, so the overhead should be minimal. > > > > The decision to relog an intent item is made based on whether or not the > > intent was added to the current checkpoint. If so, the checkpoint is > > still open and there's no point in relogging. Otherwise, the old > > checkpoint is closed and we relog the intent to add it to the current > > one. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > --- > > fs/xfs/libxfs/xfs_defer.c | 52 ++++++++++++++++++++++++++++++++++++++++++++ > > fs/xfs/xfs_bmap_item.c | 27 +++++++++++++++++++++++ > > fs/xfs/xfs_extfree_item.c | 29 +++++++++++++++++++++++++ > > fs/xfs/xfs_refcount_item.c | 27 +++++++++++++++++++++++ > > fs/xfs/xfs_rmap_item.c | 27 +++++++++++++++++++++++ > > fs/xfs/xfs_trace.h | 1 + > > fs/xfs/xfs_trans.h | 10 ++++++++ > > 7 files changed, 173 insertions(+) > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > > index 84a70edd0da1..c601cc2af254 100644 > > --- a/fs/xfs/libxfs/xfs_defer.c > > +++ b/fs/xfs/libxfs/xfs_defer.c > > @@ -17,6 +17,7 @@ > > #include "xfs_inode_item.h" > > #include "xfs_trace.h" > > #include "xfs_icache.h" > > +#include "xfs_log.h" > > > > /* > > * Deferred Operations in XFS > > @@ -361,6 +362,52 @@ xfs_defer_cancel_list( > > } > > } > > > > +/* > > + * Prevent a log intent item from pinning the tail of the log by logging a > > + * done item to release the intent item; and then log a new intent item. > > + * The caller should provide a fresh transaction and roll it after we're done. > > + */ > > +static int > > +xfs_defer_relog( > > + struct xfs_trans **tpp, > > + struct list_head *dfops) > > +{ > > + struct xfs_defer_pending *dfp; > > + xfs_lsn_t threshold_lsn; > > + > > + ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES); > > + > > + /* > > + * Figure out where we need the tail to be in order to maintain the > > + * minimum required free space in the log. > > + */ > > + threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0); > > + if (threshold_lsn == NULLCOMMITLSN) > > + return 0; > > This smells of premature optimisation. > > When we are in a tail-pushing scenario (i.e. any sort of > sustained metadata workload) this will always return true, and so we > will relog every intent that isn't in the current checkpoint every > time this is called. Under light load, we don't care if we add a > little bit of relogging overhead as the CIL slowly flushes/pushes - > it will have neglible impact on performance because there is little > load on the journal. > > However, when we are under heavy load the code will now be reading > the grant head and log position accounting variables during every > commit, hence greatly increasing the number and temporal > distribution of accesses to the hotest cachelines in the log. We > currently never access these cache lines during commit unless the > unit reservation has run out and we have to regrant physical log > space for the transaction to continue (i.e. we are into slow path > commit code). IOWs, this is like causing far more than double the > number of accesses to the grant head, the log tail, the > last_sync_lsn, etc, all of which is unnecessary exactly when we care > about minimising contention on the log space accounting variables... > > Given that it is a redundant check under heavy load journal load > when access to the log grant/head/tail are already contended, > I think we should just be checking the "in current checkpoint" logic > and not making it conditional on the log being near full. <nod> FWIW I broke this patch up again into the first part that only does relogging if the checkpoints don't match, and a second part that does the LSN push target check to see if I could observe any difference. Across a ~4h fstests run I noticed that there was about ~20% fewer relogs, but OTOH the total runtime didn't change noticeably. I kind of wondered if the increased cacheline contention would at least slow down the frontend a bit to give the log a chance to push things out, but haven't had time to dig any further than "ran fstests, recorded runtimes and grep | wc -l'd the ftrace log". Anyway, I was about to resend with all these patches rebased against something resembling the 5.10 branch, so expect to see this broken out a bit. --D > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx