On Tue, Sep 29, 2020 at 09:09:10AM +1000, Dave Chinner wrote:
> On Mon, Sep 28, 2020 at 11:16:27AM -0400, Brian Foster wrote:
> > On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > > > +xfs_defer_relog(
> > > > +	struct xfs_trans		**tpp,
> > > > +	struct list_head		*dfops)
> > > > +{
> > > > +	struct xfs_defer_pending	*dfp;
> > > > +	xfs_lsn_t			threshold_lsn;
> > > > +
> > > > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > +
> > > > +	/*
> > > > +	 * Figure out where we need the tail to be in order to maintain the
> > > > +	 * minimum required free space in the log.
> > > > +	 */
> > > > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > > > +	if (threshold_lsn == NULLCOMMITLSN)
> > > > +		return 0;
> > > 
> > > This smells of premature optimisation.
> > > 
> > > When we are in a tail-pushing scenario (i.e. any sort of
> > > sustained metadata workload) this will always return true, and so we
> > > will relog every intent that isn't in the current checkpoint every
> > > time this is called. Under light load, we don't care if we add a
> > > little bit of relogging overhead as the CIL slowly flushes/pushes -
> > > it will have negligible impact on performance because there is little
> > > load on the journal.
> > > 
> > 
> > That seems like an overly broad and not necessarily correct assumption.
> > The threshold above is tail relative and hardcoded (with the zero param)
> > to the default AIL pushing threshold, which is 25% of the log. If we
> 
> The pushing threshold is that 75% of the log space has been reserved by
> the grant head, not that there's only 25% of the log free. The
> reservation is typically much, much larger than the change that is
> actually written to the log, and so we can be tail pushing even when
> the log is largely empty and the CIL hasn't reached its threshold
> for pushing.
> 
> Think about it - if we have a log with a 256kB log stripe unit,
> an itruncate reservation is somewhere around 1.5MB in size (700kB *
> 2), and if we have reflink/rmap enabled, it's closer to 6MB in size.
> Yet a typical modification will not use any of the 512kB*8 space
> reserved for the LSU (because CIL!) and might only use 10kB of new
> log space in the CIL because that's all that was newly dirtied.
> 
> That means we reserve 6MB of log space for each itruncate
> transaction in flight (directory ops are almost as heavyweight), so
> it only takes 16 transactions in flight to consume 100MB of reserve
> grant head space. If we've got a log smaller than 130MB in this
> situation, we are into tail pushing -even if the physical log is
> empty-.
> 
> IOWs, tail pushing is not triggered by how much physical log space
> has been consumed - it is triggered by reservation pressure. And we
> can trivially drive the system into reservation pressure with
> concurrent workloads.
> 

Yes, I understand. What I'm saying is that once we get past the
reservation head checks, there is a threshold lsn check that requires an
intent to sit in that 25% range from the on-disk tail of the log for it
to be relogged. That _might_ unnecessarily relog in certain cases (i.e.
when the active range of the log is small), but it is not always the
case and I don't think it results in per CIL context relogging in
practice.
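
Just to be explicit about the check I'm referring to, here is a
hand-written sketch of the per-intent filter I have in mind -- this is
not the actual patch code, so the details may differ:

	/*
	 * Sketch: only relog an intent that has already been committed, is
	 * not part of the CIL context currently being built, and whose last
	 * committed lsn has fallen behind the tail pushing threshold
	 * calculated above (i.e. it sits in that 25% range off the on-disk
	 * tail).
	 */
	list_for_each_entry(dfp, dfops, dfp_list) {
		if (dfp->dfp_intent == NULL ||
		    xfs_log_item_in_current_chkpt(dfp->dfp_intent) ||
		    XFS_LSN_CMP(dfp->dfp_intent->li_lsn, threshold_lsn) >= 0)
			continue;

		/* ... mark dfp for relogging in the next transaction roll ... */
	}

IOWs, sampling the grant heads alone doesn't trigger relogging; the
intent also has to have aged back towards the on-disk tail.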

> > assume sustained tail pushing conditions, a committed intent doesn't
> > trigger relog again until it falls within that threshold in the on-disk
> > log. The current CIL (nonblocking) size threshold is half that at 12.5%.
> 
> Or smaller. For small logs it is 12.5% of the log size. For larger
> logs, it caps at 32MB (8 * (256k << 4)).
> 
> > So relative to a given rolling transaction processing an intent chain,
> > there's a decent number of full CIL checkpoints that can occur before
> > any of those intents need to relog again (if at all), regardless of log
> > size.
> 
> Your logic is inverted because reservation pressure does not reflect
> CIL pressure. Log space consumption and reservation pressure were
> decoupled by the in-memory relogging that the CIL enables, hence
> given a specific intent chain, the likely probability is that it
> will not span more than a single CIL context. i.e. it will complete
> rolling entirely within a single CIL commit.
> 
> As the log gets larger and the CIL increases in size, the
> probability that an intent chain completes within a single CIL
> commit goes up (e.g. 6MB reservation vs 32MB CIL push threshold).
> The lighter the load, the more likely it is that a transaction chain
> will complete within a single CIL checkpoint, even on small logs.
> 

Sure, I agree with that. That's not a major factor for intent chains
that pin the log tail long enough to cause deadlocks.

> > Without that logic, we're changing behavior to relog the entire chain in
> > every CIL checkpoint. That's a fairly significant change in behavior
> > with less predictable breakdown conditions. Do we know how long a single
> > chain is going to be?
> 
> Yes: the typical worst case chain length is defined by the
> transaction reservation log count. That is supposed to cover >99%
> of typical log space requirements for that operation.
> 

I don't think reservation log counts define worst case conditions for
intent chains. Darrick has documented cases in the commit logs that far
exceed those values. I'm assuming those are legitimate/observed
scenarios, but could be mistaken.

> For any chain that is longer than this, we end up in the slow
> regrant path that reserves a unit of log space at a time, and these
> are the cases where the transaction must be careful about not
> pinning the tail of the log. i.e. long chains are exactly the case
> where we want to relog the intents in every new CIL context. This is
> no different to the way we relog the inode in -every transaction
> roll- whether it is modified or not so that we can guarantee it
> doesn't pin the tail of the log when we need to regrant space on
> commit....
> 

Yes, the relogging approach is the same. The difference is that in one
case we're logging a small, fixed size (inode) object repeatedly that is
already factored into the transaction reservation and in the other
case, we're logging a dynamic and non-deterministic number of (intent)
objects a non-deterministic number of times.
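
For reference, the per-roll inode relogging you refer to is basically
the xfs_trans_roll_inode() pattern (quoting from memory here, so the
details may be slightly off):

int
xfs_trans_roll_inode(
	struct xfs_trans	**tpp,
	struct xfs_inode	*ip)
{
	int			error;

	/* relog the inode core so it moves forward with the log head */
	xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);

	/* commit this transaction and start the next one in the chain */
	error = xfs_trans_roll(tpp);
	if (!error)
		xfs_trans_ijoin(*tpp, ip, 0);
	return error;
}

That's one small, fixed size object relogged per roll vs. an arbitrary
number of outstanding intents relogged per roll here.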

> > Do we know how many CPUs are concurrently
> > processing "long chain" operations?
> 
> Yes: as many as the grant head reservations will allow. That's
> always been the bound on maximum transaction concurrency we can
> support.
> 
> > Do we know whether an external
> > workload is causing even more frequent checkpoints than governed by the
> > CIL size limit?
> 
> Irrelevant, because checkpoint frequency affects the relogging
> efficiency of -everything- that running transactions touch. i.e. the
> impact is not isolated to relogging a few small intents, it affects
> relogging of all metadata, from AG headers to btree blocks,
> directories and inodes. IOWs, the overhead of relogging intents
> will be lost in the overhead of completely relogging entire dirty
> metadata blocks that the transactions also touch, even though they
> may have only newly dirtied a few bytes in each buffer.
> 
> > The examples in the commit logs in this series refer to
> > chains of hundreds of intents with hundreds of transaction rolls.
> 
> Truncating a file with 10 million extents will roll the transaction
> 10 million times. Having a long intent chain is no different from a
> "don't pin the tail of the log" perspective than truncating a
> badly fragmented file. Both need to relog their outstanding items
> that could pin the tail of the log as frequently as they commit...
> 
> This is a basic principle of permanent transactions: they must
> guarantee forwards progress by ensuring they can't pin the tail of
> the log at any time. That is, if you hold an item that *may* pin the
> tail of the log when you need to regrant log space, you need to move
> those items to the head of the log -before- you attempt to regrant
> log space. We do not do that with intents, and conditional relogging
> doesn't provide that guarantee, either. Relogging the item once the
> item is no longer at the head of the log provides that forwards
> progress guarantee.
> 

It sounds like there's a correctness argument buried in here. If you
think the tail lsn threshold approach is not fundamentally correct, can
you please elaborate on that? I think fundamental correctness is far
more important at this early stage than the prospective impact of
cacheline contention caused by an implementation detail.

> I don't want to slap a band-aid over the problem in the name of
> "performance". Correctness comes first, then if we find a
> performance problem we'll address that without breaking correctness.
> 

...

> > > However, when we are under heavy load the code will now be reading
> > > the grant head and log position accounting variables during every
> > > commit, hence greatly increasing the number and temporal
> > > distribution of accesses to the hottest cachelines in the log. We
> > > currently never access these cache lines during commit unless the
> > > unit reservation has run out and we have to regrant physical log
> > > space for the transaction to continue (i.e. we are into slow path
> > > commit code). IOWs, this is like causing far more than double the
> > > number of accesses to the grant head, the log tail, the
> > > last_sync_lsn, etc, all of which is unnecessary exactly when we care
> > > about minimising contention on the log space accounting variables...
> > > 
> > If we're under sustained tail pushing pressure, blocking transactions
> > have already hit this cacheline as well, FWIW.
> 
> No, tail pushing != blocking. If the AIL pushing is keeping up with
> the targets that are being set (as frequently happens with large
> logs and fast storage), then we rarely run out of log space and we
> do not block new reservations waiting for reservation space.
> 

Just pointing out that there's another prospective path for a
transaction to hit the cacheline. As you've already pointed out, it's
quite easy to reach reservation blocking conditions before the AIL is
even involved.

> > But if we're concerned about cacheline access in the commit path
> > specifically, I'm sure we could come up with any number of optimizations
> > to address that directly. E.g., we could sample the threshold
> > occasionally instead of on every roll, wait until the current
> > transaction has consumed all logres units (where we'll be hitting that
> > line anyways), shift state tracking over to xfsaild via setting a flag
> > on log items with an ->iop_relog() handler set that happen to pin the
> > tail, etc. etc. That said, I do think any such added complexity should
> > be justified with some accompanying profiling data (and I like the idea
> > of new stats counters mentioned in the separate mail).
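
To be concrete about the "wait until the logres units are consumed"
option, I mean something like the following tweak at the top of
xfs_defer_relog(). This is just a sketch and I'm writing the ticket
field usage from memory:

	/*
	 * Hypothetical: don't bother sampling the grant heads while this
	 * transaction still has spare log reservation units. Once we're
	 * down to the last unit, the next roll is going to hit the slow
	 * regrant path (and those cachelines) anyway, so only then check
	 * whether any of our intents have fallen behind the push threshold.
	 */
	if ((*tpp)->t_ticket->t_cnt > 1)
		return 0;

	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
	if (threshold_lsn == NULLCOMMITLSN)
		return 0;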

> Yes, there's lots of ways the pushing threshold could be done
> differently, but to focus on that misses the point that intents
> violate the forwards progress guarantees rolling transactions are
> supposed to provide to the log. I simply made the "access overhead"
> argument as a supporting point that this change is also likely to
> have unintended side effects...
> 

The argument stated was that the relog filtering was redundant logic. I
think that misinterprets the logic. If there's a correctness issue, then
I'd much prefer to focus on that...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 