Re: [PATCH 00/10] xfs: automatic relogging

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 20 Jul 2020 13:58:40 +1000

On Thu, Jul 09, 2020 at 08:15:30AM -0400, Brian Foster wrote:
> On Wed, Jul 08, 2020 at 09:44:28AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 07, 2020 at 07:37:43AM -0400, Brian Foster wrote:
> > > > 
> > > 
> > > Thanks. I think I get the general idea. We're reworking the
> > > ->iop_relog() handler to complete and replace the current intent (rather
> > > than just relog the original intent, which is what this series did for
> > > the quotaoff case) in the current dfops transaction and allow the dfops
> > > code to update its reference to the item. The part that's obviously
> > > missing is some kind of determination on when we actually need to relog
> > > the outstanding intents vs. using a fixed roll count.
> > 
> > <nod>  I don't consider myself sufficiently ail-smart to know how to do
> > that part. :)
> > 
> > > I suppose we could do something like I was mentioning in my other reply
> > > on the AIL pushing issue Dave pointed out where we'd set a bit on
> > > certain items that are tail pinned and in need of relog. That sounds
> > > like overkill given this use case is currently self-contained to dfops.
> > 
> > That might be a useful optimization -- every time defer_finish rolls the
> > transaction, check the items to see if any of them have
> > XFS_LI_RELOGMEPLEASE set, and if any of them do, or we hit our (now
> > probably higher than 7) fixed roll count, we'll relog as desired to keep
> > the log moving forward.
> > 
> 
> It's an optimization in some sense to prevent unnecessary relogs, but
> the intent would be to avoid the need for a fixed count by notifying
> when a relog is needed to a transaction that should be guaranteed to
> have the reservation necessary to do so. I'm not sure it's worth the
> complexity if there were some reason we still needed to fall back to a
> hard count.

FWIW, relogging would only ever be necessary if
xfs_log_item_in_current_chkpt() returned false for an item we are
considering relogging. Otherwise, it's already queued for the next
journal checkpoint and there's no need to relog it until the
checkpoint commits....

> > > Perhaps the other idea of factoring out the threshold determination
> > > logic from xlog_grant_push_ail() might be useful.
> > > 
> > > For example, if the current free reservation is below the calculated
> > > threshold (with need_bytes == 0), return a threshold LSN based on the
> > > current tail. Instead of using that to push the AIL, compare it to
> > > ->li_lsn of each intent and relog any that are inside the threshold LSN
> > > (which will probably be all of them in practice since they are part of
> > > the same transaction). We'd probably need to identify intents that have
> > > been recently relogged so the process doesn't repeat until the CIL
> > > drains and the li_lsn eventually changes. Hmm.. I did have an
> > > XFS_LI_IN_CIL state tracking patch around somewhere for debugging
> > > purposes that might actually be sufficient for that. We could also
> > > consider stashing a "relog push" LSN somewhere (similar to the way AIL
> > > pushing works) and perhaps use that to avoid repeated relogs on a chain,
> > > but it's not immediately clear to me how well that would fit into the
> > > dfops mechanism...
> > 
> > ...is there a sane way for dfops to query the threshold LSN so that it
> > could compare against the li_lsn of each item it holds?
> > 
> 
> I'd start with just trying to reuse the logic in xlog_grant_push_ail()
> (i.e. just factor out the AIL push). That function starts with a check
> on available log reservation to filter out unnecessary pushes, then
> calculates the AIL push LSN by adding the amount of log space we need to
> free up to the current log tail. In this case we're not pushing the AIL,
> but I think we'd be able to use the same threshold calculation logic to
> determine when to relog intents that happen to reside within the range
> from the current tail to the calculated threshold.

Hmmmm. I'm kinda wanting to pull the AIL away from the demand-based
tail pushing that xlog_grant_push_ail() does. Distance from the tail
doesn't really tell us how quickly that distance will be consumed
by ongoing operations....

One of the problems we have at the moment is that the AIL will sit
at under 75% full and do nothing at all (because the
xlog_grant_push_ail() call does nothing at <75% full) until we get
memory pressure or the log worker comes along every 30s and calls
xfs_ail_push_all().

The result is that when we are under bursty workloads, we don't
keep pushing metadata out to free up log space - our working space
to soak up bursts is only 25% of the journal, which we can fill in a
couple of seconds, even on a 2GB log. SO under sustained bursty
worklaods, we really only have 25% of the log available at most,
rather than continuing to write back past the 75% threshold so the
next burst can have, say, 50% of the log space available to soak up
the burst.

i.e. if we have frequent bursts and the IO subsystem can soak up the
metadata writeback rate, we should be writing back faster so that
the bursts can hit the transaction reservation fast path for
longer...

IOWs, I'd like to see the AIL move more towards a mechanism that
balances a time-averaged rate of inserts (journal checkpoint
completions) with a time-averaged rate of removals (metadata IO
completions) rather than working to a free fixed space target.
If the workload is sustained, then this effective ends up the same
as we have now with transactions waiting for log reservation space,
but we should end up draining the AIL down further when than we
currently do when incoming work tails off.

I suspect that we could work into this a "need to be relogged within
X seconds" trigger for active reloggable items in the AIL, such that
if the top layer deferops sees the flag set on any item in the
processing of the deferops it relogs all the reloggable items in the
current set...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx