Re: [PATCH 00/10] xfs: automatic relogging

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 26 Aug 2020 08:17:10 -0400

On Mon, Jul 20, 2020 at 01:58:40PM +1000, Dave Chinner wrote:
> On Thu, Jul 09, 2020 at 08:15:30AM -0400, Brian Foster wrote:
> > On Wed, Jul 08, 2020 at 09:44:28AM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 07, 2020 at 07:37:43AM -0400, Brian Foster wrote:
> > > > > 
> > > > 
> > > > Thanks. I think I get the general idea. We're reworking the
> > > > ->iop_relog() handler to complete and replace the current intent (rather
> > > > than just relog the original intent, which is what this series did for
> > > > the quotaoff case) in the current dfops transaction and allow the dfops
> > > > code to update its reference to the item. The part that's obviously
> > > > missing is some kind of determination on when we actually need to relog
> > > > the outstanding intents vs. using a fixed roll count.
> > > 
> > > <nod>  I don't consider myself sufficiently ail-smart to know how to do
> > > that part. :)
> > > 
> > > > I suppose we could do something like I was mentioning in my other reply
> > > > on the AIL pushing issue Dave pointed out where we'd set a bit on
> > > > certain items that are tail pinned and in need of relog. That sounds
> > > > like overkill given this use case is currently self-contained to dfops.
> > > 
> > > That might be a useful optimization -- every time defer_finish rolls the
> > > transaction, check the items to see if any of them have
> > > XFS_LI_RELOGMEPLEASE set, and if any of them do, or we hit our (now
> > > probably higher than 7) fixed roll count, we'll relog as desired to keep
> > > the log moving forward.
> > > 
> > 
> > It's an optimization in some sense to prevent unnecessary relogs, but
> > the intent would be to avoid the need for a fixed count by notifying
> > when a relog is needed to a transaction that should be guaranteed to
> > have the reservation necessary to do so. I'm not sure it's worth the
> > complexity if there were some reason we still needed to fall back to a
> > hard count.
> 
> FWIW, relogging would only ever be necessary if
> xfs_log_item_in_current_chkpt() returned false for an item we are
> considering relogging. Otherwise, it's already queued for the next
> journal checkpoint and there's no need to relog it until the
> checkpoint commits....
> 
> > > > Perhaps the other idea of factoring out the threshold determination
> > > > logic from xlog_grant_push_ail() might be useful.
> > > > 
> > > > For example, if the current free reservation is below the calculated
> > > > threshold (with need_bytes == 0), return a threshold LSN based on the
> > > > current tail. Instead of using that to push the AIL, compare it to
> > > > ->li_lsn of each intent and relog any that are inside the threshold LSN
> > > > (which will probably be all of them in practice since they are part of
> > > > the same transaction). We'd probably need to identify intents that have
> > > > been recently relogged so the process doesn't repeat until the CIL
> > > > drains and the li_lsn eventually changes. Hmm.. I did have an
> > > > XFS_LI_IN_CIL state tracking patch around somewhere for debugging
> > > > purposes that might actually be sufficient for that. We could also
> > > > consider stashing a "relog push" LSN somewhere (similar to the way AIL
> > > > pushing works) and perhaps use that to avoid repeated relogs on a chain,
> > > > but it's not immediately clear to me how well that would fit into the
> > > > dfops mechanism...
> > > 
> > > ...is there a sane way for dfops to query the threshold LSN so that it
> > > could compare against the li_lsn of each item it holds?
> > > 
> > 
> > I'd start with just trying to reuse the logic in xlog_grant_push_ail()
> > (i.e. just factor out the AIL push). That function starts with a check
> > on available log reservation to filter out unnecessary pushes, then
> > calculates the AIL push LSN by adding the amount of log space we need to
> > free up to the current log tail. In this case we're not pushing the AIL,
> > but I think we'd be able to use the same threshold calculation logic to
> > determine when to relog intents that happen to reside within the range
> > from the current tail to the calculated threshold.
> 
> Hmmmm. I'm kinda wanting to pull the AIL away from the demand-based
> tail pushing that xlog_grant_push_ail() does. Distance from the tail
> doesn't really tell us how quickly that distance will be consumed
> by ongoing operations....
> 
> One of the problems we have at the moment is that the AIL will sit
> at under 75% full and do nothing at all (because the
> xlog_grant_push_ail() call does nothing at <75% full) until we get
> memory pressure or the log worker comes along every 30s and calls
> xfs_ail_push_all().
> 
> The result is that when we are under bursty workloads, we don't
> keep pushing metadata out to free up log space - our working space
> to soak up bursts is only 25% of the journal, which we can fill in a
> couple of seconds, even on a 2GB log. SO under sustained bursty
> worklaods, we really only have 25% of the log available at most,
> rather than continuing to write back past the 75% threshold so the
> next burst can have, say, 50% of the log space available to soak up
> the burst.
> 
> i.e. if we have frequent bursts and the IO subsystem can soak up the
> metadata writeback rate, we should be writing back faster so that
> the bursts can hit the transaction reservation fast path for
> longer...
> 

(Sorry for the delayed response. I ended up on leave before getting to
this and had to catch back up).

Sure, this is all characteristic of the current AIL pushing mechanism.
The objective above was just to identify how an external component
(dfops) could determine when to efficiently relog a particular item
pinned by the AIL without having to resort to guessing or crude roll
counts, etc. The solution is to essentially reuse AIL logic (flaws and
all) to reasonably determine whether the associated item is pinning the
tail against push pressure. If there's reason to significantly rework
the fundamental AIL pushing mechanism, this use case can most likely
pick up and continue to reuse whatever logic results from that without
much trouble.

> IOWs, I'd like to see the AIL move more towards a mechanism that
> balances a time-averaged rate of inserts (journal checkpoint
> completions) with a time-averaged rate of removals (metadata IO
> completions) rather than working to a free fixed space target.
> If the workload is sustained, then this effective ends up the same
> as we have now with transactions waiting for log reservation space,
> but we should end up draining the AIL down further when than we
> currently do when incoming work tails off.
> 

I'm not quite sure I follow... what this basically sounds like is we'd
have some insert side tracking calculating and storing a running rate of
something like 'items inserted per second' somewhere in the xfs_ail, and
then rather than simply receiving a push target, xfsaild becomes a bit
more self-deterministic and uses that current insert rate to set or
adjust the push target itself. Or perhaps does so based on an analogous
tail moving rate.. hm?

I guess I can see how the mechanics of something like that would work,
but I'm not totally clear on what the rate and translation to push
target would look like. For one, items and checkpoints are arbitrary
sizes, so it seems that something based on consumed log space (i.e.
bytes) might be more accurate. Inserts can also move (i.e. relog) items
that are already AIL resident, so that should be considered somehow
(does it affect the rate?). The current scheme also incorporates log
reservation, which seems like a critical factor since it depends on free
log space. It's not clear to me how an I/O based insert rate alone might
encapsulate all of that, but I'm sure I'm missing much of the bigger
picture..

> I suspect that we could work into this a "need to be relogged within
> X seconds" trigger for active reloggable items in the AIL, such that
> if the top layer deferops sees the flag set on any item in the
> processing of the deferops it relogs all the reloggable items in the
> current set...
> 

Right.. I haven't gone back and reviewed the earlier discussion, but I'm
pretty sure we considered such a flag for the current approach. That it
wasn't necessary atm was just an implementation detail because the
current push logic operates with fixed LSNs and the log item knows its
place in the log...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>