Re: [PATCH RFC] xfs: automatic log item relogging experiment

On Mon, Oct 28, 2019 at 11:20:38AM -0700, Darrick J. Wong wrote:
> [/me finally jumps in with his perspectives]
> 
> On Mon, Oct 28, 2019 at 10:26:12AM -0400, Brian Foster wrote:
> > On Sat, Oct 26, 2019 at 10:16:28AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 25, 2019 at 08:41:17AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 25, 2019 at 09:43:08AM +1100, Dave Chinner wrote:
> > > > > On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> > > > > > An experimental mechanism to automatically relog specified log
> > > > > > items.  This is useful for long running operations, like quotaoff,
> > > > > > that otherwise risk deadlocking on log space usage.
> > > > > > 
> > > > > > Not-signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > > > ---
> > > > > > 
...
> > 
> > > i.e. if the intent falls off the tail of the log, then we have the
> > > situation of having a partial operation already on stable storage
> > > but no way of tracking how much work has been done. Hence I think we
> > > must confine relogging of intents to a single permanent transaction
> > > context, and intents that can get relogged must have an intent done
> > > logged with them to indicate progress that has been made. And, of
> > > course, log recovery will need mods to handle this, too.
> > > 
> > 
> > I'm not quite following what you mean by an intent falling off the tail
> > of the log. This patch modifies quotaoff because it presents a simple
> > and preexisting example of the problem it attempts to solve. Does this
> > scenario apply to that use case, or are you considering a more involved
> > use case here (such as btree reconstruction)?
> 
> /me wonders, for quotaoff we end up with a transaction sequence:
> 
> [start quotaoff] <potentially large pile of transactions> [end quotaoff]
> 
> And AFAICT the quotaoff item makes it so that the log recovery code
> doesn't bother recovering dquot items, right?  So I'm not sure what
> intermediate progress information we need to know about?  Either we've
> finished quotaoff and won't be logging more dquots, or we haven't.
> 
> I might just be confused, but I think your worry here is that something
> like this will happen?
> 
> Log an intent, start doing the work:
> 
> [log intent][start doing work]..........................
> 
> Then other threads log some other work and the head gets close to the end:
> 
> ............[start doing work][other work]..............
> 
> So we relog the intent and the other threads continue logging more work.
> Then the log wraps, zapping the first intent:
> 
> [new work]..[start doing work][other work][relog intent]
> 
> Now we crash.  Log recovery recovers "start doing work", then it
> recovers "other work", then it notes "relog intent", and finally it
> recovers "new work".  Next, it decides to restart "relog intent", which
> then trips over the filesystem because the item recovery code is too
> dumb to realize that we already did some, but not all of, the intended
> work?
> 

Hm, Ok.. that sort of makes sense as a problem statement. I don't see
how it necessarily applies to the mechanism in this patch, though,
because the use case covers intents that don't carry partial progress as
such. For example, from a quotaoff recovery standpoint there's no real
difference between seeing the original intent and a relogged variant
(the issue of seeing both notwithstanding). My understanding is that the
same sort of thing applies to the repair use case: even though an EFI
does support partial progress (via the EFD, as Dave points out), the
free operations would only ever occur in error scenarios.

> (Did I get that right?)
> 
> At least for repair the transaction sequence looks roughly like this:
> 
> [allocate, EFI0][allocate, EFI1] <long wait to write new btree> \
> 	[commit root, EFD0, EFD1, EFI2, EFI3][free, EFD2][free, EFD3]
> 
> EFI[01] are for the new btree blocks, EFI[23] are to kill the old ones.
> 

Yep, that's my understanding from your previous descriptions. So
relogging (at least as defined by this version of the RFC) only really
applies to EFI[01], and only for the duration of the btree
reconstruction. Relogging obviously needs to be cancelled so the
associated EFDs can be committed on repair completion (success), but the
same would probably hold true if the repair failed and we actually
wanted to return the blocks to the free space trees instead of shutting
down.

> So there's not really any intermediate progress that's going through the
> log -- either we're ready to commit the new root and mess around with
> our EFI set, or we're not.  There's no intermediate progress to trip
> over.
> 

Indeed.

> (Also it occurs to me just now that since btree reconstruction uses
> delwri_submit before committing the new root, it really ought to force
> the blocks to media via an explicit flush at the end or something to
> make sure that we've really committed the new btree blocks to stable
> storage, right?)
> 

IIRC the btree block population is all unlogged changes (hence the
delwri queue), so a flush probably makes sense. Making the final root
commit transaction synchronous might be appropriate, since that
eliminates the risk of losing the repair to a crash after xfs_scrub
returns from kernel space but before the transaction actually commits to
the physical log. If that sync trans commit occurs after delwri queue
I/O completion, the REQ_PREFLUSH associated with log buffer I/O might be
sufficient to guarantee the new btree is persistent.

...
> > > 
> > 
> > Ok. The context provided above helps and I like the idea in general.
> > I'm still missing some pieces around how this would be used in something
> > like the btree reconstruction case, however.
> > 
> > Suppose the intents that need relogging are committed with associated
> > callbacks and context, the btree reconstruction is in progress and the
> > callback is invoked. Would the callback handler simply just roll the
> > transaction in this case and relog the attached items? (If the former,
> > isn't it eventually a deadlock vector to roll from log commit context
> > once the rolling transaction runs out of ticket counts?). Or are you
> > anticipating something more complex in terms of the callback notifying
> > the repair sequence (somehow) that it needs to relog its rolling context
> > transaction at the next opportunity?
> 
> I was thinking (this morning, on IRC :P) that log items could have a
> "hey we're getting full, please relog" handler.  When the head gets more
> than (say) 75% of the log size past the tail, we call the handler, if
> one was supplied.
> 
> Repair of course supplies a handler, so it just kicks off a workqueue
> item to allocate a new transaction (hah!) that logs an intent done for
> the existing intent, logs a new identical intent, stuffs that back into
> repair's bookkeeping, and commits the transaction...
> 

Yep, I think that's what Dave was getting at with regard to resurrecting
the old transaction callback. That would be invoked at roughly the same
point where this patch hardcodes adding to the relog list (i.e. once per
CIL checkpoint) and would generally trigger the relog/roll event for
tracked items.

The question at that point is more of how to manage the transaction (and
associated reservation) and general communication between the original
intent committer context, the ongoing relog context (i.e. the relog
task/workqueue) and then the original context again in order to cancel
any further relogging of a particular intent that is about to complete.

In thinking more about it, I'm starting to wonder whether the initial
transaction roll approach makes sense. For one, the initial context
can't hold a memory reference to a transaction that some other context
is going to roll, because that transaction just becomes freed memory on
the first roll event. It might make more sense for the initial
transaction context to roll internally and transfer the new tp to
another context, but that is slightly shady too in that rolled
transactions share a log ticket, and the log ticket is explicitly tied
to the allocating task for reservation scheduling purposes (i.e. see
->t_ticket->t_task).

The more I think about that, the more it _seems_ logical to do something
analogous to the CIL itself, where the relog context holds a reservation
ticket and the transaction processing is abstracted slightly from the
reservation management. That might allow several independent tasks to
acquire and transfer relog reservation to a central relog context/ticket
that could potentially relog many (unrelated) items in its own
transaction(s). That gets a little hairy itself because we essentially
have a dynamic reservation ticket where various relogged items might be
coming and going after each roll, etc., so I'd definitely need to think
about that some more. Perhaps it's good enough for a POC to just let the
relog task allocate its own fixed size transaction for the time being
(as you suggest above) and work out the proper reservation management as
a next step..

Brian

> > I also think it's worth distinguishing between quotaoff and the repair
> > use case at this point. As noted above, this doesn't really address the
> > former and if we do take this callback approach, I'd like to find a way
> > to apply it there if at all possible. Admittedly, quotaoff is a bit of a
> > hacky use case as it is, so I'd even be fine with something along the
> > lines of running it in two separate (but coordinated) tasks[1] (i.e., one
> > dedicated to rolling and relogging the start intent and another to do
> > the actual quotaoff work), as long as the approach is ultimately safe
> > and resolves the problem. Thoughts?
> > 
> > [1] That could be anything from a specific quotaoff task to a more
> > generic background relog worker that could be shared across users and
> > batch to fewer transactions.
> 
> ...like this, perhaps.
> 
> > > > FWIW, I'm also wondering if this lends itself to some form of batching
> > > > for if we get to the point of relogging a large number of small items.
> > > > For example, consider a single dedicated/relogging transaction for many
> > > > small (or related) intents vs. N independent transactions processing in
> > > > the background. This is something that could come later once the
> > > > fundamentals are worked out, however.
> > > 
> > > That's exactly the problem I see needing to be solved for the
> > > "rebuild btrees in free space" type of long running operation that
> > > repair will need to run.
> > > 
> > > i.e. an out-of-line btree rebuild that involves allocating space
> > > that should be freed again if the rebuild fails or we crash during
> > > the rebuild. Hence an EFI needs to be held for each allocation we
> > > do until we get the tree rebuilt and do the atomic swap of the root
> > > block. That atomic swap transaction also needs to commit all the
> > > EFDs, as the space is now in use and we need to cancel the EFIs.(+)
> > > 
> > > Hence the rolling transaction will need to relog the primary "btree
> > > rebuild in progress" intent/intent-done pair as well as all the
> > > EFI/EFD pairs it holds for the extents it has allocated so far.
> > > The subsystem will have to co-ordinate the intent commit callback
> > > notification with its ongoing transactional work that is done under
> > > its rolling transaction context.
> > > 
> > > IOWs, there's a whole lot more work needed than just updating a
> > > single intent pair in "btree rebuild" situation like this. I also
> > > don't really see how a "log internal" relogging mechanism based on
> > > stealing reservations will be able to handle the complexity of
> > > atomic multiple intent state updates. That requires
> > > subsystem/application specific knowledge to understand the
> > > relationship between all the intents that need to be relogged....
> > > 
> > 
> > Yeah, though to be fair you're attributing more responsibility and thus
> > more complexity than I initially intended for this mechanism. This was
> > intended to be a "don't deadlock on this unchanged item" mechanism and
> > thus fairly isolated/limited in use. I'd compare it to something like
> > ordered buffers in terms of complexity level. I.e., a low level
> > mechanism for specific use cases and to be managed/understood properly
> > by associated users.
> > 
> > I've no issue with moving in the direction discussed here to facilitate
> > more complex higher level mechanisms (like tracking operational progress
> > of btree reconstruction, etc.), I'm just saying that the complexity
> > argument you make here changes as the requirements do.
> > 
> > > (+) The deferops and/or EFI code probably needs modification to
> > > support non-freeing, delayed EFD processing like this - it needs to
> > > log and track the intents it holds and relog them, but not free them
> > > or run EFDs until a later point in time. i.e. it becomes 2-part
> > > deferred op mechanism, with separate control of the intent-done
> > > processing phase. I'd like to use this mechanism for DIO and
> > > buffered writeback allocation (so we don't need to use unwritten
> > > extents), but I haven't had time to dig into it yet...
> 
> Heh, I suspected this was going to come up in this discussion. :)
> 
> > I can definitely see opening up the dfops interface to drive/control
> > intent relogging vs. progress updates vs. completion. This is somewhat
> > complicated by tricks like reusing certain intents in non-traditional
> > ways, such as completing an EFI with an EFD without actually freeing
> > blocks. Regardless, my questions to this point are more around
> > usage/semantics of the log commit callback and using it to manage
> > transaction rolls. Once those building blocks are settled, I'm sure we
> > can work out a reasonable interface.
> 
> <nod>
> 
> --D
> 
> > 
> > Brian
> > 
> > > > All in all this sounds interesting. I still need to grok the transaction
> > > > reservation/ownership semantics you propose re: the questions above, but
> > > > I do like the prospect of reusing existing mechanisms and thus being
> > > > able to more easily handle generic log items. Thanks for the
> > > > feedback...
> > > 
> > > Yeah, I'd much prefer to implement something that fits within the
> > > existing model, too :)
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@xxxxxxxxxxxxx
> > 
