Re: [PATCH RFC] xfs: automatic log item relogging experiment

Brian Foster <bfoster@xxxxxxxxxx> · Fri, 25 Oct 2019 08:41:17 -0400

On Fri, Oct 25, 2019 at 09:43:08AM +1100, Dave Chinner wrote:
> On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> > An experimental mechanism to automatically relog specified log
> > items.  This is useful for long running operations, like quotaoff,
> > that otherwise risk deadlocking on log space usage.
> > 
> > Not-signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > ---
> > 
> > Hi all,
> > 
> > This is an experiment that came out of a discussion with Darrick[1] on
> > some design bits of the latest online btree repair work. Specifically,
> > it logs free intents in the same transaction as block allocation to
> > guard against inconsistency in the event of a crash during the repair
> > sequence. These intents happen pin the log tail for an indeterminate
> > amount of time. Darrick was looking for some form of auto relog
> > mechanism to facilitate this general approach. It occurred to us that
> > this is the same problem we've had with quotaoff for some time, so I
> > figured it might be worth prototyping something against that to try and
> > prove the concept.
> 
> Interesting idea. :)
> 
> > 
> > Note that this is RFC because the code and interfaces are a complete
> > mess and this is broken in certain ways. This occasionally triggers log
> > reservation overrun shutdowns because transaction reservation checking
> > has not yet been added, the cancellation path is overkill, etc. IOW, the
> > purpose of this patch is purely to test a concept.
> 
> *nod*
> 
> > The concept is essentially to flag a log item for relogging on first
> > transaction commit such that once it commits to the AIL, the next
> > transaction that happens to commit with sufficient unused reservation
> > opportunistically relogs the item to the current CIL context. For the
> > log intent case, the transaction that commits the done item is required
> > to cancel the relog state of the original intent to prevent further
> > relogging.
> 
> Makes sense, but it seems like we removed the hook that would be
> used by transactions to implement their own relogging on CIL commit
> some time ago because nothign had used it for 15+ years....
> 

Interesting, I'm not familiar with this...

> > In practice, this allows a log item to automatically roll through CIL
> > checkpoints and not pin the tail of the log while something like a
> > quotaoff is running for a potentially long period of time. This is
> > applied to quotaoff and focused testing shows that it avoids the
> > associated deadlock.
> 
> Hmmm. How do we deal with multiple identical intents being found in
> checkpoints with different LSNs in log recovery?
> 

Good question. I was kind of thinking they would be handled like
normally relogged items, but I hadn't got to recovery testing yet. For
quotaoff, no special handling is required because we simply turn off
quota flags (i.e. no filesystem changes are required) when the intent is
seen and don't do anything else with the log item. FWIW, we don't even
seem to check for an associated quotaoff_end item.

For something more involved like an EFI, it looks like we'd create a
duplicate log item for the intent and I suspect that would lead to a
double processing attempt. So this would require some further changes to
handle generic intent relogging properly. I don't think it's that
difficult of a problem; we do already allow for relogs of other things
obviously, we just don't currently do any tracking of already seen
intents. That said, this could probably be considered an ABI change if
older kernels don't know how to process the condition correctly, so we'd
have to guard against that with a feature bit or some such when a
filesystem isn't unmounted cleanly.

More broadly, this implies that we don't ever currently relog intents so
this is something that will come along with any implementation that
intends to do so unless we adopt an alternative along the lines of what
dfops processing does. Dfops completes the original intent and creates a
new one on each internal transaction roll. My early thought was that
would be overkill for relogs that are more intended to facilitate
unrelated progress (i.e. not pin the log tail) as opposed to track
progress in a particular operation (i.e. dfops freed 1 extent out of a 2
extent EFI, roll, complete the original EFI and create a new EFI with 1
extent left). The item never really changes in the first case where it
does in the second.

> > Thoughts, reviews, flames appreciated.
> > 
> > [1] https://lore.kernel.org/linux-xfs/20191018143856.GA25763@bfoster/
> > 
> >  fs/xfs/xfs_log_cil.c     | 69 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_log_priv.h    |  6 ++++
> >  fs/xfs/xfs_qm_syscalls.c | 13 ++++++++
> >  fs/xfs/xfs_trace.h       |  2 ++
> >  fs/xfs/xfs_trans.c       |  4 +++
> >  fs/xfs/xfs_trans.h       |  4 ++-
> >  6 files changed, 97 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index a1204424a938..b2d8b2d54df6 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -75,6 +75,33 @@ xlog_cil_iovec_space(
> >  			sizeof(uint64_t));
> >  }
> >  
> > +static void
> > +xlog_cil_relog_items(
> > +	struct xlog		*log,
> > +	struct xfs_trans	*tp)
> > +{
> > +	struct xfs_cil		*cil = log->l_cilp;
> > +	struct xfs_log_item	*lip;
> > +
> > +	ASSERT(tp->t_flags & XFS_TRANS_DIRTY);
> > +
> > +	if (list_empty(&cil->xc_relog))
> > +		return;
> > +
> > +	/* XXX: need to check trans reservation, process multiple items, etc. */
> > +	spin_lock(&cil->xc_relog_lock);
> > +	lip = list_first_entry_or_null(&cil->xc_relog, struct xfs_log_item, li_cil);
> > +	if (lip)
> > +		list_del_init(&lip->li_cil);
> > +	spin_unlock(&cil->xc_relog_lock);
> > +
> > +	if (lip) {
> > +		xfs_trans_add_item(tp, lip);
> > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +		trace_xfs_cil_relog(lip);
> > +	}
> 
> I don't think this is safe - the log item needs to be locked to be
> joined to a transaction. Hence this can race with whatever is
> committing the done intent on the object and removing it from the
> relog list and so the item could be clean (and potentially even
> freed) by the time we go to add it to this transaction and mark it
> dirty again...
> 

This was targeted to intents for the time being, which AFAIA are
unshared structures in that there is no associated metadata object lock.
It's possible I've missed something by narrowly focusing on the quotaoff
case, however. I probably should have been more clear on that in the
commit log. For such intents, the item can't be dropped from the AIL (or
freed) until the done item commits, otherwise we'd have fs consistency
issues.

I was planning to eventually look into more generic log item relogging
if this basic experiment panned out. That would certainly be more
complicated in terms of having to transfer and manage lock/reference
state of the associated metadata objects, etc., as you point out.

> > +}
> > +
> >  /*
> >   * Allocate or pin log vector buffers for CIL insertion.
> >   *
> > @@ -1001,6 +1028,8 @@ xfs_log_commit_cil(
> >  	struct xfs_log_item	*lip, *next;
> >  	xfs_lsn_t		xc_commit_lsn;
> >  
> > +	xlog_cil_relog_items(log, tp);
> 
> Hmmm. This means that when there are relog items on the list, all
> the transactional concurrency is going to be put onto the
> xc_relog_lock spin lock. THis is potentially a major lock contention
> point, especially if there are lots of items that need relogging.
> 

Potentially, yes. The use cases I was considering were basically limited
to quotaoff and scrub, so I'm not sure performance is a huge concern for
now. The approach is opportunistic in general, so that also means this
could be implemented to avoid this contention without reducing the
effectiveness (once some actual reservation checking and whatnot is in
place).

> > +
> >  	/*
> >  	 * Do all necessary memory allocation before we lock the CIL.
> >  	 * This ensures the allocation does not deadlock with a CIL
> > @@ -1196,6 +1225,8 @@ xlog_cil_init(
> >  	spin_lock_init(&cil->xc_push_lock);
> >  	init_rwsem(&cil->xc_ctx_lock);
> >  	init_waitqueue_head(&cil->xc_commit_wait);
> > +	INIT_LIST_HEAD(&cil->xc_relog);
> > +	spin_lock_init(&cil->xc_relog_lock);
> >  
> >  	INIT_LIST_HEAD(&ctx->committing);
> >  	INIT_LIST_HEAD(&ctx->busy_extents);
> > @@ -1223,3 +1254,41 @@ xlog_cil_destroy(
> >  	kmem_free(log->l_cilp);
> >  }
> >  
> > +void
> > +xfs_cil_relog_item(
> > +	struct xlog		*log,
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_cil		*cil = log->l_cilp;
> > +
> > +	ASSERT(test_bit(XFS_LI_RELOG, &lip->li_flags));
> > +	ASSERT(list_empty(&lip->li_cil));
> 
> So this can't be used for things like inodes and buffers?
> 

Not at this stage, but I think it's worth considering from a design
standpoint because I'd expect to be able to cover that case eventually.
If there are big picture issues/concerns, it's best to think about them
now. ;)

> > +	spin_lock(&cil->xc_relog_lock);
> > +	list_add_tail(&lip->li_cil, &cil->xc_relog);
> > +	spin_unlock(&cil->xc_relog_lock);
> > +
> > +	trace_xfs_cil_relog_queue(lip);
> > +}
> > +
> > +bool
> > +xfs_cil_relog_steal(
> > +	struct xlog		*log,
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_cil		*cil = log->l_cilp;
> > +	struct xfs_log_item	*pos, *ppos;
> > +	bool			ret = false;
> > +
> > +	spin_lock(&cil->xc_relog_lock);
> > +	list_for_each_entry_safe(pos, ppos, &cil->xc_relog, li_cil) {
> > +		if (pos == lip) {
> > +			list_del_init(&pos->li_cil);
> > +			ret = true;
> > +			break;
> > +		}
> > +	}
> > +	spin_unlock(&cil->xc_relog_lock);
> 
> This is a remove operation, not a "steal" operation. But why are
> we walking the relog list to find it? It would be much better to use
> a flag to indicate what list the item is on, and then this just
> becomes
> 

Er, yeah this is a mess. I was actually planning to do exactly this,
only with the dirty bit instead of a custom bit. The reason for the
current code is basically due to a combination of some bugs I had in the
code and things getting implemented out of order such that I already had
this function by the time I realized I could just use the dirty bit. I
didn't feel the need to rework it for the purpose of the POC. Of course
that depends on some of the same intent-only assumptions as discussed
above, so a relog specific bit is probably a more appropriate long term
solution.

> 	spin_lock(&cil->xc_relog_lock);
> 	if (test_and_clear_bit(XFS_LI_RELOG_LIST, &lip->li_flags))
> 		list_del_init(&pos->li_cil);
> 	spin_unlock(&cil->xc_relog_lock);
> 
> 
> 
> > @@ -556,6 +558,16 @@ xfs_qm_log_quotaoff_end(
> >  					flags & XFS_ALL_QUOTA_ACCT);
> >  	xfs_trans_log_quotaoff_item(tp, qoffi);
> >  
> > +	/*
> > +	 * XXX: partly open coded relog of the start item to ensure no relogging
> > +	 * after this point.
> > +	 */
> > +	clear_bit(XFS_LI_RELOG, &startqoff->qql_item.li_flags);
> > +	if (xfs_cil_relog_steal(mp->m_log, &startqoff->qql_item)) {
> > +		xfs_trans_add_item(tp, &startqoff->qql_item);
> > +		xfs_trans_log_quotaoff_item(tp, startqoff);
> > +	}
> 
> Urk. :)
> 
> > @@ -863,6 +864,9 @@ xfs_trans_committed_bulk(
> >  		if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
> >  			continue;
> >  
> > +		if (test_bit(XFS_LI_RELOG, &lip->li_flags))
> > +			xfs_cil_relog_item(lip->li_mountp->m_log, lip);
> 
> Ok, so this is moving the item from commit back onto the relog list.
> This is going to hammer the relog lock on workloads where there is a
> lot of transaction concurrency and a substantial number of items on
> the relog list....
> 

Hmm.. so concurrency is somewhat limited on this end of things by this
being in log completion context. I suspect you are referring to the
broader concurrency between log completion and incoming transaction
commits, which I think is a fair point. I'd have to think more about
that when I got to polishing the relog processing stuff. This most
likely would require some balancing between reducing contention and
maintaining effectiveness.

> ----
> 
> Ok, so I mentioned that we removed the hooks that could have been
> used for this some time ago.
> 
> What we actually want here is a notification that an object needs
> relogging. I can see how appealing the concept of automatically
> relogging is, but I'm unconvinced that we can make it work,
> especially when there aren't sufficient reservations to relog
> the items that need relogging.
> 

I'm not unconvinced it can be made to work, but I agree that more work
is required to demonstrate that. I'm open to ideas for cleaner
solutions. That's why I posted this in its current form. :)

> commit d420e5c810bce5debce0238021b410d0ef99cf08
> Author: Dave Chinner <dchinner@xxxxxxxxxx>
> Date:   Tue Oct 15 09:17:53 2013 +1100
> 
>     xfs: remove unused transaction callback variables
>     
>     We don't do callbacks at transaction commit time, no do we have any
>     infrastructure to set up or run such callbacks, so remove the
>     variables and typedefs for these operations. If we ever need to add
>     callbacks, we can reintroduce the variables at that time.
>     
>     Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>     Reviewed-by: Ben Myers <bpm@xxxxxxx>
>     Signed-off-by: Ben Myers <bpm@xxxxxxx>
> 
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 09cf40b89e8c..71c835e9e810 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -85,18 +85,11 @@ struct xfs_item_ops {
>  #define XFS_ITEM_LOCKED                2
>  #define XFS_ITEM_FLUSHING      3
>  
> -/*
> - * This is the type of function which can be given to xfs_trans_callback()
> - * to be called upon the transaction's commit to disk.
> - */
> -typedef void (*xfs_trans_callback_t)(struct xfs_trans *, void *);
> -
>  /*
>   * This is the structure maintained for every active transaction.
>   */
>  typedef struct xfs_trans {
>         unsigned int            t_magic;        /* magic number */
> -       xfs_log_callback_t      t_logcb;        /* log callback struct */
>         unsigned int            t_type;         /* transaction type */
>         unsigned int            t_log_res;      /* amt of log space resvd */
>         unsigned int            t_log_count;    /* count for perm log res */
> 
> That's basically the functionality we want here - when the log item
> hits the journal, we want a callback to tell us so we can relog it
> ourselves if deemed necessary. i.e. it's time to reintroduce the
> transaction/log commit callback infrastructure...
> 

As noted earlier, this old mechanism is new to me.. This commit just
appears to remove an unused hook and the surrounding commits don't look
related. Was this previously used for something in particular?

> This would get used in conjunction with a permanent transaction
> reservation, allowing the owner of the object to keep it locked over
> transaction commit (while whatever work is running between intent
> and done), and the transaction commit turns into a trans_roll. Then
> we reserve space for the next relogging commit, and go about our
> business.
> 

Hmm.. IIUC the original intent committer rolls instead of commits to
reacquire relogging reservation, but what if the submitter needs to
allocate and commit unrelated transactions before it gets to the point
of completing the intent? I was originally thinking about "pre-reserving
and donating" relogging reservation along these lines, but I thought
reserving transactions like this (first for the intent+relog, then for
whatever random transaction is next) was a potential deadlock vector.
Perhaps it's not if the associated items have all been committed to the
log subsystem. It also seemed unnecessary given our current design of
worst case reservation sizes, but that's a separate topic and may only
apply to intents (which are small compared to metadata associated with
generic log items).

So are you suggesting ownership of the committed transaction transfers
to the log subsystem somehow or another? For example, we internally roll
and queue the _transaction_ (not the log item) for relogging while from
the caller's perspective the transaction has committed? Or the
additional reservation is pulled off and the transaction commits (i.e.
frees)? Or something else?

FWIW, I'm also wondering if this lends itself to some form of batching
for if we get to the point of relogging a large number of small items.
For example, consider a single dedicated/relogging transaction for many
small (or related) intents vs. N independent transactions processing in
the background. This is something that could came later once the
fundamentals are worked out, however.

> On notification of the intent being logged, the relogging work
> (which already has a transaction and a reservation) can be dumped to
> a workqueue (to get it out of iclog completion context) and the item
> relogged and the transaction rolled and re-reserved again.
> 
> This would work with any type of log item, not just intents, and it
> doesn't have any reservation "stealing" requirements. ANd because
> we've pre-reserved the log space for the relogging transaction, the
> relogging callback will never get hung up on it's own items
> preventing it from getting more log space.
> 
> i.e. we essentially make use of the existing relogging mechanism,
> just with callbacks to trigger periodic relogging of the items for
> long running transactions...
> 
> Thoughts?
> 

All in all this sounds interesting. I still need to grok the transaction
reservation/ownership semantics you propose re: the questions above, but
I do like the prospect of reusing existing mechanisms and thus being
able to more easily handle generic log items. Thanks for the
feedback...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx