On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> An experimental mechanism to automatically relog specified log
> items. This is useful for long running operations, like quotaoff,
> that otherwise risk deadlocking on log space usage.
> 
> Not-signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
> 
> Hi all,
> 
> This is an experiment that came out of a discussion with Darrick[1] on
> some design bits of the latest online btree repair work. Specifically,
> it logs free intents in the same transaction as block allocation to
> guard against inconsistency in the event of a crash during the repair
> sequence. These intents happen to pin the log tail for an indeterminate
> amount of time. Darrick was looking for some form of auto relog
> mechanism to facilitate this general approach. It occurred to us that
> this is the same problem we've had with quotaoff for some time, so I
> figured it might be worth prototyping something against that to try and
> prove the concept.

Interesting idea. :)

> Note that this is RFC because the code and interfaces are a complete
> mess and this is broken in certain ways. This occasionally triggers log
> reservation overrun shutdowns because transaction reservation checking
> has not yet been added, the cancellation path is overkill, etc. IOW, the
> purpose of this patch is purely to test a concept.

*nod*

> The concept is essentially to flag a log item for relogging on first
> transaction commit such that once it commits to the AIL, the next
> transaction that happens to commit with sufficient unused reservation
> opportunistically relogs the item to the current CIL context. For the
> log intent case, the transaction that commits the done item is required
> to cancel the relog state of the original intent to prevent further
> relogging.

Makes sense, but it seems like we removed the hook that would be
used by transactions to implement their own relogging on CIL commit
some time ago because nothing had used it for 15+ years....

> In practice, this allows a log item to automatically roll through CIL
> checkpoints and not pin the tail of the log while something like a
> quotaoff is running for a potentially long period of time. This is
> applied to quotaoff and focused testing shows that it avoids the
> associated deadlock.

Hmmm. How do we deal with multiple identical intents being found in
checkpoints with different LSNs in log recovery?

> Thoughts, reviews, flames appreciated.
> 
> [1] https://lore.kernel.org/linux-xfs/20191018143856.GA25763@bfoster/
> 
>  fs/xfs/xfs_log_cil.c     | 69 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_priv.h    |  6 ++++
>  fs/xfs/xfs_qm_syscalls.c | 13 ++++++++
>  fs/xfs/xfs_trace.h       |  2 ++
>  fs/xfs/xfs_trans.c       |  4 +++
>  fs/xfs/xfs_trans.h       |  4 ++-
>  6 files changed, 97 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index a1204424a938..b2d8b2d54df6 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -75,6 +75,33 @@ xlog_cil_iovec_space(
>  		sizeof(uint64_t));
>  }
> 
> +static void
> +xlog_cil_relog_items(
> +	struct xlog		*log,
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_cil		*cil = log->l_cilp;
> +	struct xfs_log_item	*lip;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_DIRTY);
> +
> +	if (list_empty(&cil->xc_relog))
> +		return;
> +
> +	/* XXX: need to check trans reservation, process multiple items, etc. */
> +	spin_lock(&cil->xc_relog_lock);
> +	lip = list_first_entry_or_null(&cil->xc_relog, struct xfs_log_item, li_cil);
> +	if (lip)
> +		list_del_init(&lip->li_cil);
> +	spin_unlock(&cil->xc_relog_lock);
> +
> +	if (lip) {
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		trace_xfs_cil_relog(lip);
> +	}

I don't think this is safe - the log item needs to be locked to be
joined to a transaction. Hence this can race with whatever is
committing the done intent on the object and removing it from the
relog list and so the item could be clean (and potentially even
freed) by the time we go to add it to this transaction and mark it
dirty again...
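Even rechecking the relog flag under the lock - a hypothetical tweak
of the above, nothing more - only narrows that window rather than
closing it, because nothing pins the item once we drop
xc_relog_lock:

	spin_lock(&cil->xc_relog_lock);
	lip = list_first_entry_or_null(&cil->xc_relog,
				       struct xfs_log_item, li_cil);
	if (lip && test_bit(XFS_LI_RELOG, &lip->li_flags))
		list_del_init(&lip->li_cil);
	else
		lip = NULL;
	spin_unlock(&cil->xc_relog_lock);

	if (!lip)
		return;

	/* nothing holds a reference here, so @lip may already be freed */
	xfs_trans_add_item(tp, lip);

The real fix needs the item to be locked or referenced across the
join, which is where I'm going further down.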
> +}
> +
>  /*
>   * Allocate or pin log vector buffers for CIL insertion.
>   *
> @@ -1001,6 +1028,8 @@ xfs_log_commit_cil(
>  	struct xfs_log_item	*lip, *next;
>  	xfs_lsn_t		xc_commit_lsn;
> 
> +	xlog_cil_relog_items(log, tp);

Hmmm. This means that when there are relog items on the list, all
the transactional concurrency is going to be put onto the
xc_relog_lock spin lock. This is potentially a major lock contention
point, especially if there are lots of items that need relogging.

> +
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
>  	 * This ensures the allocation does not deadlock with a CIL
> @@ -1196,6 +1225,8 @@ xlog_cil_init(
>  	spin_lock_init(&cil->xc_push_lock);
>  	init_rwsem(&cil->xc_ctx_lock);
>  	init_waitqueue_head(&cil->xc_commit_wait);
> +	INIT_LIST_HEAD(&cil->xc_relog);
> +	spin_lock_init(&cil->xc_relog_lock);
> 
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
> @@ -1223,3 +1254,41 @@ xlog_cil_destroy(
>  	kmem_free(log->l_cilp);
>  }
> 
> +void
> +xfs_cil_relog_item(
> +	struct xlog		*log,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_cil		*cil = log->l_cilp;
> +
> +	ASSERT(test_bit(XFS_LI_RELOG, &lip->li_flags));
> +	ASSERT(list_empty(&lip->li_cil));

So this can't be used for things like inodes and buffers?

> +	spin_lock(&cil->xc_relog_lock);
> +	list_add_tail(&lip->li_cil, &cil->xc_relog);
> +	spin_unlock(&cil->xc_relog_lock);
> +
> +	trace_xfs_cil_relog_queue(lip);
> +}
> +
> +bool
> +xfs_cil_relog_steal(
> +	struct xlog		*log,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_cil		*cil = log->l_cilp;
> +	struct xfs_log_item	*pos, *ppos;
> +	bool			ret = false;
> +
> +	spin_lock(&cil->xc_relog_lock);
> +	list_for_each_entry_safe(pos, ppos, &cil->xc_relog, li_cil) {
> +		if (pos == lip) {
> +			list_del_init(&pos->li_cil);
> +			ret = true;
> +			break;
> +		}
> +	}
> +	spin_unlock(&cil->xc_relog_lock);

This is a remove operation, not a "steal" operation. But why are we
walking the relog list to find it? It would be much better to use a
flag to indicate what list the item is on, and then this just
becomes

	spin_lock(&cil->xc_relog_lock);
	if (test_and_clear_bit(XFS_LI_RELOG_LIST, &lip->li_flags))
		list_del_init(&lip->li_cil);
	spin_unlock(&cil->xc_relog_lock);
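And for symmetry, the queue side in xfs_cil_relog_item() would set
the same bit under the lock, so the flag and list membership can
never disagree. (XFS_LI_RELOG_LIST is a flag I just made up,
obviously.)

	spin_lock(&cil->xc_relog_lock);
	if (!test_and_set_bit(XFS_LI_RELOG_LIST, &lip->li_flags))
		list_add_tail(&lip->li_cil, &cil->xc_relog);
	spin_unlock(&cil->xc_relog_lock);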
> @@ -556,6 +558,16 @@ xfs_qm_log_quotaoff_end(
>  					flags & XFS_ALL_QUOTA_ACCT);
>  	xfs_trans_log_quotaoff_item(tp, qoffi);
> 
> +	/*
> +	 * XXX: partly open coded relog of the start item to ensure no relogging
> +	 * after this point.
> +	 */
> +	clear_bit(XFS_LI_RELOG, &startqoff->qql_item.li_flags);
> +	if (xfs_cil_relog_steal(mp->m_log, &startqoff->qql_item)) {
> +		xfs_trans_add_item(tp, &startqoff->qql_item);
> +		xfs_trans_log_quotaoff_item(tp, startqoff);
> +	}

Urk. :)

> @@ -863,6 +864,9 @@ xfs_trans_committed_bulk(
>  		if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
>  			continue;
> 
> +		if (test_bit(XFS_LI_RELOG, &lip->li_flags))
> +			xfs_cil_relog_item(lip->li_mountp->m_log, lip);

Ok, so this is moving the item from commit back onto the relog list.
This is going to hammer the relog lock on workloads where there is a
lot of transaction concurrency and a substantial number of items on
the relog list....

----

Ok, so I mentioned that we removed the hooks that could have been
used for this some time ago. What we actually want here is a
notification that an object needs relogging. I can see how appealing
the concept of automatically relogging is, but I'm unconvinced that
we can make it work, especially when there aren't sufficient
reservations to relog the items that need relogging.

commit d420e5c810bce5debce0238021b410d0ef99cf08
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Oct 15 09:17:53 2013 +1100

    xfs: remove unused transaction callback variables
    
    We don't do callbacks at transaction commit time, nor do we have
    any infrastructure to set up or run such callbacks, so remove the
    variables and typedefs for these operations.
    
    If we ever need to add callbacks, we can reintroduce the
    variables at that time.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Ben Myers <bpm@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 09cf40b89e8c..71c835e9e810 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -85,18 +85,11 @@ struct xfs_item_ops {
 #define XFS_ITEM_LOCKED		2
 #define XFS_ITEM_FLUSHING	3
 
-/*
- * This is the type of function which can be given to xfs_trans_callback()
- * to be called upon the transaction's commit to disk.
- */
-typedef void (*xfs_trans_callback_t)(struct xfs_trans *, void *);
-
 /*
  * This is the structure maintained for every active transaction.
  */
 typedef struct xfs_trans {
 	unsigned int		t_magic;	/* magic number */
-	xfs_log_callback_t	t_logcb;	/* log callback struct */
 	unsigned int		t_type;		/* transaction type */
 	unsigned int		t_log_res;	/* amt of log space resvd */
 	unsigned int		t_log_count;	/* count for perm log res */

That's basically the functionality we want here - when the log item
hits the journal, we want a callback to tell us so we can relog it
ourselves if deemed necessary. i.e. it's time to reintroduce the
transaction/log commit callback infrastructure...

This would get used in conjunction with a permanent transaction
reservation, allowing the owner of the object to keep it locked over
transaction commit (while whatever work is running between intent
and done), and the transaction commit turns into a trans_roll. Then
we reserve space for the next relogging commit, and go about our
business. On notification of the intent being logged, the relogging
work (which already has a transaction and a reservation) can be
dumped to a workqueue (to get it out of iclog completion context)
and the item relogged and the transaction rolled and re-reserved
again.

This would work with any type of log item, not just intents, and it
doesn't have any reservation "stealing" requirements. And because
we've pre-reserved the log space for the relogging transaction, the
relogging callback will never get hung up on its own items
preventing it from getting more log space.

i.e. we essentially make use of the existing relogging mechanism,
just with callbacks to trigger periodic relogging of the items for
long running transactions...
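To sketch what I mean - every name here is invented, error handling
is elided, and the callback signature just mirrors the old
xfs_trans_callback_t from the commit quoted above, so treat it as
hand-waving in C rather than a design:

static struct workqueue_struct	*xfs_relog_wq;	/* made up */

struct xfs_relog_state {
	struct xfs_trans	*tp;	/* permanent log reservation */
	struct xfs_log_item	*lip;	/* intent being relogged */
	struct work_struct	work;
};

/*
 * Reintroduced commit callback. Runs in iclog completion context,
 * so all it does is kick the real work off to a workqueue.
 */
static void
xfs_relog_callback(
	struct xfs_trans	*tp,
	void			*priv)
{
	struct xfs_relog_state	*rs = priv;

	queue_work(xfs_relog_wq, &rs->work);
}

static void
xfs_relog_worker(
	struct work_struct	*work)
{
	struct xfs_relog_state	*rs = container_of(work,
					struct xfs_relog_state, work);

	/* the owner still holds the item locked, so this is safe */
	xfs_trans_add_item(rs->tp, rs->lip);
	set_bit(XFS_LI_DIRTY, &rs->lip->li_flags);
	rs->tp->t_flags |= XFS_TRANS_DIRTY;

	/* commit, and take the reservation for the next relog */
	if (xfs_trans_roll(&rs->tp))
		return;	/* shutdown - the owner cleans up */

	/* re-arm the callback on the new transaction (made up interface) */
	xfs_trans_set_callback(rs->tp, xfs_relog_callback, rs);
}

No relog list, no global lock to hammer, no reservation stealing -
the worker always has its own log space because the roll re-reserves
it before the next relog can be needed, and cancellation is just the
owner committing the done intent from its own context.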
Thoughts?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx