Re: [PATCH 8/8] xfs: intent item whiteouts

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Apr 27, 2022 at 12:22:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> When we log modifications based on intents, we add both intent
> and intent done items to the modification being made. These get
> written to the log to ensure that the operation is re-run if the
> intent done is not found in the log.
> 
> However, for operations that complete wholly within a single
> checkpoint, the change in the checkpoint is atomic and will never
> need replay. In this case, we don't need to actually write the
> intent and intent done items to the journal because log recovery
> will never need to manually restart this modification.
> 
> Log recovery currently handles intent/intent done matching by
> inserting the intent into the AIL, then removing it when a matching
> intent done item is found. Hence for all the intent-based operations
> that complete within a checkpoint, we spend all that time parsing
> the intent/intent done items just to cancel them and do nothing with
> them.
> 
> Hence it follows that the only time we actually need intents in the
> log is when the modification crosses checkpoint boundaries in the
> log and so may only be partially complete in the journal. Hence if
> we commit and intent done item to the CIL and the intent item is in
> the same checkpoint, we don't actually have to write them to the
> journal because log recovery will always cancel the intents.
> 
> We've never really worried about the overhead of logging intents
> unnecessarily like this because the intents we log are generally
> very much smaller than the change being made. e.g. freeing an extent
> involves modifying at lease two freespace btree blocks and the AGF,
> so the EFI/EFD overhead is only a small increase in space and
> processing time compared to the overall cost of freeing an extent.

Question: If we whiteout enough intent items, does that make it possible
to cram more updates into a checkpoint?

Are changes required to the existing intent item code to support
whiteouts, or does the log code autodetect an *I/*D pair in the same
checkpoint and elide them automatically?

(I might be fishing here for "Does this make generic/447 faster when it
deletes the million extents?")

> However, delayed attributes change this cost equation dramatically,
> especially for inline attributes. In the case of adding an inline
> attribute, we only log the inode core and attribute fork at present.
> With delayed attributes, we now log the attr intent which includes
> the name and value, the inode core adn attr fork, and finally the
> attr intent done item. We increase the number of items we log from 1
> to 3, and the number of log vectors (regions) goes up from 3 to 7.
> Hence we tripple the number of objects that the CIL has to process,
> and more than double the number of log vectors that need to be
> written to the journal.
> 
> At scale, this means delayed attributes cause a non-pipelined CIL to
> become CPU bound processing all the extra items, resulting in a > 40%
> performance degradation on 16-way file+xattr create worklaods.
> Pipelining the CIL (as per 5.15) reduces the performance degradation
> to 20%, but now the limitation is the rate at which the log items
> can be written to the iclogs and iclogs be dispatched for IO and
> completed.
> 
> Even log IO completion is slowed down by these intents, because it
> now has to process 3x the number of items in the checkpoint.
> Processing completed intents is especially inefficient here, because
> we first insert the intent into the AIL, then remove it from the AIL
> when the intent done is processed. IOWs, we are also doing expensive
> operations in log IO completion we could completely avoid if we
> didn't log completed intent/intent done pairs.
> 
> Enter log item whiteouts.
> 
> When an intent done is committed, we can check to see if the
> associated intent is in the same checkpoint as we are currently
> committing the intent done to. If so, we can mark the intent log
> item with a whiteout and immediately free the intent done item
> rather than committing it to the CIL. We can basically skip the
> entire formatting and CIL insertion steps for the intent done item.
> 
> However, we cannot remove the intent item from the CIL at this point
> because the unlocked per-cpu CIL item lists do not permit removal
> without holding the CIL context lock exclusively. Transaction commit
> only holds the context lock shared, hence the best we can do is mark
> the intent item with a whiteout so that the CIL push can release it
> rather than writing it to the log.
> 
> This means we never write the intent to the log if the intent done
> has also been committed to the same checkpoint, but we'll always
> write the intent if the intent done has not been committed or has
> been committed to a different checkpoint. This will result in
> correct log recovery behaviour in all cases, without the overhead of
> logging unnecessary intents.
> 
> This intent whiteout concept is generic - we can apply it to all
> intent/intent done pairs that have a direct 1:1 relationship. The
> way deferred ops iterate and relog intents mean that all intents
> currently have a 1:1 relationship with their done intent, and hence
> we can apply this cancellation to all existing intent/intent done
> implementations.
> 
> For delayed attributes with a 16-way 64kB xattr create workload,
> whiteouts reduce the amount of journalled metadata from ~2.5GB/s
> down to ~600MB/s and improve the creation rate from 9000/s to
> 14000/s.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_bmap_item.c     |  2 +
>  fs/xfs/xfs_log_cil.c       | 80 ++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_refcount_item.c |  1 +
>  fs/xfs/xfs_rmap_item.c     |  2 +
>  fs/xfs/xfs_trace.h         |  3 ++
>  fs/xfs/xfs_trans.h         |  6 ++-
>  6 files changed, 89 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 59aa5f9bf769..670d074a71cc 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -39,6 +39,7 @@ STATIC void
>  xfs_bui_item_free(
>  	struct xfs_bui_log_item	*buip)
>  {
> +	kmem_free(buip->bui_item.li_lv_shadow);

Why is it necessary for log items to free their own shadow buffer?

>  	kmem_cache_free(xfs_bui_cache, buip);
>  }
>  
> @@ -199,6 +200,7 @@ xfs_bud_item_release(
>  {
>  	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
>  
> +	kmem_free(budp->bud_item.li_lv_shadow);
>  	xfs_bui_release(budp->bud_buip);
>  	kmem_cache_free(xfs_bud_cache, budp);
>  }
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0d8d092447ad..d894511428f2 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -476,7 +476,8 @@ xlog_cil_insert_format_items(
>  static void
>  xlog_cil_insert_items(
>  	struct xlog		*log,
> -	struct xfs_trans	*tp)
> +	struct xfs_trans	*tp,
> +	uint32_t		released_space)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
> @@ -526,6 +527,7 @@ xlog_cil_insert_items(
>  	}
>  	tp->t_ticket->t_curr_res -= len;
>  	ctx->space_used += len;
> +	ctx->space_used -= released_space;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> @@ -970,11 +972,16 @@ xlog_cil_build_trans_hdr(
>   * Pull all the log vectors off the items in the CIL, and remove the items from
>   * the CIL. We don't need the CIL lock here because it's only needed on the
>   * transaction commit side which is currently locked out by the flush lock.
> + *
> + * If a log item is marked with a whiteout, we do not need to write it to the
> + * journal and so we just move them to the whiteout list for the caller to
> + * dispose of appropriately.
>   */
>  static void
>  xlog_cil_build_lv_chain(
>  	struct xfs_cil		*cil,
>  	struct xfs_cil_ctx	*ctx,
> +	struct list_head	*whiteouts,
>  	uint32_t		*num_iovecs,
>  	uint32_t		*num_bytes)
>  {
> @@ -985,6 +992,13 @@ xlog_cil_build_lv_chain(
>  
>  		item = list_first_entry(&cil->xc_cil,
>  					struct xfs_log_item, li_cil);
> +
> +		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
> +			list_move(&item->li_cil, whiteouts);
> +			trace_xfs_cil_whiteout_skip(item);
> +			continue;
> +		}
> +
>  		list_del_init(&item->li_cil);
>  		if (!ctx->lv_chain)
>  			ctx->lv_chain = item->li_lv;
> @@ -1030,6 +1044,7 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	lvhdr = { NULL };
>  	xfs_csn_t		push_seq;
>  	bool			push_commit_stable;
> +	LIST_HEAD		(whiteouts);
>  
>  	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -1098,7 +1113,7 @@ xlog_cil_push_work(
>  	list_add(&ctx->committing, &cil->xc_committing);
>  	spin_unlock(&cil->xc_push_lock);
>  
> -	xlog_cil_build_lv_chain(cil, ctx, &num_iovecs, &num_bytes);
> +	xlog_cil_build_lv_chain(cil, ctx, &whiteouts, &num_iovecs, &num_bytes);
>  
>  	/*
>  	 * Switch the contexts so we can drop the context lock and move out
> @@ -1201,6 +1216,15 @@ xlog_cil_push_work(
>  	/* Not safe to reference ctx now! */
>  
>  	spin_unlock(&log->l_icloglock);
> +
> +	/* clean up log items that had whiteouts */
> +	while (!list_empty(&whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(&whiteouts,
> +						struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
>  	return;
>  
>  out_skip:
> @@ -1212,6 +1236,14 @@ xlog_cil_push_work(
>  out_abort_free_ticket:
>  	xfs_log_ticket_ungrant(log, ctx->ticket);
>  	ASSERT(xlog_is_shutdown(log));
> +	while (!list_empty(&whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(&whiteouts,
> +						struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
> +
>  	if (!ctx->commit_iclog) {
>  		xlog_cil_committed(ctx);
>  		return;
> @@ -1360,6 +1392,43 @@ xlog_cil_empty(
>  	return empty;
>  }
>  
> +/*
> + * If there are intent done items in this transaction and the related intent was
> + * committed in the current (same) CIL checkpoint, we don't need to write either
> + * the intent or intent done item to the journal as the change will be
> + * journalled atomically within this checkpoint. As we cannot remove items from
> + * the CIL here, mark the related intent with a whiteout so that the CIL push
> + * can remove it rather than writing it to the journal. Then remove the intent
> + * done item from the current transaction and release it so it doesn't get put
> + * into the CIL at all.
> + */
> +static uint32_t
> +xlog_cil_process_intents(
> +	struct xfs_cil		*cil,
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_log_item	*lip, *ilip, *next;
> +	uint32_t		len = 0;
> +
> +	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
> +		if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
> +			continue;
> +
> +		ilip = lip->li_ops->iop_intent(lip);
> +		if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
> +			continue;
> +		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
> +		trace_xfs_cil_whiteout_mark(ilip);
> +		len += ilip->li_lv->lv_bytes;
> +		kmem_free(ilip->li_lv);
> +		ilip->li_lv = NULL;
> +
> +		xfs_trans_del_item(lip);
> +		lip->li_ops->iop_release(lip);
> +	}
> +	return len;
> +}
> +
>  /*
>   * Commit a transaction with the given vector to the Committed Item List.
>   *
> @@ -1382,6 +1451,7 @@ xlog_cil_commit(
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_log_item	*lip, *next;
> +	uint32_t		released_space = 0;
>  
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1393,7 +1463,11 @@ xlog_cil_commit(
>  	/* lock out background commit */
>  	down_read(&cil->xc_ctx_lock);
>  
> -	xlog_cil_insert_items(log, tp);
> +	if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
> +		released_space = xlog_cil_process_intents(cil, tp);
> +
> +	xlog_cil_insert_items(log, tp, released_space);
> +	tp->t_ticket->t_curr_res += released_space;

I'm a little tired, so why isn't this adjustment a part of
xlog_cil_insert_items?  A similar adjustment is made to
ctx->space_used to release the unused space back to the committing tx,
right?

--D

>  
>  	if (regrant && !xlog_is_shutdown(log))
>  		xfs_log_ticket_regrant(log, tp->t_ticket);
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 1b82b818f515..3fee1090e9a8 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -35,6 +35,7 @@ STATIC void
>  xfs_cui_item_free(
>  	struct xfs_cui_log_item	*cuip)
>  {
> +	kmem_free(cuip->cui_item.li_lv_shadow);
>  	if (cuip->cui_format.cui_nextents > XFS_CUI_MAX_FAST_EXTENTS)
>  		kmem_free(cuip);
>  	else
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index 99dfb3ae7e9c..546bd824cdf7 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -35,6 +35,7 @@ STATIC void
>  xfs_rui_item_free(
>  	struct xfs_rui_log_item	*ruip)
>  {
> +	kmem_free(ruip->rui_item.li_lv_shadow);
>  	if (ruip->rui_format.rui_nextents > XFS_RUI_MAX_FAST_EXTENTS)
>  		kmem_free(ruip);
>  	else
> @@ -228,6 +229,7 @@ xfs_rud_item_release(
>  {
>  	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
>  
> +	kmem_free(rudp->rud_item.li_lv_shadow);
>  	xfs_rui_release(rudp->rud_ruip);
>  	kmem_cache_free(xfs_rud_cache, rudp);
>  }
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index e1197f9ad97e..75934e3c3f55 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1332,6 +1332,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
>  
>  DECLARE_EVENT_CLASS(xfs_ail_class,
>  	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index d72a5995d33e..9561f193e7e1 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -55,13 +55,15 @@ struct xfs_log_item {
>  #define	XFS_LI_IN_AIL	0
>  #define	XFS_LI_ABORTED	1
>  #define	XFS_LI_FAILED	2
> -#define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
> +#define	XFS_LI_DIRTY	3
> +#define	XFS_LI_WHITEOUT	4
>  
>  #define XFS_LI_FLAGS \
>  	{ (1u << XFS_LI_IN_AIL),	"IN_AIL" }, \
>  	{ (1u << XFS_LI_ABORTED),	"ABORTED" }, \
>  	{ (1u << XFS_LI_FAILED),	"FAILED" }, \
> -	{ (1u << XFS_LI_DIRTY),		"DIRTY" }
> +	{ (1u << XFS_LI_DIRTY),		"DIRTY" }, \
> +	{ (1u << XFS_LI_WHITEOUT),	"WHITEOUT" }
>  
>  struct xfs_item_ops {
>  	unsigned flags;
> -- 
> 2.35.1
> 



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux