Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Wed, 23 Sep 2020 08:48:16 +0100

On Wed, Sep 16, 2020 at 08:29:07PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> 
> When we replay unfinished intent items that have been recovered from the
> log, it's possible that the replay will cause the creation of more
> deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> recovery should replay deferred ops in order"), later work items have an
> implicit ordering dependency on earlier work items.  Therefore, recovery
> must replay the items (both recovered and created) in the same order
> that they would have been during normal operation.
> 
> For log recovery, we enforce this ordering by using an empty transaction
> to collect deferred ops that get created in the process of recovering a
> log intent item to prevent them from being committed before the rest of
> the recovered intent items.  After we finish committing all the
> recovered log items, we allocate a transaction with an enormous block
> reservation, splice our huge list of created deferred ops into that
> transaction, and commit it, thereby finishing all those ops.
> 
> This is /really/ hokey -- it's the one place in XFS where we allow
> nested transactions; the splicing of the defer ops list is is inelegant
> and has to be done twice per recovery function; and the broken way we
> handle inode pointers and block reservations cause subtle use-after-free
> and allocator problems that will be fixed by this patch and the two
> patches after it.
> 
> Therefore, replace the hokey empty transaction with a structure designed
> to capture each chain of deferred ops that are created as part of
> recovering a single unfinished log intent.  Finally, refactor the loop
> that replays those chains to do so using one transaction per chain.

Wow, that is a massive change, but I really like the direction.

> +int
>  xfs_defer_capture(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_capture	**dfcp)
>  {
> +	struct xfs_defer_capture	*dfc = NULL;
> +
> +	if (!list_empty(&tp->t_dfops)) {
> +		dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
> +
> +		xfs_defer_create_intents(tp);
> +
> +		INIT_LIST_HEAD(&dfc->dfc_list);
> +		INIT_LIST_HEAD(&dfc->dfc_dfops);
> +
> +		/* Move the dfops chain and transaction state to the freezer. */
> +		list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
> +		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> +		xfs_defer_reset(tp);
> +	}
> +
> +	*dfcp = dfc;
> +	return xfs_trans_commit(tp);

Don't we need to free the structure if xfs_trans_commit fails?

Wouldn't a saner calling convention would to pass the list_head into
->iop_recover and this function, and just add it to the list here
instead of passing it around as an output pointer argument.

> +/*
> + * Freeze any deferred ops and commit the transaction.  This is the last step
> + * needed to finish a log intent item that we recovered from the log.
> + */
> +int
> +xlog_recover_trans_commit(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_capture	**dfcp)
> +{
> +	return xfs_defer_capture(tp, dfcp);
> +}

I don't think this wrapper is very helpful.  I think a direct call
to xfs_defer_capture captures the intent better than pretending it just
is another commit.

> +
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> +	struct xfs_mount	*mp,
> +	struct list_head	*dfops_freezers)
>  {
> +	struct xfs_defer_capture *dfc, *next;
>  	struct xfs_trans	*tp;
>  	int64_t			freeblks;
> +	uint64_t		resblks;
> +	int			error = 0;
>  
> +	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
> +		/*
> +		 * We're finishing the defer_ops that accumulated as a result
> +		 * of recovering unfinished intent items during log recovery.
> +		 * We reserve an itruncate transaction because it is the
> +		 * largest permanent transaction type.  Since we're the only
> +		 * user of the fs right now, take 93% (15/16) of the available
> +		 * free blocks.  Use weird math to avoid a 64-bit division.
> +		 */
> +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> +		if (freeblks <= 0) {
> +			error = -ENOSPC;
> +			break;
> +		}
>  
> +		resblks = min_t(uint64_t, UINT_MAX, freeblks);
> +		resblks = (resblks * 15) >> 4;
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> +				0, XFS_TRANS_RESERVE, &tp);

This isn't actually new, but how did we end up with the truncate
reservation here?

> +		if (error)
> +			break;
> +
> +		/* transfer all collected dfops to this transaction */
> +		list_del_init(&dfc->dfc_list);
> +		xfs_defer_continue(dfc, tp);
> +
> +		error = xfs_trans_commit(tp);
> +		xfs_defer_capture_free(mp, dfc);
> +		if (error)
> +			break;

I'd just do direct returns and return 0 outside the loop, but that is
just nitpicking.

>  /* Is this log item a deferred action intent? */
> @@ -2528,28 +2566,16 @@ STATIC int
>  xlog_recover_process_intents(
>  	struct xlog		*log)
>  {
> +	LIST_HEAD(dfops_freezers);
>  	struct xfs_ail_cursor	cur;
> +	struct xfs_defer_capture *freezer = NULL;
>  	struct xfs_log_item	*lip;
>  	struct xfs_ail		*ailp;
> +	int			error = 0;
>  #if defined(DEBUG) || defined(XFS_WARN)
>  	xfs_lsn_t		last_lsn;
>  #endif
>  
>  	ailp = log->l_ailp;
>  	spin_lock(&ailp->ail_lock);
>  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> @@ -2578,26 +2604,31 @@ xlog_recover_process_intents(
>  
>  		/*
>  		 * NOTE: If your intent processing routine can create more
> +		 * deferred ops, you /must/ attach them to the freezer in
>  		 * this routine or else those subsequent intents will get
>  		 * replayed in the wrong order!
>  		 */
>  		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
>  			spin_unlock(&ailp->ail_lock);
> +			error = lip->li_ops->iop_recover(lip, &freezer);
>  			spin_lock(&ailp->ail_lock);
>  		}
> +		if (freezer) {
> +			list_add_tail(&freezer->dfc_list, &dfops_freezers);
> +			freezer = NULL;
> +		}
>  		if (error)
> +			break;
> +
>  		lip = xfs_trans_ail_cursor_next(ailp, &cur);

The code flow looks really very odd (though correct) to me, what do you
think of something like:

  	spin_lock(&ailp->ail_lock);
	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
	     lip;
	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
			continue;

		spin_unlock(&ailp->ail_lock);
		error = lip->li_ops->iop_recover(lip, &dfops_freezers);
		spin_lock(&ailp->ail_lock);
		if (error)
			break;
	}
  	xfs_trans_ail_cursor_done(&cur);
  	spin_unlock(&ailp->ail_lock);