Re: [PATCH 11/11 v2] xfs: limit iclog tail updates

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 28, 2021 at 09:44:10AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> From the department of "generic/482 keeps on giving", we bring you
> another tail update race condition:
> 
> iclog:
> 	S1			C1
> 	+-----------------------+-----------------------+
> 				 S2			EOIC
> 
> Two checkpoints in a single iclog. One is complete, the other just
> contains the start record and overruns into a new iclog.
> 
> Timeline:
> 
> Before S1:	Cache flush, log tail = X
> At S1:		Metadata stable, write start record and checkpoint
> At C1:		Write commit record, set NEED_FUA
> 		Single iclog checkpoint, so no need for NEED_FLUSH
> 		Log tail still = X, so no need for NEED_FLUSH
> 
> After C1,
> Before S2:	Cache flush, log tail = X
> At S2:		Metadata stable, write start record and checkpoint
> After S2:	Log tail moves to X+1
> At EOIC:	End of iclog, more journal data to write
> 		Releases iclog
> 		Not a commit iclog, so no need for NEED_FLUSH
> 		Writes log tail X+1 into iclog.
> 
> At this point, the iclog has tail X+1 and NEED_FUA set. There has
> been no cache flush for the metadata between X and X+1, and the
> iclog writes the new tail permanently to the log. THis is sufficient
> to violate on disk metadata/journal ordering.
> 
> We have two options here. The first is to detect this case in some
> manner and ensure that the partial checkpoint write sets NEED_FLUSH
> when the iclog is already marked NEED_FUA and the log tail changes.
> This seems somewhat fragile and quite complex to get right, and it
> doesn't actually make it obvious what underlying problem it is
> actually addressing from reading the code.
> 
> The second option seems much cleaner to me, because it is derived
> directly from the requirements of the C1 commit record in the iclog.
> That is, when we write this commit record to the iclog, we've
> guaranteed that the metadata/data ordering is correct for tail
> update purposes. Hence if we only write the log tail into the iclog
> for the *first* commit record rather than the log tail at the last
> release, we guarantee that the log tail does not move past where the
> the first commit record in the log expects it to be.
> 
> IOWs, taking the first option means that replay of C1 becomes
> dependent on future operations doing the right thing, not just the
> C1 checkpoint itself doing the right thing. This makes log recovery
> almost impossible to reason about because now we have to take into
> account what might or might not have happened in the future when
> looking at checkpoints in the log rather than just having to
> reconstruct the past...
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>

Looks good now,
Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> ---
> Version 2:
> - fix xlog_verify_tail_lsn() debug code to look at the iclog tail
>   lsn rather than the current tail lsn that we might not have
>   written into the iclog.
> 
>  fs/xfs/xfs_log.c | 50 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 37 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 1c328efdca66..60ac5fd63f1e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -78,13 +78,12 @@ xlog_verify_iclog(
>  STATIC void
>  xlog_verify_tail_lsn(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog,
> -	xfs_lsn_t		tail_lsn);
> +	struct xlog_in_core	*iclog);
>  #else
>  #define xlog_verify_dest_ptr(a,b)
>  #define xlog_verify_grant_tail(a)
>  #define xlog_verify_iclog(a,b,c)
> -#define xlog_verify_tail_lsn(a,b,c)
> +#define xlog_verify_tail_lsn(a,b)
>  #endif
>  
>  STATIC int
> @@ -489,12 +488,31 @@ xfs_log_reserve(
>  
>  /*
>   * Flush iclog to disk if this is the last reference to the given iclog and the
> - * it is in the WANT_SYNC state.  If the caller passes in a non-zero
> - * @old_tail_lsn and the current log tail does not match, there may be metadata
> - * on disk that must be persisted before this iclog is written.  To satisfy that
> - * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
> - * iclog with the new log tail value.
> + * it is in the WANT_SYNC state.
> + *
> + * If the caller passes in a non-zero @old_tail_lsn and the current log tail
> + * does not match, there may be metadata on disk that must be persisted before
> + * this iclog is written.  To satisfy that requirement, set the
> + * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
> + * log tail value.
> + *
> + * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
> + * log tail is updated correctly. NEED_FUA indicates that the iclog will be
> + * written to stable storage, and implies that a commit record is contained
> + * within the iclog. We need to ensure that the log tail does not move beyond
> + * the tail that the first commit record in the iclog ordered against, otherwise
> + * correct recovery of that checkpoint becomes dependent on future operations
> + * performed on this iclog.
> + *
> + * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
> + * current tail into iclog. Once the iclog tail is set, future operations must
> + * not modify it, otherwise they potentially violate ordering constraints for
> + * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
> + * the iclog will get zeroed on activation of the iclog after sync, so we
> + * always capture the tail lsn on the iclog on the first NEED_FUA release
> + * regardless of the number of active reference counts on this iclog.
>   */
> +
>  int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
> @@ -519,6 +537,10 @@ xlog_state_release_iclog(
>  
>  		if (old_tail_lsn && tail_lsn != old_tail_lsn)
>  			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +
> +		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
> +		    !iclog->ic_header.h_tail_lsn)
> +			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  	}
>  
>  	if (!atomic_dec_and_test(&iclog->ic_refcnt))
> @@ -530,8 +552,9 @@ xlog_state_release_iclog(
>  	}
>  
>  	iclog->ic_state = XLOG_STATE_SYNCING;
> -	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> -	xlog_verify_tail_lsn(log, iclog, tail_lsn);
> +	if (!iclog->ic_header.h_tail_lsn)
> +		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> +	xlog_verify_tail_lsn(log, iclog);
>  	trace_xlog_iclog_syncing(iclog, _RET_IP_);
>  
>  	spin_unlock(&log->l_icloglock);
> @@ -2579,6 +2602,7 @@ xlog_state_activate_iclog(
>  	memset(iclog->ic_header.h_cycle_data, 0,
>  		sizeof(iclog->ic_header.h_cycle_data));
>  	iclog->ic_header.h_lsn = 0;
> +	iclog->ic_header.h_tail_lsn = 0;
>  }
>  
>  /*
> @@ -3614,10 +3638,10 @@ xlog_verify_grant_tail(
>  STATIC void
>  xlog_verify_tail_lsn(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog,
> -	xfs_lsn_t		tail_lsn)
> +	struct xlog_in_core	*iclog)
>  {
> -    int blocks;
> +	xfs_lsn_t	tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
> +	int		blocks;
>  
>      if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
>  	blocks =



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux