Re: [PATCH] xfs: remote attribute headers contain an invalid LSN

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 24 Jun 2015 09:12:28 -0400

On Mon, Jun 22, 2015 at 02:46:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> In recent testing, a system that crashed failed log recovery on
> restart with a bad symlink buffer magic number:
> 
> XFS (vda): Starting recovery (logdev: internal)
> XFS (vda): Bad symlink block magic!
> XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060
> 
> On examination of the log via xfs_logprint, none of the symlink
> buffers in the log had a bad magic number, nor were any other types
> of buffer log format headers mis-identified as symlink buffers.
> Tracing was used to find the buffer the kernel was tripping over,
> and xfs_db identified it's contents as:
> 
> 000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
> 020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
> 040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
> 060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
> .....
> 
> This is a remote attribute buffer, which are notable in that they
> are not logged but are instead written synchronously by the remote
> attribute code so that they exist on disk before the attribute
> transactions are committed to the journal.
> 
> The above remote attribute block has an invalid LSN in it - cycle
> 0xd002000, block 0 - which means when log recovery comes along to
> determine if the transaction that writes to the underlying block
> should be replayed, it sees a block that has a future LSN and so
> does not replay the buffer data in the transaction. Instead, it
> validates the buffer magic number and attaches the buffer verifier
> to it.  It is this buffer magic number check that is failing in the
> above assert, indicating that we skipped replay due to the LSN of
> the underlying buffer.
> 
> The problem here is that the remote attribute buffers cannot have a
> valid LSN placed into them, because the transaction that contains
> the attribute tree pointer changes and the block allocation that the
> attribute data is being written to hasn't yet been committed. Hence
> the LSN field int eh attribute block is completely unwritten,
> thereby leaving the underlying contents of the block in the LSN
> field. It could have any value, and hence a future overwrite of the
> block by log recovery may or may not work correctly.
> 
> Fix this by always writing an invalid LSN to the remote attribute
> block, as any buffer in log recovery that needs to write over the
> remote attribute should occur. We are protected from having old data
> written over the attribute by the fact that freeing the block before
> the remote attribute is written will result in the buffer being
> marked stale in the log and so all changes prior to the buffer stale
> transaction will be cancelled by log recovery.
> 
> Hence it is safe to ignore the LSN in the case or synchronously
> written, unlogged metadata such as remote attribute blocks, and to
> ensure we do that correctly, we need to write an invalid LSN to all
> remote attribute blocks to trigger immediate recovery of metadata
> that is written over the top.
> 
> As a further protection for filesystems that may already have remote
> attribute blocks with bad LSNs on disk, change the log recovery code
> to always trigger immediate recovery of metadata over remote
> attribute blocks.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/libxfs/xfs_attr_remote.c | 30 ++++++++++++++++++++++++------
>  fs/xfs/xfs_log_recover.c        | 11 ++++++++---
>  2 files changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
> index 20de88d..3854439 100644
> --- a/fs/xfs/libxfs/xfs_attr_remote.c
> +++ b/fs/xfs/libxfs/xfs_attr_remote.c
> @@ -159,11 +159,10 @@ xfs_attr3_rmt_write_verify(
>  	struct xfs_buf	*bp)
>  {
>  	struct xfs_mount *mp = bp->b_target->bt_mount;
> -	struct xfs_buf_log_item	*bip = bp->b_fspriv;
> +	int		blksize = mp->m_attr_geo->blksize;
>  	char		*ptr;
>  	int		len;
>  	xfs_daddr_t	bno;
> -	int		blksize = mp->m_attr_geo->blksize;
>  
>  	/* no verification of non-crc buffers */
>  	if (!xfs_sb_version_hascrc(&mp->m_sb))
> @@ -175,16 +174,22 @@ xfs_attr3_rmt_write_verify(
>  	ASSERT(len >= blksize);
>  
>  	while (len > 0) {
> +		struct xfs_attr3_rmt_hdr *rmt = (struct xfs_attr3_rmt_hdr *)ptr;
> +
>  		if (!xfs_attr3_rmt_verify(mp, ptr, blksize, bno)) {
>  			xfs_buf_ioerror(bp, -EFSCORRUPTED);
>  			xfs_verifier_error(bp);
>  			return;
>  		}
> -		if (bip) {
> -			struct xfs_attr3_rmt_hdr *rmt;
>  
> -			rmt = (struct xfs_attr3_rmt_hdr *)ptr;
> -			rmt->rm_lsn = cpu_to_be64(bip->bli_item.li_lsn);
> +		/*
> +		 * Ensure we aren't writing bogus LSNs to disk. See
> +		 * xfs_attr3_rmt_hdr_set() for the explanation.
> +		 */
> +		if (rmt->rm_lsn != cpu_to_be64(NULLCOMMITLSN)) {
> +			xfs_buf_ioerror(bp, -EFSCORRUPTED);
> +			xfs_verifier_error(bp);
> +			return;
>  		}
>  		xfs_update_cksum(ptr, blksize, XFS_ATTR3_RMT_CRC_OFF);
>  
> @@ -221,6 +226,19 @@ xfs_attr3_rmt_hdr_set(
>  	rmt->rm_owner = cpu_to_be64(ino);
>  	rmt->rm_blkno = cpu_to_be64(bno);
>  
> +	/*
> +	 * Remote attribute blocks are written synchronously, so we don't
> +	 * have an LSN that we can stamp in them that makes any sense to log
> +	 * recovery. To ensure that log recovery handles overwrites of these
> +	 * blocks sanely (i.e. once they've been freed and reallocated as some
> +	 * other type of metadata) we need to ensure that the LSN has a value
> +	 * that tells log recovery to ignore the LSN and overwrite the buffer
> +	 * with whatever is in it's log. To do this, we use the magic
> +	 * NULLCOMMITLSN to indicate that the LSN is invalid.
> +	 * of -1.

Looks like a stale last line in the comment above. The rest of the code
all looks fine to me, but a question to help me grok the bigger picture:

We allocate these attr blocks in xfs_attr_rmtval_set(). If the writes
are synchronous, shouldn't those allocation transactions be synchronous
as well? IOW, what prevents our synchronous writes going to freed blocks
or blocks allocated to something else (if a block were freed,
reallocated as an attr block and sync. written without making it to the
ail) due to a crash immediately following the sync. write?

Brian

> +	 */
> +	rmt->rm_lsn = cpu_to_be64(NULLCOMMITLSN);
> +
>  	return sizeof(struct xfs_attr3_rmt_hdr);
>  }
>  
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 01dd228..480ebba 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -1886,9 +1886,14 @@ xlog_recover_get_buf_lsn(
>  		uuid = &((struct xfs_dir3_blk_hdr *)blk)->uuid;
>  		break;
>  	case XFS_ATTR3_RMT_MAGIC:
> -		lsn = be64_to_cpu(((struct xfs_attr3_rmt_hdr *)blk)->rm_lsn);
> -		uuid = &((struct xfs_attr3_rmt_hdr *)blk)->rm_uuid;
> -		break;
> +		/*
> +		 * Remote attr blocks are written synchronously, rather than
> +		 * being logged. That means they do not contain a valid LSN
> +		 * (i.e. transactionally ordered) in them, and hence any time we
> +		 * see a buffer to replay over the top of a remote attribute
> +		 * block we should simply do so.
> +		 */
> +		goto recover_immediately;
>  	case XFS_SB_MAGIC:
>  		lsn = be64_to_cpu(((struct xfs_dsb *)blk)->sb_lsn);
>  		uuid = &((struct xfs_dsb *)blk)->sb_uuid;
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs