On Mon, Jun 22, 2015 at 02:46:09PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > In recent testing, a system that crashed failed log recovery on > restart with a bad symlink buffer magic number: > > XFS (vda): Starting recovery (logdev: internal) > XFS (vda): Bad symlink block magic! > XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060 > > On examination of the log via xfs_logprint, none of the symlink > buffers in the log had a bad magic number, nor were any other types > of buffer log format headers mis-identified as symlink buffers. > Tracing was used to find the buffer the kernel was tripping over, > and xfs_db identified it's contents as: > > 000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e > 020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d > 040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d > 060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d > ..... > > This is a remote attribute buffer, which are notable in that they > are not logged but are instead written synchronously by the remote > attribute code so that they exist on disk before the attribute > transactions are committed to the journal. > > The above remote attribute block has an invalid LSN in it - cycle > 0xd002000, block 0 - which means when log recovery comes along to > determine if the transaction that writes to the underlying block > should be replayed, it sees a block that has a future LSN and so > does not replay the buffer data in the transaction. Instead, it > validates the buffer magic number and attaches the buffer verifier > to it. It is this buffer magic number check that is failing in the > above assert, indicating that we skipped replay due to the LSN of > the underlying buffer. > > The problem here is that the remote attribute buffers cannot have a > valid LSN placed into them, because the transaction that contains > the attribute tree pointer changes and the block allocation that the > attribute data is being written to hasn't yet been committed. Hence > the LSN field int eh attribute block is completely unwritten, > thereby leaving the underlying contents of the block in the LSN > field. It could have any value, and hence a future overwrite of the > block by log recovery may or may not work correctly. > > Fix this by always writing an invalid LSN to the remote attribute > block, as any buffer in log recovery that needs to write over the > remote attribute should occur. We are protected from having old data > written over the attribute by the fact that freeing the block before > the remote attribute is written will result in the buffer being > marked stale in the log and so all changes prior to the buffer stale > transaction will be cancelled by log recovery. > > Hence it is safe to ignore the LSN in the case or synchronously > written, unlogged metadata such as remote attribute blocks, and to > ensure we do that correctly, we need to write an invalid LSN to all > remote attribute blocks to trigger immediate recovery of metadata > that is written over the top. > > As a further protection for filesystems that may already have remote > attribute blocks with bad LSNs on disk, change the log recovery code > to always trigger immediate recovery of metadata over remote > attribute blocks. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_attr_remote.c | 30 ++++++++++++++++++++++++------ > fs/xfs/xfs_log_recover.c | 11 ++++++++--- > 2 files changed, 32 insertions(+), 9 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c > index 20de88d..3854439 100644 > --- a/fs/xfs/libxfs/xfs_attr_remote.c > +++ b/fs/xfs/libxfs/xfs_attr_remote.c > @@ -159,11 +159,10 @@ xfs_attr3_rmt_write_verify( > struct xfs_buf *bp) > { > struct xfs_mount *mp = bp->b_target->bt_mount; > - struct xfs_buf_log_item *bip = bp->b_fspriv; > + int blksize = mp->m_attr_geo->blksize; > char *ptr; > int len; > xfs_daddr_t bno; > - int blksize = mp->m_attr_geo->blksize; > > /* no verification of non-crc buffers */ > if (!xfs_sb_version_hascrc(&mp->m_sb)) > @@ -175,16 +174,22 @@ xfs_attr3_rmt_write_verify( > ASSERT(len >= blksize); > > while (len > 0) { > + struct xfs_attr3_rmt_hdr *rmt = (struct xfs_attr3_rmt_hdr *)ptr; > + > if (!xfs_attr3_rmt_verify(mp, ptr, blksize, bno)) { > xfs_buf_ioerror(bp, -EFSCORRUPTED); > xfs_verifier_error(bp); > return; > } > - if (bip) { > - struct xfs_attr3_rmt_hdr *rmt; > > - rmt = (struct xfs_attr3_rmt_hdr *)ptr; > - rmt->rm_lsn = cpu_to_be64(bip->bli_item.li_lsn); > + /* > + * Ensure we aren't writing bogus LSNs to disk. See > + * xfs_attr3_rmt_hdr_set() for the explanation. > + */ > + if (rmt->rm_lsn != cpu_to_be64(NULLCOMMITLSN)) { > + xfs_buf_ioerror(bp, -EFSCORRUPTED); > + xfs_verifier_error(bp); > + return; > } > xfs_update_cksum(ptr, blksize, XFS_ATTR3_RMT_CRC_OFF); > > @@ -221,6 +226,19 @@ xfs_attr3_rmt_hdr_set( > rmt->rm_owner = cpu_to_be64(ino); > rmt->rm_blkno = cpu_to_be64(bno); > > + /* > + * Remote attribute blocks are written synchronously, so we don't > + * have an LSN that we can stamp in them that makes any sense to log > + * recovery. To ensure that log recovery handles overwrites of these > + * blocks sanely (i.e. once they've been freed and reallocated as some > + * other type of metadata) we need to ensure that the LSN has a value > + * that tells log recovery to ignore the LSN and overwrite the buffer > + * with whatever is in it's log. To do this, we use the magic > + * NULLCOMMITLSN to indicate that the LSN is invalid. > + * of -1. Looks like a stale last line in the comment above. The rest of the code all looks fine to me, but a question to help me grok the bigger picture: We allocate these attr blocks in xfs_attr_rmtval_set(). If the writes are synchronous, shouldn't those allocation transactions be synchronous as well? IOW, what prevents our synchronous writes going to freed blocks or blocks allocated to something else (if a block were freed, reallocated as an attr block and sync. written without making it to the ail) due to a crash immediately following the sync. write? Brian > + */ > + rmt->rm_lsn = cpu_to_be64(NULLCOMMITLSN); > + > return sizeof(struct xfs_attr3_rmt_hdr); > } > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 01dd228..480ebba 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -1886,9 +1886,14 @@ xlog_recover_get_buf_lsn( > uuid = &((struct xfs_dir3_blk_hdr *)blk)->uuid; > break; > case XFS_ATTR3_RMT_MAGIC: > - lsn = be64_to_cpu(((struct xfs_attr3_rmt_hdr *)blk)->rm_lsn); > - uuid = &((struct xfs_attr3_rmt_hdr *)blk)->rm_uuid; > - break; > + /* > + * Remote attr blocks are written synchronously, rather than > + * being logged. That means they do not contain a valid LSN > + * (i.e. transactionally ordered) in them, and hence any time we > + * see a buffer to replay over the top of a remote attribute > + * block we should simply do so. > + */ > + goto recover_immediately; > case XFS_SB_MAGIC: > lsn = be64_to_cpu(((struct xfs_dsb *)blk)->sb_lsn); > uuid = &((struct xfs_dsb *)blk)->sb_uuid; > -- > 2.0.0 > > _______________________________________________ > xfs mailing list > xfs@xxxxxxxxxxx > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs