On Fri, Sep 25, 2015 at 08:24:29AM -0400, Brian Foster wrote:
> Since the onset of v5 superblocks, the LSN of the last modification has
> been included in a variety of on-disk data structures. This LSN is used
> to provide log recovery ordering guarantees (e.g., to ensure an older
> log recovery item is not replayed over a newer target data structure).
> 
> While this works correctly from the point a filesystem is formatted and
> mounted, userspace tools have some problematic behaviors that defeat
> this mechanism. For example, xfs_repair historically zeroes out the log
> unconditionally (regardless of whether corruption is detected). If this
> occurs, the LSN of the filesystem is reset and the log is now in a
> problematic state with respect to on-disk metadata structures that might
> have a larger LSN. Until either the log catches up to the highest
> previously used metadata LSN or each affected data structure is modified
> and written out without incident (which resets the metadata LSN), log
> recovery is susceptible to filesystem corruption.
> 
> This problem is ultimately addressed and repaired in the associated
> userspace tools. The kernel is still responsible to detect the problem
> and notify the user that something is wrong. Check the superblock LSN at
> mount time and fail the mount if it is invalid. From that point on,
> trigger verifier failure on any metadata I/O where an invalid LSN is
> detected. This results in a filesystem shutdown and guarantees that we
> do not log metadata changes with invalid LSNs on disk. Since this is a
> known issue with a known recovery path, present a warning to instruct
> the user how to recover.
> 
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
> 
> Here's a first non-rfc version of v5 sb metadata LSN verification. The
> biggest difference with this version is the attempt to mitigate lock
> contention in the LSN helper and corresponding tweak to how the log
> fields are updated when the head wraps around the end of the log. The
> idea is to do a racy check and only acquire the lock when there appears
> to be risk of an invalid LSN.
> 
> AFAICS, this is not safe with the current code as the current LSN can be
> sampled in a state that transiently pushes the LSN forward to the end of
> the next cycle (if the cycle is updated, both cycle/block are sampled,
> and then the block is updated). This means that a false negative can
> occur if an invalid LSN happens to exist but is within the bounds of the
> next full log cycle. Therefore, this patch also tweaks the current LSN
> update and sample code to rewind the current block before the cycle is
> updated and uses a memory barrier to guarantee that the sample code
> always sees the rewound current block if it sees the updated cycle. Note
> that it is still possible to see the rewound block before the cycle
> update (a transient backwards LSN), but this is safe because such false
> positives are explicitly reverified with the lock held before failure is
> triggered.
> 
> Otherwise, this refactors/relocates the helpers and converts back to the
> more simple behavior of verifier failure on detection of invalid
> metadata LSN. Thoughts, reviews, flames appreciated.
> 
> Brian

.....
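For readers following the cover letter, the "racy check, take the lock only on
apparent failure" idea boils down to roughly the sketch below. This is an
illustrative reduction, not the patch code: the helper name is made up, and it
assumes the usual xfs_log_priv.h context (xlog_assign_lsn(), XFS_LSN_CMP(),
ACCESS_ONCE() and l_icloglock are the real helpers/fields being relied on;
the actual patch splits this between xfs_log_check_lsn() and xlog_valid_lsn()).

/*
 * Sketch only: unlocked check of a metadata LSN against the current log
 * head, falling back to a locked recheck before declaring it invalid.
 */
static bool
sketch_metadata_lsn_valid(
	struct xlog	*log,
	xfs_lsn_t	lsn)
{
	int		cycle, block;
	xfs_lsn_t	cur_lsn;
	bool		valid;

	/*
	 * Unlocked sample. Because the writer rewinds l_curr_block before
	 * bumping l_curr_cycle (with a write barrier in between), this can
	 * only ever see a transiently *backwards* LSN, never a forward one.
	 */
	cycle = ACCESS_ONCE(log->l_curr_cycle);
	smp_rmb();
	block = ACCESS_ONCE(log->l_curr_block);
	cur_lsn = xlog_assign_lsn(cycle, block);

	if (XFS_LSN_CMP(lsn, cur_lsn) <= 0)
		return true;	/* common case: clearly valid, no lock taken */

	/*
	 * Apparent failure may be a false positive from racing a cycle wrap;
	 * confirm against a stable sample taken under the lock.
	 */
	spin_lock(&log->l_icloglock);
	cur_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
	valid = (XFS_LSN_CMP(lsn, cur_lsn) <= 0);
	spin_unlock(&log->l_icloglock);

	return valid;
}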
> @@ -243,8 +244,14 @@ bool
>  xfs_btree_lblock_verify_crc(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_sb_version_hascrc(&bp->b_target->bt_mount->m_sb))
> +	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
> +	struct xfs_mount	*mp = bp->b_target->bt_mount;
> +
> +	if (xfs_sb_version_hascrc(&bp->b_target->bt_mount->m_sb)) {

mp->m_sb.

>  xfs_btree_sblock_verify_crc(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_sb_version_hascrc(&bp->b_target->bt_mount->m_sb))
> +	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
> +	struct xfs_mount	*mp = bp->b_target->bt_mount;
> +
> +	if (xfs_sb_version_hascrc(&bp->b_target->bt_mount->m_sb)) {

Ditto.

> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index be43248..e89a0f8 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -39,6 +39,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_cksum.h"
>  #include "xfs_buf_item.h"
> +#include "xfs_log.h"
> 
>  /*
>   * xfs_da_btree.c
> @@ -150,6 +151,8 @@ xfs_da3_node_verify(
>  			return false;
>  		if (be64_to_cpu(hdr3->info.blkno) != bp->b_bn)
>  			return false;
> +		if (!xfs_log_check_lsn(mp, be64_to_cpu(hdr3->info.lsn)))
> +			return false;
>  	} else {
>  		if (ichdr.magic != XFS_DA_NODE_MAGIC)
>  			return false;
> @@ -322,6 +325,7 @@ xfs_da3_node_create(
>  	if (xfs_sb_version_hascrc(&mp->m_sb)) {
>  		struct xfs_da3_node_hdr *hdr3 = bp->b_addr;
> 
> +		memset(hdr3, 0, sizeof(struct xfs_da3_node_hdr));
>  		ichdr.magic = XFS_DA3_NODE_MAGIC;
>  		hdr3->info.blkno = cpu_to_be64(bp->b_bn);
>  		hdr3->info.owner = cpu_to_be64(args->dp->i_ino);

Is this a bug fix or just cleanliness? The LSN gets written into the header
before it goes to disk, so there is nothing uninitialised that I am aware of
ending up on disk from this.

> @@ -60,6 +61,7 @@ xfs_symlink_hdr_set(
>  	if (!xfs_sb_version_hascrc(&mp->m_sb))
>  		return 0;
> 
> +	memset(dsl, 0, sizeof(struct xfs_dsymlink_hdr));
>  	dsl->sl_magic = cpu_to_be32(XFS_SYMLINK_MAGIC);
>  	dsl->sl_offset = cpu_to_be32(offset);
>  	dsl->sl_bytes = cpu_to_be32(size);

Ditto.

> index aaadee0..0c8ef76 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3165,11 +3165,19 @@ xlog_state_switch_iclogs(
>  	}
> 
>  	if (log->l_curr_block >= log->l_logBBsize) {
> +		/*
> +		 * Rewind the current block before the cycle is bumped to make
> +		 * sure that the combined LSN never transiently moves forward
> +		 * when the log wraps to the next cycle. This is to support the
> +		 * unlocked sample of these fields from xlog_valid_lsn(). Most
> +		 * other cases should acquire l_icloglock.
> +		 */
> +		log->l_curr_block -= log->l_logBBsize;
> +		ASSERT(log->l_curr_block >= 0);
> +		smp_wmb();
>  		log->l_curr_cycle++;

OK, so we make sure the write of the l_curr_block cannot be reordered to
after the write of the l_curr_cycle.
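To make the ordering requirement concrete, here is a worked interleaving with
made-up numbers (not taken from the thread): assume l_logBBsize = 0x8000 and a
wrap from cycle 5 to cycle 6.

/*
 * Old order (cycle bumped first), with an unlocked reader racing in between:
 *
 *   writer                                unlocked reader
 *   ------                                ---------------
 *   l_curr_cycle = 6;
 *                                         cycle = 6;
 *                                         block = 0x8000;  (not yet rewound)
 *   l_curr_block -= 0x8000;  (-> 0)
 *
 * The reader computes (cycle 6, block 0x8000) while the real head is (6, 0):
 * a transiently forward LSN, so a stale metadata LSN anywhere inside cycle 6
 * would wrongly pass the unlocked check (a false negative, with no second
 * chance to catch it).
 *
 * New order (block rewound, smp_wmb(), then cycle bumped): a reader that sees
 * cycle 6 is guaranteed to also see the rewound block, so the sampled LSN can
 * only be transiently behind (e.g. cycle 5, block 0). That can only produce a
 * false positive, which the locked recheck then clears.
 */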
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 950f3f9..8daba74 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -560,4 +560,55 @@ static inline void xlog_wait(wait_queue_head_t *wq, spinlock_t *lock)
>  	remove_wait_queue(wq, &wait);
>  }
> 
> +/*
> + * The LSN is valid so long as it is behind the current LSN. If it isn't, this
> + * means that the next log record that includes this metadata could have a
> + * smaller LSN. In turn, this means that the modification in the log would not
> + * replay.
> + */
> +static inline bool
> +xlog_valid_lsn(
> +	struct xlog	*log,
> +	xfs_lsn_t	lsn)
> +{
> +	int		cur_cycle;
> +	int		cur_block;
> +	bool		valid = true;
> +
> +	/*
> +	 * First, sample the current lsn without locking to avoid added
> +	 * contention from metadata I/O. The current cycle and block are updated
> +	 * (in xlog_state_switch_iclogs()) and read here in a particular order
> +	 * to avoid false negatives (e.g., thinking the metadata LSN is valid
> +	 * when it is not).
> +	 *
> +	 * The current block is always rewound before the cycle is bumped in
> +	 * xlog_state_switch_iclogs() to ensure the current LSN is never seen in
> +	 * a transiently forward state. Instead, we can see the LSN in a
> +	 * transiently behind state if we happen to race with a cycle wrap.
> +	 */
> +	cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> +	smp_rmb();
> +	cur_block = ACCESS_ONCE(log->l_curr_block);

And we make sure the read of the block cannot be reordered to before the
read of l_curr_cycle.

Looks ok to me. I'm going to commit this as is because I don't think the
memset()s above actually change anything on disk. I can always redo the
commit if this is in error...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx