Re: [PATCH 2/9] xfs: add support for large btree blocks

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 22 Feb 2013 12:34:56 +1100

On Fri, Feb 15, 2013 at 03:20:15PM -0600, Ben Myers wrote:
> On Tue, Jan 22, 2013 at 12:25:53AM +1100, Dave Chinner wrote:
> > From: Christoph Hellwig <hch@xxxxxx>
> > 
> > Add support for larger btree blocks that contains a CRC32C checksum,
> > a filesystem uuid and block number for detecting filesystem
> > consistency and out of place writes.
> > 
> > [dchinner@xxxxxxxxxx] Also include an owner field to allow reverse
> > mappings to be implemented for improved repairability and a LSN
> > field to so that log recovery can easily determine the last
> > modification that made it to disk for each buffer.
> > 
> > [dchinner@xxxxxxxxxx] Add buffer log format flags to indicate the
> > type of buffer to recovery so that we don't have to do blind magic
> > number tests to determine what the buffer is.
> > 
> > [dchinner@xxxxxxxxxx] Modified to fit into the verifier structure.
> 
> This patch is far too large for a good review.  It needs to be split up into
> it's various ideas which you outlined in patch 0.  If you need to add dead code
> in each piece and then enable it at the end, that's fine with me.

You want it broken up into 7 or 8 separate patches - what does it
gain us? It'll take a week for me to do the patch monkeying and to
retest and validate the resulting stack (i.e. it is bisectable, each
patch passes xfstests, etc), and in the end the code will be
identical.

As I've said before, there's a point where the tradeoff for
splitting patches up doesn't make sense. Asking a developer to do
days of work to end up with exactly the same code to save the
reviewer an hour or two is *not* a good tradeoff. Especially for the
first patch of a much larger 15-20 patch series which contains
several larger and more complex patches....

> Some comments below.
....
> > @@ -59,10 +61,10 @@ typedef __be32 xfs_alloc_ptr_t;
> >  
> >  /*
> >   * Btree block header size depends on a superblock flag.
> > - *
> > - * (not quite yet, but soon)
> >   */
> > -#define XFS_ALLOC_BLOCK_LEN(mp)	XFS_BTREE_SBLOCK_LEN
> > +#define XFS_ALLOC_BLOCK_LEN(mp) \
> > +	(xfs_sb_version_hascrc(&((mp)->m_sb)) ? \
> > +		XFS_BTREE_SBLOCK_CRC_LEN : XFS_BTREE_SBLOCK_LEN)
> 
> The allocbt changes seem to be a good fit for their own patch.

It's used in exactly 4 places and three of them are in the next 10
lines. That's trivial to validate as correct without needing them in
a separate patch.

> >  
> >  /*
> >   * Record, key, and pointer address macros for btree blocks.
> > diff --git a/fs/xfs/xfs_attr_leaf.c b/fs/xfs/xfs_attr_leaf.c
> > index f96a734..aa4765f 100644
> > --- a/fs/xfs/xfs_attr_leaf.c
> > +++ b/fs/xfs/xfs_attr_leaf.c
> > @@ -232,7 +232,7 @@ xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes)
> >  				return 0;
> >  			return dp->i_d.di_forkoff;
> >  		}
> > -		dsize = XFS_BMAP_BROOT_SPACE(dp->i_df.if_broot);
> > +		dsize = XFS_BMAP_BROOT_SPACE(mp, dp->i_df.if_broot);
> 
> Changes to XFS_BMAP_BROOT_SPACE are a good candidate for a separate patch, just
> as with LITINO.

Used in only 2 places. Trivial to validate.

> > @@ -3155,8 +3163,8 @@ xfs_bmap_extents_to_btree(
> >  	 * Do all this logging at the end so that
> >  	 * the root is at the right level.
> >  	 */
> > -	xfs_btree_log_block(cur, abp, XFS_BB_ALL_BITS);
> >  	xfs_btree_log_recs(cur, abp, 1, be16_to_cpu(ablock->bb_numrecs));
> > +	xfs_btree_log_block(cur, abp, XFS_BB_ALL_BITS);
> 
> Huh.  Why was that necessary?

It isn't.

> > @@ -3268,8 +3276,12 @@ xfs_bmap_local_to_extents(
> >  		*firstblock = args.fsbno;
> >  		bp = xfs_btree_get_bufl(args.mp, tp, args.fsbno, 0);
> >  		bp->b_ops = &xfs_bmbt_buf_ops;
> > +
> >  		memcpy(bp->b_addr, ifp->if_u1.if_data, ifp->if_bytes);
> 
> Conflicts now due to 1e82379b0

I know. It's fixed in my local copy....

> 
> > +
> > +		xfs_trans_buf_set_type(tp, bp, XFS_BLF_BTREE_BUF);
> 
> The xfs_trans_buf_set_type changes are a good candidate for a separate patch.

I could out the xfs_trans_buf_set_type() calls and infrstructure,
but that's only 20-30 lines of code. The rest of the code that
actually uses it (e.g. the log recovery code, the calls to set the
type) goes along with the CRC modifications, so the majority of the
change still has to sit in the current patch. Double handling the
code by adding some in one patch and the rest in another makes the
patches a pain to modify and test, and when it gives the same end
result.....

> >  /*
> >   * Btree block header size depends on a superblock flag.
> > - *
> > - * (not quite yet, but soon)
> >   */
> > -#define XFS_BMBT_BLOCK_LEN(mp)	XFS_BTREE_LBLOCK_LEN
> > +#define XFS_BMBT_BLOCK_LEN(mp) \
> > +	(xfs_sb_version_hascrc(&((mp)->m_sb)) ? \
> > +		XFS_BTREE_LBLOCK_CRC_LEN : XFS_BTREE_LBLOCK_LEN)
> >  
> >  #define XFS_BMBT_REC_ADDR(mp, block, index) \
> >  	((xfs_bmbt_rec_t *) \
> > @@ -186,12 +187,12 @@ typedef __be64 xfs_bmbt_ptr_t, xfs_bmdr_ptr_t;
> >  #define XFS_BMAP_BROOT_PTR_ADDR(mp, bb, i, sz) \
> >  	XFS_BMBT_PTR_ADDR(mp, bb, i, xfs_bmbt_maxrecs(mp, sz, 0))
> >  
> > -#define XFS_BMAP_BROOT_SPACE_CALC(nrecs) \
> > -	(int)(XFS_BTREE_LBLOCK_LEN + \
> > +#define XFS_BMAP_BROOT_SPACE_CALC(mp, nrecs) \
> > +	(int)(XFS_BMBT_BLOCK_LEN(mp) + \
> >  	       ((nrecs) * (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t))))
> >  
> > -#define XFS_BMAP_BROOT_SPACE(bb) \
> > -	(XFS_BMAP_BROOT_SPACE_CALC(be16_to_cpu((bb)->bb_numrecs)))
> > +#define XFS_BMAP_BROOT_SPACE(mp, bb) \
> > +	(XFS_BMAP_BROOT_SPACE_CALC(mp, be16_to_cpu((bb)->bb_numrecs)))
> >  #define XFS_BMDR_SPACE_CALC(nrecs) \
> >  	(int)(sizeof(xfs_bmdr_block_t) + \
> >  	       ((nrecs) * (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t))))
> 
> Are these large broots going to force attributes out of line?

No. The BMAP_BROOT is kept in memory, the BMDR is what is kept in
the inode on disk. What is kept in the inode is unchanged. See
xfs_iformat_btree()/xfs_bmdr_to_bmbt() and
xfs_iflush_fork()//xfs_bmbt_to_bmdr().

> > @@ -277,10 +367,8 @@ xfs_btree_dup_cursor(
> >  				*ncur = NULL;
> >  				return error;
> >  			}
> > -			new->bc_bufs[i] = bp;
> > -			ASSERT(!xfs_buf_geterror(bp));
> > -		} else
> > -			new->bc_bufs[i] = NULL;
> > +		}
> > +		new->bc_bufs[i] = bp;
> 
> Why remove the assert?

Because if there is a bp->b_error is set in xfs_trans_read_buf(),
then the error is returned and no buffer is returned. IOWs, if a
buffer is returned, it is guaranteed that bp->b_error is 0.  We
don't assert this on every buffer returned by xfs_trans_read_buf()
and so this is simply making the code consistent with all the other
code that reads buffers.

> >  void
> > +xfs_btree_init_block_int(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_btree_block	*buf,
> > +	xfs_daddr_t		blkno,
> > +	__u32			magic,
> > +	__u16			level,
> > +	__u16			numrecs,
> > +	__u64			owner,
> > +	unsigned int		flags)
> > +{
> > +	buf->bb_magic = cpu_to_be32(magic);
> > +	buf->bb_level = cpu_to_be16(level);
> > +	buf->bb_numrecs = cpu_to_be16(numrecs);
> > +
> > +	if (flags & XFS_BTREE_LONG_PTRS) {
> > +		buf->bb_u.l.bb_leftsib = cpu_to_be64(NULLDFSBNO);
> > +		buf->bb_u.l.bb_rightsib = cpu_to_be64(NULLDFSBNO);
> > +		if (flags & XFS_BTREE_CRC_BLOCKS) {
> > +			buf->bb_u.l.bb_blkno = cpu_to_be64(blkno);
> > +			buf->bb_u.l.bb_owner = cpu_to_be64(owner);
> > +			uuid_copy(&buf->bb_u.l.bb_uuid, &mp->m_sb.sb_uuid);
> > +			buf->bb_u.l.bb_pad = 0;
> > +		}
> > +	} else {
> > +		/* owner is a 32 bit value on short blocks */
> > +		__u32 __owner = (__u32)owner;
> > +
> > +		buf->bb_u.s.bb_leftsib = cpu_to_be32(NULLAGBLOCK);
> > +		buf->bb_u.s.bb_rightsib = cpu_to_be32(NULLAGBLOCK);
> > +		if (flags & XFS_BTREE_CRC_BLOCKS) {
> > +			buf->bb_u.s.bb_blkno = cpu_to_be64(blkno);
> > +			buf->bb_u.s.bb_owner = cpu_to_be32(__owner);
> > +			uuid_copy(&buf->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid);
> > +		}
> > +	}
> > +}
> > +
> 
> The xfs_btree_init_block_int might be good in it's own patch.

All that means is I have to split these hunks into one patch that
factors it, and another patch that then adds the new functionality.
More handling of the same code means more opportunity for mistakes
and bugs.

> > +void
> >  xfs_btree_init_block(
> >  	struct xfs_mount *mp,
> >  	struct xfs_buf	*bp,
> >  	__u32		magic,
> >  	__u16		level,
> >  	__u16		numrecs,
> > +	__u64		owner,
> >  	unsigned int	flags)
> >  {
> > -	struct xfs_btree_block	*new = XFS_BUF_TO_BLOCK(bp);
> > -
> > -	new->bb_magic = cpu_to_be32(magic);
> > -	new->bb_level = cpu_to_be16(level);
> > -	new->bb_numrecs = cpu_to_be16(numrecs);
> > -
> > -	if (flags & XFS_BTREE_LONG_PTRS) {
> > -		new->bb_u.l.bb_leftsib = cpu_to_be64(NULLDFSBNO);
> > -		new->bb_u.l.bb_rightsib = cpu_to_be64(NULLDFSBNO);
> > -	} else {
> > -		new->bb_u.s.bb_leftsib = cpu_to_be32(NULLAGBLOCK);
> > -		new->bb_u.s.bb_rightsib = cpu_to_be32(NULLAGBLOCK);
> > -	}
> > +	xfs_btree_init_block_int(mp, XFS_BUF_TO_BLOCK(bp), bp->b_bn,
> > +				 magic, level, numrecs, owner, flags);
> >  }
> >  
> >  STATIC void
> >  xfs_btree_init_block_cur(
> >  	struct xfs_btree_cur	*cur,
> > +	struct xfs_buf		*bp,
> >  	int			level,
> > -	int			numrecs,
> > -	struct xfs_buf		*bp)
> > +	int			numrecs)
> 
> Why rearrange the order of the args here?

1. so I got compile errors when looking for code that might need
special CRC support.

2. so it is consistent with all the other functions that pass (cur,
bp, ....) in that order....

> 
> >  {
> > -	xfs_btree_init_block(cur->bc_mp, bp, xfs_magics[cur->bc_btnum],
> > -			       level, numrecs, cur->bc_flags);
> > +	__u64 owner;
> > +
> > +	/*
> > +	 * we can pull the owner from the cursor right now as the different
> > +	 * owners align directly with the pointer size of the btree. This may
> > +	 * change in future, but is safe for current users of the generic btree
> > +	 * code.
> > +	 */
> > +	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
> > +		owner = cur->bc_private.b.ip->i_ino;
> > +	else
> > +		owner = cur->bc_private.a.agno;
> 
> This is something I can look into a bit later... but I'm a little flummoxed by
> this one.  Apparently only inodes use long pointers?

The BMBT is what uses them, and by definition they are rooted in an
inode. Therefore, anything using a long pointer btree has an inode
as it's owner.

> 
> > +
> > +	xfs_btree_init_block_int(cur->bc_mp, XFS_BUF_TO_BLOCK(bp), bp->b_bn,
> > +				 xfs_btree_magic(cur), level, numrecs,
> > +				 owner, cur->bc_flags);
> >  }
> >  
> >  /*
> >   * Return true if ptr is the last record in the btree and
> > - * we need to track updateѕ to this record.  The decision
> > + * we need to track updates to this record.  The decision
> 
> I can't find a change in that line.  Time for new glasses?

It's got a multi-byte UTF-8 character in it. Check the s on the end
of "updates".

> > @@ -1223,7 +1361,12 @@ xfs_btree_log_block(
> >  		offsetof(struct xfs_btree_block, bb_numrecs),
> >  		offsetof(struct xfs_btree_block, bb_u.s.bb_leftsib),
> >  		offsetof(struct xfs_btree_block, bb_u.s.bb_rightsib),
> > -		XFS_BTREE_SBLOCK_LEN
> > +		offsetof(struct xfs_btree_block, bb_u.s.bb_blkno),
> > +		offsetof(struct xfs_btree_block, bb_u.s.bb_lsn),
> > +		offsetof(struct xfs_btree_block, bb_u.s.bb_uuid),
> > +		offsetof(struct xfs_btree_block, bb_u.s.bb_owner),
> > +		offsetof(struct xfs_btree_block, bb_u.s.bb_crc),
> > +		XFS_BTREE_SBLOCK_CRC_LEN
> >  	};
> >  	static const short	loffsets[] = {	/* table of offsets (long) */
> >  		offsetof(struct xfs_btree_block, bb_magic),
> > @@ -1231,17 +1374,40 @@ xfs_btree_log_block(
> >  		offsetof(struct xfs_btree_block, bb_numrecs),
> >  		offsetof(struct xfs_btree_block, bb_u.l.bb_leftsib),
> >  		offsetof(struct xfs_btree_block, bb_u.l.bb_rightsib),
> > -		XFS_BTREE_LBLOCK_LEN
> > +		offsetof(struct xfs_btree_block, bb_u.l.bb_blkno),
> > +		offsetof(struct xfs_btree_block, bb_u.l.bb_lsn),
> > +		offsetof(struct xfs_btree_block, bb_u.l.bb_uuid),
> > +		offsetof(struct xfs_btree_block, bb_u.l.bb_owner),
> > +		offsetof(struct xfs_btree_block, bb_u.l.bb_crc),
> > +		offsetof(struct xfs_btree_block, bb_u.l.bb_pad),
> > +		XFS_BTREE_LBLOCK_CRC_LEN
> >  	};
> >  
> >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> >  	XFS_BTREE_TRACE_ARGBI(cur, bp, fields);
> >  
> >  	if (bp) {
> > +		int nbits;
> > +
> > +		if (cur->bc_flags & XFS_BTREE_CRC_BLOCKS) {
> > +			/*
> > +			 * We don't log the CRC when updating a btree
> > +			 * block but instead recreate it during log
> > +			 * recovery.  As the log buffers have checksums
> > +			 * of their this is safe and avoids logging a crc
> 				   own
> 
> > +			 * update in a lot of places.
> > +			 */
> > +			if (fields == XFS_BB_ALL_BITS)
> > +				fields = XFS_BB_ALL_BITS_CRC;
> > +			nbits = XFS_BB_NUM_BITS_CRC;
> > +		} else {
> > +			nbits = XFS_BB_NUM_BITS;
> > +		}
> >  		xfs_btree_offsets(fields,
> >  				  (cur->bc_flags & XFS_BTREE_LONG_PTRS) ?
> >  					loffsets : soffsets,
> > -				  XFS_BB_NUM_BITS, &first, &last);
> > +				  nbits, &first, &last);
> > +		xfs_trans_buf_set_type(cur->bc_tp, bp, XFS_BLF_BTREE_BUF);
> >  		xfs_trans_log_buf(cur->bc_tp, bp, first, last);
> >  	} else {
> >  		xfs_trans_log_inode(cur->bc_tp, cur->bc_private.b.ip,

I'll answer your question about the values of XFS_BB_NUM_BITS and
XFS_BB_NUM_BITS_CRC while we are looking at the code that uses it.
The value is the index into the soffset/loffsets tables to determine
what range of fields in ibtree block headers are to logged.

> > @@ -56,10 +60,23 @@ struct xfs_btree_block {
> >  		struct {
> >  			__be32		bb_leftsib;
> >  			__be32		bb_rightsib;
> > +
> > +			__be64		bb_blkno;
> > +			__be64		bb_lsn;
> > +			uuid_t		bb_uuid;
> > +			__be32		bb_owner;
> > +			__le32		bb_crc;
> >  		} s;			/* short form pointers */
> >  		struct	{
> >  			__be64		bb_leftsib;
> >  			__be64		bb_rightsib;
> > +
> > +			__be64		bb_blkno;
> > +			__be64		bb_lsn;
> > +			uuid_t		bb_uuid;
> > +			__be64		bb_owner;
> > +			__le32		bb_crc;
> > +			__be32		bb_pad; /* padding for alignment */
> >  		} l;			/* long form pointers */
> >  	} bb_u;				/* rest */
> >  };
> 
> struct xfs_btree_block {
>         __be32          bb_magic;       /* magic number for block type */
>         __be16          bb_level;       /* 0 is a leaf */
>         __be16          bb_numrecs;     /* current # of data records */
>         union {
>                 struct {
>                         __be32          bb_leftsib;
>                         __be32          bb_rightsib;
> 		} s;			/* short form pointers */
>                 struct {
>                         __be32          bb_leftsib;
>                         __be32          bb_rightsib;
>                         __be64          bb_blkno;
>                         __be64          bb_lsn;
>                         uuid_t          bb_uuid;
>                         __be32          bb_owner;
>                         __le32          bb_crc;
>                 } s_crc;		/* short form pointers with crcs */
>                 struct  {
>                         __be64          bb_leftsib;
>                         __be64          bb_rightsib;
> 		} l;			/* long form pointers */
>                 struct  {
>                         __be64          bb_leftsib;
>                         __be64          bb_rightsib;
>                         __be64          bb_blkno;
>                         __be64          bb_lsn;
>                         uuid_t          bb_uuid;
>                         __be64          bb_owner;
>                         __le32          bb_crc;
>                         __be32          bb_pad; /* padding for alignment */
>                 } l_crc;		/* long form pointers with crcs */
>         } bb_u;                         /* rest */
> };

Right now the shared btree code only needs to know about 2 different
types (short and long), and the differences between crc/non crc
btree blocks is completely irrelevant to most of the code and hidden
deep in the btree block IO/initialisation layers.

Defining the structures as you've suggested duplicates fields between
structures - that adds maintenance burden (i.e. have to keep stuff in
sync, and leave coments/build bombs to ensure they stay that way).
It also means means that we'll have 4 types of structures that will be
used interchangably (i.e. bb_u.l in some places and bb_u.l_crc in
others). That is, IMO, going to be far more confusing than just
having a single type of structure for each of the long and short
forms, and more likely to lead to bugs in the future.

As it is, we already have to use magic number/version checks to
determine what fields are valid, so having different structure types
doesn't actually add any code documentation value as we will
only be accessing the extended structures inside blocks of code
where we know that we are supposed to access them...

> 
> > @@ -67,6 +84,16 @@ struct xfs_btree_block {
> >  #define XFS_BTREE_SBLOCK_LEN	16	/* size of a short form block */
> >  #define XFS_BTREE_LBLOCK_LEN	24	/* size of a long form block */
> >  
> > +/* sizes of CRC enabled btree blocks */
> > +#define XFS_BTREE_SBLOCK_CRC_LEN	(XFS_BTREE_SBLOCK_LEN + 40)
> > +#define XFS_BTREE_LBLOCK_CRC_LEN	(XFS_BTREE_LBLOCK_LEN + 48)
> > +
> > +
> > +#define XFS_BTREE_SBLOCK_CRC_OFF \
> > +	offsetof(struct xfs_btree_block, bb_u.s.bb_crc)
> > +#define XFS_BTREE_LBLOCK_CRC_OFF \
> > +	offsetof(struct xfs_btree_block, bb_u.l.bb_crc)
> > +
> >  
> >  /*
> >   * Generic key, ptr and record wrapper structures.
> > @@ -101,13 +128,11 @@ union xfs_btree_rec {
> >  #define	XFS_BB_NUMRECS		0x04
> >  #define	XFS_BB_LEFTSIB		0x08
> >  #define	XFS_BB_RIGHTSIB		0x10
> > +#define	XFS_BB_BLKNO		0x20
> >  #define	XFS_BB_NUM_BITS		5
> >  #define	XFS_BB_ALL_BITS		((1 << XFS_BB_NUM_BITS) - 1)
> 
> Did XFS_BB_BLKNO sneak in from a subsequent patch?

No, there is the possibility we only want to log a change to the
block number in the extended header, and so we need to add that as a
loggable index.

> 
> > -
> > -/*
> > - * Magic numbers for btree blocks.
> > - */
> > -extern const __uint32_t	xfs_magics[];
> > +#define	XFS_BB_NUM_BITS_CRC	8
> > +#define	XFS_BB_ALL_BITS_CRC	((1 << XFS_BB_NUM_BITS_CRC) - 1)
> 
> I'm a little confused here.  We were using 5 bits, you added BB_BLKNO for 6,
> and now you're setting NUM_BITS_CRC to 8?

See the soffsets/loffsets table comment above.

> Probably more readable with the (1<<0) idiom, as below.

Just using what ia already there.

> > diff --git a/fs/xfs/xfs_dinode.h b/fs/xfs/xfs_dinode.h
> > index 88a3368..6b5bd17 100644
> > --- a/fs/xfs/xfs_dinode.h
> > +++ b/fs/xfs/xfs_dinode.h
> > @@ -107,8 +107,8 @@ typedef enum xfs_dinode_fmt {
> >  #define XFS_LITINO(mp, version) \
> >  	((int)(((mp)->m_sb.sb_inodesize) - sizeof(struct xfs_dinode)))
> >  
> > -#define	XFS_BROOT_SIZE_ADJ	\
> > -	(XFS_BTREE_LBLOCK_LEN - sizeof(xfs_bmdr_block_t))
> > +#define XFS_BROOT_SIZE_ADJ(ip) \
> > +	(XFS_BMBT_BLOCK_LEN((ip)->i_mount) - sizeof(xfs_bmdr_block_t))
> 
> A good candidate for a separate patch.

Again, only used in 3 places so shoul dbe easy to verify even in the
large patch...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs