On Fri, Feb 15, 2013 at 03:20:15PM -0600, Ben Myers wrote: > On Tue, Jan 22, 2013 at 12:25:53AM +1100, Dave Chinner wrote: > > From: Christoph Hellwig <hch@xxxxxx> > > > > Add support for larger btree blocks that contains a CRC32C checksum, > > a filesystem uuid and block number for detecting filesystem > > consistency and out of place writes. > > > > [dchinner@xxxxxxxxxx] Also include an owner field to allow reverse > > mappings to be implemented for improved repairability and a LSN > > field to so that log recovery can easily determine the last > > modification that made it to disk for each buffer. > > > > [dchinner@xxxxxxxxxx] Add buffer log format flags to indicate the > > type of buffer to recovery so that we don't have to do blind magic > > number tests to determine what the buffer is. > > > > [dchinner@xxxxxxxxxx] Modified to fit into the verifier structure. > > This patch is far too large for a good review. It needs to be split up into > it's various ideas which you outlined in patch 0. If you need to add dead code > in each piece and then enable it at the end, that's fine with me. You want it broken up into 7 or 8 separate patches - what does it gain us? It'll take a week for me to do the patch monkeying and to retest and validate the resulting stack (i.e. it is bisectable, each patch passes xfstests, etc), and in the end the code will be identical. As I've said before, there's a point where the tradeoff for splitting patches up doesn't make sense. Asking a developer to do days of work to end up with exactly the same code to save the reviewer an hour or two is *not* a good tradeoff. Especially for the first patch of a much larger 15-20 patch series which contains several larger and more complex patches.... > Some comments below. .... > > @@ -59,10 +61,10 @@ typedef __be32 xfs_alloc_ptr_t; > > > > /* > > * Btree block header size depends on a superblock flag. > > - * > > - * (not quite yet, but soon) > > */ > > -#define XFS_ALLOC_BLOCK_LEN(mp) XFS_BTREE_SBLOCK_LEN > > +#define XFS_ALLOC_BLOCK_LEN(mp) \ > > + (xfs_sb_version_hascrc(&((mp)->m_sb)) ? \ > > + XFS_BTREE_SBLOCK_CRC_LEN : XFS_BTREE_SBLOCK_LEN) > > The allocbt changes seem to be a good fit for their own patch. It's used in exactly 4 places and three of them are in the next 10 lines. That's trivial to validate as correct without needing them in a separate patch. > > > > /* > > * Record, key, and pointer address macros for btree blocks. > > diff --git a/fs/xfs/xfs_attr_leaf.c b/fs/xfs/xfs_attr_leaf.c > > index f96a734..aa4765f 100644 > > --- a/fs/xfs/xfs_attr_leaf.c > > +++ b/fs/xfs/xfs_attr_leaf.c > > @@ -232,7 +232,7 @@ xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes) > > return 0; > > return dp->i_d.di_forkoff; > > } > > - dsize = XFS_BMAP_BROOT_SPACE(dp->i_df.if_broot); > > + dsize = XFS_BMAP_BROOT_SPACE(mp, dp->i_df.if_broot); > > Changes to XFS_BMAP_BROOT_SPACE are a good candidate for a separate patch, just > as with LITINO. Used in only 2 places. Trivial to validate. > > @@ -3155,8 +3163,8 @@ xfs_bmap_extents_to_btree( > > * Do all this logging at the end so that > > * the root is at the right level. > > */ > > - xfs_btree_log_block(cur, abp, XFS_BB_ALL_BITS); > > xfs_btree_log_recs(cur, abp, 1, be16_to_cpu(ablock->bb_numrecs)); > > + xfs_btree_log_block(cur, abp, XFS_BB_ALL_BITS); > > Huh. Why was that necessary? It isn't. > > @@ -3268,8 +3276,12 @@ xfs_bmap_local_to_extents( > > *firstblock = args.fsbno; > > bp = xfs_btree_get_bufl(args.mp, tp, args.fsbno, 0); > > bp->b_ops = &xfs_bmbt_buf_ops; > > + > > memcpy(bp->b_addr, ifp->if_u1.if_data, ifp->if_bytes); > > Conflicts now due to 1e82379b0 I know. It's fixed in my local copy.... > > > + > > + xfs_trans_buf_set_type(tp, bp, XFS_BLF_BTREE_BUF); > > The xfs_trans_buf_set_type changes are a good candidate for a separate patch. I could out the xfs_trans_buf_set_type() calls and infrstructure, but that's only 20-30 lines of code. The rest of the code that actually uses it (e.g. the log recovery code, the calls to set the type) goes along with the CRC modifications, so the majority of the change still has to sit in the current patch. Double handling the code by adding some in one patch and the rest in another makes the patches a pain to modify and test, and when it gives the same end result..... > > /* > > * Btree block header size depends on a superblock flag. > > - * > > - * (not quite yet, but soon) > > */ > > -#define XFS_BMBT_BLOCK_LEN(mp) XFS_BTREE_LBLOCK_LEN > > +#define XFS_BMBT_BLOCK_LEN(mp) \ > > + (xfs_sb_version_hascrc(&((mp)->m_sb)) ? \ > > + XFS_BTREE_LBLOCK_CRC_LEN : XFS_BTREE_LBLOCK_LEN) > > > > #define XFS_BMBT_REC_ADDR(mp, block, index) \ > > ((xfs_bmbt_rec_t *) \ > > @@ -186,12 +187,12 @@ typedef __be64 xfs_bmbt_ptr_t, xfs_bmdr_ptr_t; > > #define XFS_BMAP_BROOT_PTR_ADDR(mp, bb, i, sz) \ > > XFS_BMBT_PTR_ADDR(mp, bb, i, xfs_bmbt_maxrecs(mp, sz, 0)) > > > > -#define XFS_BMAP_BROOT_SPACE_CALC(nrecs) \ > > - (int)(XFS_BTREE_LBLOCK_LEN + \ > > +#define XFS_BMAP_BROOT_SPACE_CALC(mp, nrecs) \ > > + (int)(XFS_BMBT_BLOCK_LEN(mp) + \ > > ((nrecs) * (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t)))) > > > > -#define XFS_BMAP_BROOT_SPACE(bb) \ > > - (XFS_BMAP_BROOT_SPACE_CALC(be16_to_cpu((bb)->bb_numrecs))) > > +#define XFS_BMAP_BROOT_SPACE(mp, bb) \ > > + (XFS_BMAP_BROOT_SPACE_CALC(mp, be16_to_cpu((bb)->bb_numrecs))) > > #define XFS_BMDR_SPACE_CALC(nrecs) \ > > (int)(sizeof(xfs_bmdr_block_t) + \ > > ((nrecs) * (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t)))) > > Are these large broots going to force attributes out of line? No. The BMAP_BROOT is kept in memory, the BMDR is what is kept in the inode on disk. What is kept in the inode is unchanged. See xfs_iformat_btree()/xfs_bmdr_to_bmbt() and xfs_iflush_fork()//xfs_bmbt_to_bmdr(). > > @@ -277,10 +367,8 @@ xfs_btree_dup_cursor( > > *ncur = NULL; > > return error; > > } > > - new->bc_bufs[i] = bp; > > - ASSERT(!xfs_buf_geterror(bp)); > > - } else > > - new->bc_bufs[i] = NULL; > > + } > > + new->bc_bufs[i] = bp; > > Why remove the assert? Because if there is a bp->b_error is set in xfs_trans_read_buf(), then the error is returned and no buffer is returned. IOWs, if a buffer is returned, it is guaranteed that bp->b_error is 0. We don't assert this on every buffer returned by xfs_trans_read_buf() and so this is simply making the code consistent with all the other code that reads buffers. > > void > > +xfs_btree_init_block_int( > > + struct xfs_mount *mp, > > + struct xfs_btree_block *buf, > > + xfs_daddr_t blkno, > > + __u32 magic, > > + __u16 level, > > + __u16 numrecs, > > + __u64 owner, > > + unsigned int flags) > > +{ > > + buf->bb_magic = cpu_to_be32(magic); > > + buf->bb_level = cpu_to_be16(level); > > + buf->bb_numrecs = cpu_to_be16(numrecs); > > + > > + if (flags & XFS_BTREE_LONG_PTRS) { > > + buf->bb_u.l.bb_leftsib = cpu_to_be64(NULLDFSBNO); > > + buf->bb_u.l.bb_rightsib = cpu_to_be64(NULLDFSBNO); > > + if (flags & XFS_BTREE_CRC_BLOCKS) { > > + buf->bb_u.l.bb_blkno = cpu_to_be64(blkno); > > + buf->bb_u.l.bb_owner = cpu_to_be64(owner); > > + uuid_copy(&buf->bb_u.l.bb_uuid, &mp->m_sb.sb_uuid); > > + buf->bb_u.l.bb_pad = 0; > > + } > > + } else { > > + /* owner is a 32 bit value on short blocks */ > > + __u32 __owner = (__u32)owner; > > + > > + buf->bb_u.s.bb_leftsib = cpu_to_be32(NULLAGBLOCK); > > + buf->bb_u.s.bb_rightsib = cpu_to_be32(NULLAGBLOCK); > > + if (flags & XFS_BTREE_CRC_BLOCKS) { > > + buf->bb_u.s.bb_blkno = cpu_to_be64(blkno); > > + buf->bb_u.s.bb_owner = cpu_to_be32(__owner); > > + uuid_copy(&buf->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid); > > + } > > + } > > +} > > + > > The xfs_btree_init_block_int might be good in it's own patch. All that means is I have to split these hunks into one patch that factors it, and another patch that then adds the new functionality. More handling of the same code means more opportunity for mistakes and bugs. > > +void > > xfs_btree_init_block( > > struct xfs_mount *mp, > > struct xfs_buf *bp, > > __u32 magic, > > __u16 level, > > __u16 numrecs, > > + __u64 owner, > > unsigned int flags) > > { > > - struct xfs_btree_block *new = XFS_BUF_TO_BLOCK(bp); > > - > > - new->bb_magic = cpu_to_be32(magic); > > - new->bb_level = cpu_to_be16(level); > > - new->bb_numrecs = cpu_to_be16(numrecs); > > - > > - if (flags & XFS_BTREE_LONG_PTRS) { > > - new->bb_u.l.bb_leftsib = cpu_to_be64(NULLDFSBNO); > > - new->bb_u.l.bb_rightsib = cpu_to_be64(NULLDFSBNO); > > - } else { > > - new->bb_u.s.bb_leftsib = cpu_to_be32(NULLAGBLOCK); > > - new->bb_u.s.bb_rightsib = cpu_to_be32(NULLAGBLOCK); > > - } > > + xfs_btree_init_block_int(mp, XFS_BUF_TO_BLOCK(bp), bp->b_bn, > > + magic, level, numrecs, owner, flags); > > } > > > > STATIC void > > xfs_btree_init_block_cur( > > struct xfs_btree_cur *cur, > > + struct xfs_buf *bp, > > int level, > > - int numrecs, > > - struct xfs_buf *bp) > > + int numrecs) > > Why rearrange the order of the args here? 1. so I got compile errors when looking for code that might need special CRC support. 2. so it is consistent with all the other functions that pass (cur, bp, ....) in that order.... > > > { > > - xfs_btree_init_block(cur->bc_mp, bp, xfs_magics[cur->bc_btnum], > > - level, numrecs, cur->bc_flags); > > + __u64 owner; > > + > > + /* > > + * we can pull the owner from the cursor right now as the different > > + * owners align directly with the pointer size of the btree. This may > > + * change in future, but is safe for current users of the generic btree > > + * code. > > + */ > > + if (cur->bc_flags & XFS_BTREE_LONG_PTRS) > > + owner = cur->bc_private.b.ip->i_ino; > > + else > > + owner = cur->bc_private.a.agno; > > This is something I can look into a bit later... but I'm a little flummoxed by > this one. Apparently only inodes use long pointers? The BMBT is what uses them, and by definition they are rooted in an inode. Therefore, anything using a long pointer btree has an inode as it's owner. > > > + > > + xfs_btree_init_block_int(cur->bc_mp, XFS_BUF_TO_BLOCK(bp), bp->b_bn, > > + xfs_btree_magic(cur), level, numrecs, > > + owner, cur->bc_flags); > > } > > > > /* > > * Return true if ptr is the last record in the btree and > > - * we need to track updateѕ to this record. The decision > > + * we need to track updates to this record. The decision > > I can't find a change in that line. Time for new glasses? It's got a multi-byte UTF-8 character in it. Check the s on the end of "updates". > > @@ -1223,7 +1361,12 @@ xfs_btree_log_block( > > offsetof(struct xfs_btree_block, bb_numrecs), > > offsetof(struct xfs_btree_block, bb_u.s.bb_leftsib), > > offsetof(struct xfs_btree_block, bb_u.s.bb_rightsib), > > - XFS_BTREE_SBLOCK_LEN > > + offsetof(struct xfs_btree_block, bb_u.s.bb_blkno), > > + offsetof(struct xfs_btree_block, bb_u.s.bb_lsn), > > + offsetof(struct xfs_btree_block, bb_u.s.bb_uuid), > > + offsetof(struct xfs_btree_block, bb_u.s.bb_owner), > > + offsetof(struct xfs_btree_block, bb_u.s.bb_crc), > > + XFS_BTREE_SBLOCK_CRC_LEN > > }; > > static const short loffsets[] = { /* table of offsets (long) */ > > offsetof(struct xfs_btree_block, bb_magic), > > @@ -1231,17 +1374,40 @@ xfs_btree_log_block( > > offsetof(struct xfs_btree_block, bb_numrecs), > > offsetof(struct xfs_btree_block, bb_u.l.bb_leftsib), > > offsetof(struct xfs_btree_block, bb_u.l.bb_rightsib), > > - XFS_BTREE_LBLOCK_LEN > > + offsetof(struct xfs_btree_block, bb_u.l.bb_blkno), > > + offsetof(struct xfs_btree_block, bb_u.l.bb_lsn), > > + offsetof(struct xfs_btree_block, bb_u.l.bb_uuid), > > + offsetof(struct xfs_btree_block, bb_u.l.bb_owner), > > + offsetof(struct xfs_btree_block, bb_u.l.bb_crc), > > + offsetof(struct xfs_btree_block, bb_u.l.bb_pad), > > + XFS_BTREE_LBLOCK_CRC_LEN > > }; > > > > XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY); > > XFS_BTREE_TRACE_ARGBI(cur, bp, fields); > > > > if (bp) { > > + int nbits; > > + > > + if (cur->bc_flags & XFS_BTREE_CRC_BLOCKS) { > > + /* > > + * We don't log the CRC when updating a btree > > + * block but instead recreate it during log > > + * recovery. As the log buffers have checksums > > + * of their this is safe and avoids logging a crc > own > > > + * update in a lot of places. > > + */ > > + if (fields == XFS_BB_ALL_BITS) > > + fields = XFS_BB_ALL_BITS_CRC; > > + nbits = XFS_BB_NUM_BITS_CRC; > > + } else { > > + nbits = XFS_BB_NUM_BITS; > > + } > > xfs_btree_offsets(fields, > > (cur->bc_flags & XFS_BTREE_LONG_PTRS) ? > > loffsets : soffsets, > > - XFS_BB_NUM_BITS, &first, &last); > > + nbits, &first, &last); > > + xfs_trans_buf_set_type(cur->bc_tp, bp, XFS_BLF_BTREE_BUF); > > xfs_trans_log_buf(cur->bc_tp, bp, first, last); > > } else { > > xfs_trans_log_inode(cur->bc_tp, cur->bc_private.b.ip, I'll answer your question about the values of XFS_BB_NUM_BITS and XFS_BB_NUM_BITS_CRC while we are looking at the code that uses it. The value is the index into the soffset/loffsets tables to determine what range of fields in ibtree block headers are to logged. > > @@ -56,10 +60,23 @@ struct xfs_btree_block { > > struct { > > __be32 bb_leftsib; > > __be32 bb_rightsib; > > + > > + __be64 bb_blkno; > > + __be64 bb_lsn; > > + uuid_t bb_uuid; > > + __be32 bb_owner; > > + __le32 bb_crc; > > } s; /* short form pointers */ > > struct { > > __be64 bb_leftsib; > > __be64 bb_rightsib; > > + > > + __be64 bb_blkno; > > + __be64 bb_lsn; > > + uuid_t bb_uuid; > > + __be64 bb_owner; > > + __le32 bb_crc; > > + __be32 bb_pad; /* padding for alignment */ > > } l; /* long form pointers */ > > } bb_u; /* rest */ > > }; > > struct xfs_btree_block { > __be32 bb_magic; /* magic number for block type */ > __be16 bb_level; /* 0 is a leaf */ > __be16 bb_numrecs; /* current # of data records */ > union { > struct { > __be32 bb_leftsib; > __be32 bb_rightsib; > } s; /* short form pointers */ > struct { > __be32 bb_leftsib; > __be32 bb_rightsib; > __be64 bb_blkno; > __be64 bb_lsn; > uuid_t bb_uuid; > __be32 bb_owner; > __le32 bb_crc; > } s_crc; /* short form pointers with crcs */ > struct { > __be64 bb_leftsib; > __be64 bb_rightsib; > } l; /* long form pointers */ > struct { > __be64 bb_leftsib; > __be64 bb_rightsib; > __be64 bb_blkno; > __be64 bb_lsn; > uuid_t bb_uuid; > __be64 bb_owner; > __le32 bb_crc; > __be32 bb_pad; /* padding for alignment */ > } l_crc; /* long form pointers with crcs */ > } bb_u; /* rest */ > }; Right now the shared btree code only needs to know about 2 different types (short and long), and the differences between crc/non crc btree blocks is completely irrelevant to most of the code and hidden deep in the btree block IO/initialisation layers. Defining the structures as you've suggested duplicates fields between structures - that adds maintenance burden (i.e. have to keep stuff in sync, and leave coments/build bombs to ensure they stay that way). It also means means that we'll have 4 types of structures that will be used interchangably (i.e. bb_u.l in some places and bb_u.l_crc in others). That is, IMO, going to be far more confusing than just having a single type of structure for each of the long and short forms, and more likely to lead to bugs in the future. As it is, we already have to use magic number/version checks to determine what fields are valid, so having different structure types doesn't actually add any code documentation value as we will only be accessing the extended structures inside blocks of code where we know that we are supposed to access them... > > > @@ -67,6 +84,16 @@ struct xfs_btree_block { > > #define XFS_BTREE_SBLOCK_LEN 16 /* size of a short form block */ > > #define XFS_BTREE_LBLOCK_LEN 24 /* size of a long form block */ > > > > +/* sizes of CRC enabled btree blocks */ > > +#define XFS_BTREE_SBLOCK_CRC_LEN (XFS_BTREE_SBLOCK_LEN + 40) > > +#define XFS_BTREE_LBLOCK_CRC_LEN (XFS_BTREE_LBLOCK_LEN + 48) > > + > > + > > +#define XFS_BTREE_SBLOCK_CRC_OFF \ > > + offsetof(struct xfs_btree_block, bb_u.s.bb_crc) > > +#define XFS_BTREE_LBLOCK_CRC_OFF \ > > + offsetof(struct xfs_btree_block, bb_u.l.bb_crc) > > + > > > > /* > > * Generic key, ptr and record wrapper structures. > > @@ -101,13 +128,11 @@ union xfs_btree_rec { > > #define XFS_BB_NUMRECS 0x04 > > #define XFS_BB_LEFTSIB 0x08 > > #define XFS_BB_RIGHTSIB 0x10 > > +#define XFS_BB_BLKNO 0x20 > > #define XFS_BB_NUM_BITS 5 > > #define XFS_BB_ALL_BITS ((1 << XFS_BB_NUM_BITS) - 1) > > Did XFS_BB_BLKNO sneak in from a subsequent patch? No, there is the possibility we only want to log a change to the block number in the extended header, and so we need to add that as a loggable index. > > > - > > -/* > > - * Magic numbers for btree blocks. > > - */ > > -extern const __uint32_t xfs_magics[]; > > +#define XFS_BB_NUM_BITS_CRC 8 > > +#define XFS_BB_ALL_BITS_CRC ((1 << XFS_BB_NUM_BITS_CRC) - 1) > > I'm a little confused here. We were using 5 bits, you added BB_BLKNO for 6, > and now you're setting NUM_BITS_CRC to 8? See the soffsets/loffsets table comment above. > Probably more readable with the (1<<0) idiom, as below. Just using what ia already there. > > diff --git a/fs/xfs/xfs_dinode.h b/fs/xfs/xfs_dinode.h > > index 88a3368..6b5bd17 100644 > > --- a/fs/xfs/xfs_dinode.h > > +++ b/fs/xfs/xfs_dinode.h > > @@ -107,8 +107,8 @@ typedef enum xfs_dinode_fmt { > > #define XFS_LITINO(mp, version) \ > > ((int)(((mp)->m_sb.sb_inodesize) - sizeof(struct xfs_dinode))) > > > > -#define XFS_BROOT_SIZE_ADJ \ > > - (XFS_BTREE_LBLOCK_LEN - sizeof(xfs_bmdr_block_t)) > > +#define XFS_BROOT_SIZE_ADJ(ip) \ > > + (XFS_BMBT_BLOCK_LEN((ip)->i_mount) - sizeof(xfs_bmdr_block_t)) > > A good candidate for a separate patch. Again, only used in 3 places so shoul dbe easy to verify even in the large patch... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs