On Thu, Mar 14, 2013 at 11:03:21AM -0500, Ben Myers wrote:
> Dave,
>
> On Tue, Mar 12, 2013 at 11:30:42PM +1100, Dave Chinner wrote:
> > From: Christoph Hellwig <hch@xxxxxx>
> >
> > Add a new inode version with a larger core.  The primary objective is
> > to allow for a crc of the inode, and location information (uuid and ino)
> > to verify it was written in the right place.  We also extend it by:
> >
> >     a creation time (for Samba);
> >     a changecount (for NFSv4);
> >     a flush sequence (in LSN format for recovery);
> >     an additional inode flags field; and
> >     some additional padding.
> >
> > These additional fields are not implemented yet, but already laid
> > out in the structure.
> >
> > [dchinner@xxxxxxxxxx] Added LSN and flags field, some factoring and rework to
> > capture all the necessary information in the crc calculation.
>
> Comments and questions below.

....

> > @@ -190,8 +191,18 @@ xfs_ialloc_inode_init(
> >       * the new inode format, then use the new inode version.  Otherwise
> >       * use the old version so that old kernels will continue to be
> >       * able to use the file system.
> > +     *
> > +     * For v3 inodes, we also need to write the inode number into the inode,
> > +     * so calculate the first inode number of the chunk here as
> > +     * XFS_OFFBNO_TO_AGINO() only works on filesystem block boundaries, not
> > +     * cluster boundaries and so cannot be used in the cluster buffer loop
> > +     * below.
>
> I'm having some trouble understanding your comment.  Maybe you can help me:
>
> >       */
> > -    if (xfs_sb_version_hasnlink(&mp->m_sb))
> > +    if (xfs_sb_version_hascrc(&mp->m_sb)) {
> > +        version = 3;
> > +        ino = XFS_AGINO_TO_INO(mp, agno,
> > +                               XFS_OFFBNO_TO_AGINO(mp, agbno, 0));
> > +    } else if (xfs_sb_version_hasnlink(&mp->m_sb))
> >          version = 2;
> >      else
> >          version = 1;
> > @@ -217,13 +228,21 @@ xfs_ialloc_inode_init(
>
> My reading of the loop here is ...
>
> 210         for (j = 0; j < nbufs; j++) {
>
> for each inode cluster, j
>
> 211                 /*
> 212                  * Get the block.
> 213                  */
> 214                 d = XFS_AGB_TO_DADDR(mp, agno, agbno + (j * blks_per_cluster));
>
> convert to disk address (this AG, the AGBLOCK of the initial inode cluster plus
> (current cluster j * blocks per cluster))
>
> 215                 fbuf = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
> 216                                          mp->m_bsize * blks_per_cluster,
> 217                                          XBF_UNMAPPED);
>
> get a buffer at that disk address of length (filesystem block size times the
> number of blocks per cluster)
>
> which is the full length of the inode cluster
>
> 218                 if (!fbuf)
> 219                         return ENOMEM;
> 220                 /*
> 221                  * Initialize all inodes in this buffer and then log them.
> 222                  *
> 223                  * XXX: It would be much better if we had just one transaction
> 224                  *      to log a whole cluster of inodes instead of all the
> 225                  *      individual transactions causing a lot of log traffic.
> 226                  */
> 227                 fbuf->b_ops = &xfs_inode_buf_ops;
> 228                 xfs_buf_zero(fbuf, 0, ninodes << mp->m_sb.sb_inodelog);
>
> Zero the whole cluster, including literal areas
>
> 229                 for (i = 0; i < ninodes; i++) {
>
> for each inode, i
>
> 230                         int     ioffset = i << mp->m_sb.sb_inodelog;
> 231                         uint    isize = xfs_dinode_size(version);
> 232
> 233                         free = xfs_make_iptr(mp, fbuf, i);
>
> get a pointer into the buf to the beginning of i's inode core
>
> 234                         free->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> 235                         free->di_version = version;
> 236                         free->di_gen = cpu_to_be32(gen);
> 237                         free->di_next_unlinked = cpu_to_be32(NULLAGINO);
>
> initialize some important stuff
>
> 238
> 239                         if (version == 3) {
> 240                                 free->di_ino = cpu_to_be64(ino);
> 241                                 ino++;
>
> initialize ino on version 3 inodes, and add one to ino for the next run of
> this loop.
>
> It appears that for subsequent clusters where j > 1 this would stamp the wrong
> ino into the inode.

If it was stamping incorrect numbers into the inodes, the verifiers
would pick that up straight away. That's how I found that my initial
code was wrong.
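
The numbering in the loop can also be modelled outside the kernel. This
is only a sketch, not the kernel code: stamp_chunk() and its parameters
are made-up names for illustration, and the caller supplies the chunk
geometry. What it shows is that ino is calculated once for the chunk and
then simply incremented, so the count carries across cluster buffers
rather than being recomputed from j and i:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the inode numbering in the initialisation loop.
 * Not kernel code: stamp_chunk() and its parameters are hypothetical.
 *
 * first_ino: inode number of the first inode in the chunk (computed once)
 * nbufs:     number of cluster buffers in the chunk
 * ninodes:   number of inodes per cluster buffer
 * out:       receives nbufs * ninodes stamped inode numbers
 */
static void stamp_chunk(uint64_t first_ino, int nbufs, int ninodes,
                        uint64_t *out)
{
    uint64_t ino = first_ino;   /* calculated outside both loops */
    size_t n = 0;

    assert(out != NULL);

    for (int j = 0; j < nbufs; j++) {           /* each cluster buffer */
        for (int i = 0; i < ninodes; i++) {     /* each inode in it */
            out[n++] = ino;
            ino++;              /* carries over into the next buffer */
        }
    }
}
```

Called with first_ino = 64, nbufs = 2 and ninodes = 32 (assumed values
for one chunk), the second buffer starts at 96 instead of restarting at
64, so the 64 inodes of the chunk are stamped 64 through 127.
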

> Something like this would be better:
>     ino = XFS_AGINO_TO_INO(mp, agno,
>             XFS_OFFBNO_TO_AGINO(mp, agbno + (j * blks_per_cluster), i));
>     free->di_ino = cpu_to_be64(ino);

And that's exactly what my initial code did, and the verifiers
pointed out that every second filesystem block in an inode cluster
had incorrect inode numbers in it. Hence I changed the code to what
I have now and added the comment about XFS_OFFBNO_TO_AGINO only
working within a filesystem block, not across multiple filesystem
blocks....

(Finding this sort of problem is one of the reasons the verifiers
came first ;)

FWIW, 4k block size filesystems exercise the j > 0 path as the minimum
chunk size is 16k, and the cluster size is 8k. Hence we have nbufs =
2, and we initialise 32 inodes per cluster buffer. For 512 byte
inodes, we have nbufs = 4 and we initialise 16 inodes per cluster
buffer. So this code is most definitely being exercised and the
output is correct as far as I can validate...

$ for i in `seq 64 1 127`; do
> sudo xfs_db -c "inode $i" -c "p v3.inumber" /dev/vdc
> done
v3.inumber = 64
v3.inumber = 65
v3.inumber = 66
v3.inumber = 67
v3.inumber = 68
v3.inumber = 69
v3.inumber = 70
v3.inumber = 71
v3.inumber = 72
v3.inumber = 73
v3.inumber = 74
v3.inumber = 75
v3.inumber = 76
v3.inumber = 77
v3.inumber = 78
v3.inumber = 79
v3.inumber = 80
v3.inumber = 81
v3.inumber = 82
v3.inumber = 83
v3.inumber = 84
v3.inumber = 85
v3.inumber = 86
v3.inumber = 87
v3.inumber = 88
v3.inumber = 89
v3.inumber = 90
v3.inumber = 91
v3.inumber = 92
v3.inumber = 93
v3.inumber = 94
v3.inumber = 95
v3.inumber = 96
v3.inumber = 97
v3.inumber = 98
v3.inumber = 99
v3.inumber = 100
v3.inumber = 101
v3.inumber = 102
v3.inumber = 103
v3.inumber = 104
v3.inumber = 105
v3.inumber = 106
v3.inumber = 107
v3.inumber = 108
v3.inumber = 109
v3.inumber = 110
v3.inumber = 111
v3.inumber = 112
v3.inumber = 113
v3.inumber = 114
v3.inumber = 115
v3.inumber = 116
v3.inumber = 117
v3.inumber = 118
v3.inumber = 119
v3.inumber = 120
v3.inumber = 121
v3.inumber = 122
v3.inumber = 123
v3.inumber = 124
v3.inumber = 125
v3.inumber = 126
v3.inumber = 127

> >          xfs_buf_zero(fbuf, 0, ninodes << mp->m_sb.sb_inodelog);
> >          for (i = 0; i < ninodes; i++) {
> >              int     ioffset = i << mp->m_sb.sb_inodelog;
> > -            uint    isize = sizeof(struct xfs_dinode);
> > +            uint    isize = xfs_dinode_size(version);
> >
> >              free = xfs_make_iptr(mp, fbuf, i);
> >              free->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> >              free->di_version = version;
> >              free->di_gen = cpu_to_be32(gen);
> >              free->di_next_unlinked = cpu_to_be32(NULLAGINO);
> > +
> > +            if (version == 3) {
> > +                free->di_ino = cpu_to_be64(ino);
> > +                ino++;
> > +                uuid_copy(&free->di_uuid, &mp->m_sb.sb_uuid);
> > +                xfs_dinode_calc_crc(mp, free);
> > +            }
> > +
> >              xfs_trans_log_buf(tp, fbuf, ioffset, ioffset + isize - 1);
>
> If I have it right, it's ok not to log the literal area here (even though the
> crc was calculated including the literal area) because the log is protected by
> its own crcs and recovery will recalculate the crc.

Prior to CRCs it's OK not to log the literal areas because the
contents really don't matter. The entire buffer is zeroed because
it's faster than zeroing individual inode cores one by one and it
ensures that we can always tell a freshly allocated inode block with
xfs_db because the literal areas are all zero (i.e. good for
debugging).

But these are conveniences, not a necessity, and hence not logging
the literal areas reduces the overhead of logging inode allocations
*significantly*.

> What do we have in the
> literal area after log replay in that case?

For non-CRC inode buffers, it doesn't matter. But you are right that
it does matter for CRC enabled inode buffers as it will result in
the CRC in the inode core being incorrect. I'll have a think about
this - there are a couple of potential ways of solving the problem,
and I need to think about them a bit first.
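
To make the failure mode concrete, here is a minimal userspace sketch
of it. It is a model under stated assumptions, not the XFS code: the
256 byte inode and 176 byte core sizes are assumed, the CRC is kept in
the first four bytes of a toy layout rather than in the real di_crc
field, and crc32c(), inode_crc() and replay_core_only() are names made
up for the example. The CRC is calculated over the whole inode
(core plus zeroed literal area), only the core is "logged", and replay
copies that core over a disk block whose literal area may not be zero:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INODE_SIZE 256  /* assumed on-disk inode size */
#define CORE_SIZE  176  /* assumed size of the logged inode core */

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* CRC over the whole inode, with the toy CRC field (bytes 0-3) as zero. */
static uint32_t inode_crc(const uint8_t *inode)
{
    uint8_t tmp[INODE_SIZE];

    memcpy(tmp, inode, INODE_SIZE);
    memset(tmp, 0, 4);
    return crc32c(tmp, INODE_SIZE);
}

/* Returns 1 if the CRC stored in the inode matches its contents. */
static int inode_crc_ok(const uint8_t *inode)
{
    uint32_t stored;

    memcpy(&stored, inode, 4);
    return stored == inode_crc(inode);
}

/*
 * Replay only the core onto a disk block whose literal area ends in
 * the byte 'stale'. Returns 1 if the inode still verifies afterwards.
 */
static int replay_core_only(uint8_t stale)
{
    uint8_t in_memory[INODE_SIZE] = {0};    /* freshly zeroed buffer */
    uint8_t on_disk[INODE_SIZE] = {0};
    uint32_t crc;

    assert(CORE_SIZE < INODE_SIZE);

    in_memory[4] = 3;                       /* stand-in for core fields */
    crc = inode_crc(in_memory);             /* covers core + literal area */
    memcpy(in_memory, &crc, 4);

    on_disk[INODE_SIZE - 1] = stale;        /* old contents of the block */
    memcpy(on_disk, in_memory, CORE_SIZE);  /* replay the logged core only */

    return inode_crc_ok(on_disk);
}
```

replay_core_only(0) still verifies because the block on disk happened
to be zero already. Any non-zero stale byte in the literal area makes
the check fail, even though the replayed core itself is intact.
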

/me is now wondering if he should add his old "allocation create
transaction" patch in here to completely avoid the need for logging
inode buffers here for CRC enabled filesystems....

> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index d750c34..6d08eaa 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -1786,6 +1786,7 @@ xlog_recover_do_inode_buffer(
> >      xfs_agino_t             *buffer_nextp;
> >
> >      trace_xfs_log_recover_buf_inode_buf(mp->m_log, buf_f);
> > +    bp->b_ops = &xfs_inode_buf_ops;
> >
> >      inodes_per_buf = BBTOB(bp->b_io_length) >> mp->m_sb.sb_inodelog;
> >      for (i = 0; i < inodes_per_buf; i++) {
> > @@ -1930,6 +1931,9 @@ xlog_recover_do_reg_buffer(
> >      /* Shouldn't be any more regions */
> >      ASSERT(i == item->ri_total);
> >
> > +    /* Shouldn't be any more regions */
> > +    ASSERT(i == item->ri_total);
> > +
>
> That appears to be a duplicate of the assert above it.

Argh. Stupid tool problem - that hunk should have given a merge
failure, not applied with fuzz. I'll fix it up - a later patch
probably removes it....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx