Re: [PATCH 01/11] xfs: separate inode geometry

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Thu, 30 May 2019 15:33:13 -0700

On Thu, May 30, 2019 at 11:18:33AM +1000, Dave Chinner wrote:
> On Wed, May 29, 2019 at 03:26:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > Separate the inode geometry information into a distinct structure.
> 
> I like the idea, but have lots of comments on the patch....
> 
> > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > ---
> >  fs/xfs/libxfs/xfs_format.h       |   33 +++++++++++-
> >  fs/xfs/libxfs/xfs_ialloc.c       |  109 ++++++++++++++++++++------------------
> >  fs/xfs/libxfs/xfs_ialloc.h       |    6 +-
> >  fs/xfs/libxfs/xfs_ialloc_btree.c |   15 +++--
> >  fs/xfs/libxfs/xfs_inode_buf.c    |    2 -
> >  fs/xfs/libxfs/xfs_sb.c           |   24 +++++---
> >  fs/xfs/libxfs/xfs_trans_resv.c   |   17 +++---
> >  fs/xfs/libxfs/xfs_trans_space.h  |    7 +-
> >  fs/xfs/libxfs/xfs_types.c        |    4 +
> >  fs/xfs/scrub/ialloc.c            |   22 ++++----
> >  fs/xfs/scrub/quota.c             |    2 -
> >  fs/xfs/xfs_fsops.c               |    4 +
> >  fs/xfs/xfs_inode.c               |   17 +++---
> >  fs/xfs/xfs_itable.c              |   11 ++--
> >  fs/xfs/xfs_log_recover.c         |   23 ++++----
> >  fs/xfs/xfs_mount.c               |   49 +++++++++--------
> >  fs/xfs/xfs_mount.h               |   17 ------
> >  fs/xfs/xfs_super.c               |    6 +-
> >  18 files changed, 205 insertions(+), 163 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 9bb3c48843ec..66f527b1c461 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -1071,7 +1071,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> >  #define	XFS_INO_MASK(k)			(uint32_t)((1ULL << (k)) - 1)
> >  #define	XFS_INO_OFFSET_BITS(mp)		(mp)->m_sb.sb_inopblog
> >  #define	XFS_INO_AGBNO_BITS(mp)		(mp)->m_sb.sb_agblklog
> > -#define	XFS_INO_AGINO_BITS(mp)		(mp)->m_agino_log
> > +#define	XFS_INO_AGINO_BITS(mp)		((mp)->m_ino_geo.ig_agino_log)
> >  #define	XFS_INO_AGNO_BITS(mp)		(mp)->m_agno_log
> >  #define	XFS_INO_BITS(mp)		\
> >  	XFS_INO_AGNO_BITS(mp) + XFS_INO_AGINO_BITS(mp)
> > @@ -1694,4 +1694,35 @@ struct xfs_acl {
> >  #define SGI_ACL_FILE_SIZE	(sizeof(SGI_ACL_FILE)-1)
> >  #define SGI_ACL_DEFAULT_SIZE	(sizeof(SGI_ACL_DEFAULT)-1)
> >  
> > +struct xfs_ino_geometry {
> > +	/* Maximum inode count in this filesystem. */
> > +	uint64_t	ig_maxicount;
> 
> Naming is hard. What's the point of adding "ig_" prefix when the
> variables mostly already have an "i" somewhere in them that means
> "inode"?  And when they are referenced as igeo->ig_i...., then we've
> got so much redudant namespace in the code.....
> 
> This is one of the reasons when the struct xfs_da_geometry is not
> namespaced - it's clear from the code context it's
> directory/attribute geometry info, so it doesn't need lots of extra
> namespace gunk.

Ok, no more namespacing gunk.  Whoopee!! :)

> > +
> > +	/* Minimum inode buffer size, in bytes. */
> > +	unsigned int	ig_min_cluster_size;
> 
> What does the "minimum" in this variable mean? cluster size is fixed
> for a filesystem, there's no minimum or maximum size....

The comment came from xfs_mount.h ("min inode buf size"), which didn't
help a lot.  This variable and the two after have caused me quite a lot
of confusion over the years. :)

AFAICT, this value is some sort of "desired" cluster buffer size, which
for V4 filesystems is fixed at 8K and for V5 filesystems is calculated
as 8K * (inode size / 256)...

> > +	/* Inode cluster sizes, adjusted to be at least 1 fsb. */
> > +	unsigned int	ig_inodes_per_cluster;
> > +	unsigned int	ig_blocks_per_cluster;

...however, it's still possible for this "desired" cluster buffer size
to be less than a single fs block.  We don't support partial-block
buffers, so /most/ of the code uses ig_{inodes,blocks}_per_cluster or
xfs_icluster_size_fsb to walk through all the inodes in an actual inode
cluster buffer.

I'm not sure what to call ig_min_cluster_size (or its former name,
inode_cluster_size) since it's mostly aspirational.

In particular, if I fire up mkfs.xfs -m crc=1 -b size=64k -i size=512, I
see the following inode geometry:

(gdb) p mp->m_ino_geo
$1 = {
  maxicount = 1611776, 
  inode_cluster_size = 16384, 

Desired cluster size of 8192 * (512 / 256), or 16k.  However, we only
support 64k blocks, so..

  inodes_per_cluster = 128, 

...however, 65536/512 = 128, so this makes sense...

  blocks_per_cluster = 1, 

...1 block per cluster, as expected.

  cluster_align = 1, 
  cluster_align_inodes = 128, 
  inobt_mxr = {4092, 8185}, 
  inobt_mnr = {2046, 4092}, 
  inobt_maxlevels = 2, 
  ialloc_inos = 128, 
  ialloc_blks = 1, 
  ialloc_min_blks = 1, 
  inoalign_mask = 4294967295, 
  agino_log = 21, 
  sinoalign = 0
}

So we can't just use inodes_per_cluster as a stand-in for
inode_cluster_size >> inodelog, because they're not totally equivalent.

For comparison, this is what you get with a 4k block filesystem:

(gdb) p mp->m_ino_geo
$1 = {
  maxicount = 1611776, 
  inode_cluster_size = 16384, 
  inodes_per_cluster = 32, 
  blocks_per_cluster = 4, 
  cluster_align = 8, 
  cluster_align_inodes = 64, 
  inobt_mxr = {252, 505}, 
  inobt_mnr = {126, 252}, 
  inobt_maxlevels = 3, 
  ialloc_inos = 64, 
  ialloc_blks = 8, 
  ialloc_min_blks = 4, 
  inoalign_mask = 7, 
  agino_log = 21, 
  sinoalign = 0
}

In theory there shouldn't be /any/ users of inode_cluster_size except
for xfs_icluster_size_fsb() since it makes no sense to deal with a
partial inode cluster buffer, right?  With two exceptions, all users of
inode_cluster_size open-code rounding it up to at least 1FSB.  Those
cases can be converted to inodes_per_cluster.

However, there are those two cases that don't do that -- xfs_inobp_check
and xfs_iflush_cluster.  I don't see how either of these are correct for
64k-block filesystems?  xfs_inobp_check is a debugging function so maybe
it's less noticeable.

It seems to me that xfs_iflush_cluster only flushes the first part of an
inode cluster buffer when blocksizes are large?  On that 64k block
filesystem above, xfs_iflush will use xfs_iflush_cluster to see if there
are any other inodes that can be flushed out with the write, but since
inodes_cluster_size = 16384, it'll only look at the first 32 inodes in a
128-inode cluster buffer.  We probably never see any ill effects because
reclaim will eventually flush the other 96 inodes.

So I think a lot of these (inode_cluster_size >> inodelog) clauses can
be cleaned up, and I think there's a bug in xfs_iflush_cluster.  But
all that makes me hesitant to "just clean it all up", at least not
without someone else looking at this. :)

> > +
> > +	/* Inode cluster alignment. */
> > +	unsigned int	ig_cluster_align;
> > +	unsigned int	ig_cluster_align_inodes;
> > +
> > +	unsigned int	ig_inobt_mxr[2]; /* max inobt btree records */
> > +	unsigned int	ig_inobt_mnr[2]; /* min inobt btree records */
> > +	unsigned int	ig_in_maxlevels; /* max inobt btree levels. */
> 
> inobt_maxlevels?

Ok.

> > +
> > +	/* Minimum inode allocation size */
> > +	unsigned int	ig_ialloc_inos;
> > +	unsigned int	ig_ialloc_blks;
> 
> What's "minimum" about these values?

Hmm, nothing. :)

/* Size of inode allocations under normal operation */

> > +	/* Minimum inode blocks for a sparse allocation. */
> > +	unsigned int	ig_ialloc_min_blks;
> > +
> > +	unsigned int	ig_inoalign_mask;/* mask sb_inoalignmt if used */
> 
> This goes with the cluster alignment variables, it's always set by
> mkfs and used to convert inode numbers to cluster agbnos...

Ok.

> > +	unsigned int	ig_agino_log;	/* #bits for agino in inum */
> > +	unsigned int	ig_sinoalign;	/* stripe unit inode alignment */
> 
> And this one should be renamed ialloc_align and moved up with the
> the other ialloc variables....

Ok.

> > +};
> > +
> >  #endif /* __XFS_FORMAT_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index fe9898875097..c881e0521331 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -299,7 +299,7 @@ xfs_ialloc_inode_init(
> >  	 * sizes, manipulate the inodes in buffers  which are multiples of the
> >  	 * blocks size.
> >  	 */
> > -	nbufs = length / mp->m_blocks_per_cluster;
> > +	nbufs = length / mp->m_ino_geo.ig_blocks_per_cluster;
> >  
> >  	/*
> >  	 * Figure out what version number to use in the inodes we create.  If
> > @@ -343,9 +343,10 @@ xfs_ialloc_inode_init(
> >  		 * Get the block.
> >  		 */
> >  		d = XFS_AGB_TO_DADDR(mp, agno, agbno +
> > -				(j * mp->m_blocks_per_cluster));
> > +				(j * mp->m_ino_geo.ig_blocks_per_cluster));
> >  		fbuf = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
> > -					 mp->m_bsize * mp->m_blocks_per_cluster,
> > +					 mp->m_bsize *
> > +					 mp->m_ino_geo.ig_blocks_per_cluster,
> >  					 XBF_UNMAPPED);
> 
> This doesn't improve readability of the code. Please use a local
> igeom variable rather than repeatedly using mp->m_ino_geo.ig_....
> in the function.

Can I create an M_IGEO macro and replace mp->m_blocks_per_cluster with
M_IGEO(mp)->blocks_per_cluster instead?
> 
> 
> > @@ -690,10 +693,10 @@ xfs_ialloc_ag_alloc(
> >  		 * but not to use them in the actual exact allocation.
> >  		 */
> >  		args.alignment = 1;
> > -		args.minalignslop = args.mp->m_cluster_align - 1;
> > +		args.minalignslop = args.mp->m_ino_geo.ig_cluster_align - 1;
> 
> Ummm, why not igeo->... , like:
> 
> >  
> >  		/* Allow space for the inode btree to split. */
> > -		args.minleft = args.mp->m_in_maxlevels - 1;
> > +		args.minleft = igeo->ig_in_maxlevels - 1;
> 
> 3 lines down?

djwong drain bamage. :(

> >  		if ((error = xfs_alloc_vextent(&args)))
> >  			return error;
> >  
> > @@ -720,12 +723,12 @@ xfs_ialloc_ag_alloc(
> >  		 * pieces, so don't need alignment anyway.
> >  		 */
> >  		isaligned = 0;
> > -		if (args.mp->m_sinoalign) {
> > +		if (igeo->ig_sinoalign) {
> >  			ASSERT(!(args.mp->m_flags & XFS_MOUNT_NOALIGN));
> >  			args.alignment = args.mp->m_dalign;
> >  			isaligned = 1;
> >  		} else
> > -			args.alignment = args.mp->m_cluster_align;
> > +			args.alignment = args.mp->m_ino_geo.ig_cluster_align;
> 
> Ditto (and others).

Will fix...

> >  	int			noroom = 0;
> >  	xfs_agnumber_t		start_agno;
> >  	struct xfs_perag	*pag;
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> >  	int			okalloc = 1;
> >  
> >  	if (*IO_agbp) {
> > @@ -1733,9 +1737,9 @@ xfs_dialloc(
> >  	 * Read rough value of mp->m_icount by percpu_counter_read_positive,
> >  	 * which will sacrifice the preciseness but improve the performance.
> >  	 */
> > -	if (mp->m_maxicount &&
> > -	    percpu_counter_read_positive(&mp->m_icount) + mp->m_ialloc_inos
> > -							> mp->m_maxicount) {
> > +	if (mp->m_ino_geo.ig_maxicount &&
> 
> igeo?

All of...

> > +	    percpu_counter_read_positive(&mp->m_icount) + igeo->ig_ialloc_inos
> > +							> igeo->ig_maxicount) {
> >  		noroom = 1;
> >  		okalloc = 0;
> >  	}
> > @@ -1852,7 +1856,8 @@ xfs_difree_inode_chunk(
> >  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
> >  		/* not sparse, calculate extent info directly */
> >  		xfs_bmap_add_free(tp, XFS_AGB_TO_FSB(mp, agno, sagbno),
> > -				  mp->m_ialloc_blks, &XFS_RMAP_OINFO_INODES);
> > +				  mp->m_ino_geo.ig_ialloc_blks,
> > +				  &XFS_RMAP_OINFO_INODES);
> >  		return;
> >  	}
> >  
> > @@ -2261,7 +2266,7 @@ xfs_imap_lookup(
> >  
> >  	/* check that the returned record contains the required inode */
> >  	if (rec.ir_startino > agino ||
> > -	    rec.ir_startino + mp->m_ialloc_inos <= agino)
> > +	    rec.ir_startino + mp->m_ino_geo.ig_ialloc_inos <= agino)
> >  		return -EINVAL;
> >  
> >  	/* for untrusted inodes check it is allocated first */
> > @@ -2352,7 +2357,7 @@ xfs_imap(
> >  	 * If the inode cluster size is the same as the blocksize or
> >  	 * smaller we get to the buffer by simple arithmetics.
> >  	 */
> > -	if (mp->m_blocks_per_cluster == 1) {
> > +	if (mp->m_ino_geo.ig_blocks_per_cluster == 1) {
> 
> igeo...

These dorky...

> >  		offset = XFS_INO_TO_OFFSET(mp, ino);
> >  		ASSERT(offset < mp->m_sb.sb_inopblock);
> >  
> > @@ -2368,8 +2373,8 @@ xfs_imap(
> >  	 * find the location. Otherwise we have to do a btree
> >  	 * lookup to find the location.
> >  	 */
> > -	if (mp->m_inoalign_mask) {
> > -		offset_agbno = agbno & mp->m_inoalign_mask;
> > +	if (mp->m_ino_geo.ig_inoalign_mask) {
> > +		offset_agbno = agbno & mp->m_ino_geo.ig_inoalign_mask;
> 
> and here too.

little...

> >  		chunk_agbno = agbno - offset_agbno;
> >  	} else {
> >  		error = xfs_imap_lookup(mp, tp, agno, agino, agbno,
> > @@ -2381,13 +2386,13 @@ xfs_imap(
> >  out_map:
> >  	ASSERT(agbno >= chunk_agbno);
> >  	cluster_agbno = chunk_agbno +
> > -		((offset_agbno / mp->m_blocks_per_cluster) *
> > -		 mp->m_blocks_per_cluster);
> > +		((offset_agbno / mp->m_ino_geo.ig_blocks_per_cluster) *
> > +		 mp->m_ino_geo.ig_blocks_per_cluster);
> 
> And here.

omissions and...

> >  	offset = ((agbno - cluster_agbno) * mp->m_sb.sb_inopblock) +
> >  		XFS_INO_TO_OFFSET(mp, ino);
> >  
> >  	imap->im_blkno = XFS_AGB_TO_DADDR(mp, agno, cluster_agbno);
> > -	imap->im_len = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
> > +	imap->im_len = XFS_FSB_TO_BB(mp, mp->m_ino_geo.ig_blocks_per_cluster);
> 
> and here...

...untidy bits.

> >  	imap->im_boffset = (unsigned short)(offset << mp->m_sb.sb_inodelog);
> >  
> >  	/*
> > @@ -2409,7 +2414,7 @@ xfs_imap(
> >  }
> >  
> >  /*
> > - * Compute and fill in value of m_in_maxlevels.
> > + * Compute and fill in value of m_ino_geo.ig_in_maxlevels.
> >   */
> >  void
> >  xfs_ialloc_compute_maxlevels(
> > @@ -2418,8 +2423,8 @@ xfs_ialloc_compute_maxlevels(
> >  	uint		inodes;
> >  
> >  	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
> > -	mp->m_in_maxlevels = xfs_btree_compute_maxlevels(mp->m_inobt_mnr,
> > -							 inodes);
> > +	mp->m_ino_geo.ig_in_maxlevels = xfs_btree_compute_maxlevels(
> > +			mp->m_ino_geo.ig_inobt_mnr, inodes);
> 
> 
> Once we take away the macro:
> 
> 	inode = (1LL << igeo->agino_log) >> XFS_INODES_PER_CHUNK_LOG
> 	igeo->inobt_maxlevels = xfs_btree_compute_maxlevels(igeo->inobt_mnr,
> 							inodes);
> 
> So, shouldn't we just pass igeo into this function now?

Yeah.

> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
> > index e936b7cc9389..b74fa2addd51 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.h
> > +++ b/fs/xfs/libxfs/xfs_ialloc.h
> > @@ -28,9 +28,9 @@ static inline int
> >  xfs_icluster_size_fsb(
> >  	struct xfs_mount	*mp)
> >  {
> > -	if (mp->m_sb.sb_blocksize >= mp->m_inode_cluster_size)
> > +	if (mp->m_sb.sb_blocksize >= mp->m_ino_geo.ig_min_cluster_size)
> >  		return 1;
> > -	return mp->m_inode_cluster_size >> mp->m_sb.sb_blocklog;
> > +	return mp->m_ino_geo.ig_min_cluster_size >> mp->m_sb.sb_blocklog;
> >  }
> 
> The return value of this is placed in the mp->m_ino_geo structure.
> This should pass in the igeo structure the result is written into.
> It's other caller should be using the value in the igeo structure,
> not calling this function.

Ok.

> > index bc2dfacd2f4a..79cc5cf21e1b 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -28,7 +28,7 @@ xfs_inobt_get_minrecs(
> >  	struct xfs_btree_cur	*cur,
> >  	int			level)
> >  {
> > -	return cur->bc_mp->m_inobt_mnr[level != 0];
> > +	return cur->bc_mp->m_ino_geo.ig_inobt_mnr[level != 0];
> >  }
> 
> Put a igeo pointer in the inobt union of the btree cursor?
> 
> 	return cur->bc_private.a.igeo->inobt_mnr[level != 0];

Or just M_IGEO(cur->bc_mp)->inobt_mtr[level != 0]; ?

> >  
> >  STATIC struct xfs_btree_cur *
> > @@ -164,7 +164,7 @@ xfs_inobt_get_maxrecs(
> >  	struct xfs_btree_cur	*cur,
> >  	int			level)
> >  {
> > -	return cur->bc_mp->m_inobt_mxr[level != 0];
> > +	return cur->bc_mp->m_ino_geo.ig_inobt_mxr[level != 0];
> >  }
> >  
> >  STATIC void
> > @@ -281,10 +281,11 @@ xfs_inobt_verify(
> >  
> >  	/* level verification */
> >  	level = be16_to_cpu(block->bb_level);
> > -	if (level >= mp->m_in_maxlevels)
> > +	if (level >= mp->m_ino_geo.ig_in_maxlevels)
> >  		return __this_address;
> >  
> > -	return xfs_btree_sblock_verify(bp, mp->m_inobt_mxr[level != 0]);
> > +	return xfs_btree_sblock_verify(bp,
> > +			mp->m_ino_geo.ig_inobt_mxr[level != 0]);
> >  }
> >  
> >  static void
> > @@ -546,7 +547,7 @@ xfs_inobt_max_size(
> >  	xfs_agblock_t		agblocks = xfs_ag_block_count(mp, agno);
> >  
> >  	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > -	if (mp->m_inobt_mxr[0] == 0)
> > +	if (mp->m_ino_geo.ig_inobt_mxr[0] == 0)
> >  		return 0;
> >  
> >  	/*
> > @@ -558,7 +559,7 @@ xfs_inobt_max_size(
> >  	    XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart) == agno)
> >  		agblocks -= mp->m_sb.sb_logblocks;
> >  
> > -	return xfs_btree_calc_size(mp->m_inobt_mnr,
> > +	return xfs_btree_calc_size(mp->m_ino_geo.ig_inobt_mnr,
> >  				(uint64_t)agblocks * mp->m_sb.sb_inopblock /
> >  					XFS_INODES_PER_CHUNK);
> >  }
> > @@ -619,5 +620,5 @@ xfs_iallocbt_calc_size(
> >  	struct xfs_mount	*mp,
> >  	unsigned long long	len)
> >  {
> > -	return xfs_btree_calc_size(mp->m_inobt_mnr, len);
> > +	return xfs_btree_calc_size(mp->m_ino_geo.ig_inobt_mnr, len);
> 
> Should pass igeo into this function now, not xfs_mount.

Ok.

> >  }
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index e021d5133ccb..641aa1c2f1ae 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -36,7 +36,7 @@ xfs_inobp_check(
> >  	int		j;
> >  	xfs_dinode_t	*dip;
> >  
> > -	j = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
> > +	j = mp->m_ino_geo.ig_min_cluster_size >> mp->m_sb.sb_inodelog;
> 
> isn't that "inodes per cluster"?

Yes.  However, I think I want to keep the "move everything to structure"
in one patch and make a new one for "convert all the opencoded igeo bits".

> >  
> >  	for (i = 0; i < j; i++) {
> >  		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
> > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > index e76a3e5d28d7..9416fc741788 100644
> > --- a/fs/xfs/libxfs/xfs_sb.c
> > +++ b/fs/xfs/libxfs/xfs_sb.c
> > @@ -804,16 +804,18 @@ const struct xfs_buf_ops xfs_sb_quiet_buf_ops = {
> >   */
> >  void
> >  xfs_sb_mount_common(
> > -	struct xfs_mount *mp,
> > -	struct xfs_sb	*sbp)
> > +	struct xfs_mount	*mp,
> > +	struct xfs_sb		*sbp)
> >  {
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> > +
> >  	mp->m_agfrotor = mp->m_agirotor = 0;
> >  	mp->m_maxagi = mp->m_sb.sb_agcount;
> >  	mp->m_blkbit_log = sbp->sb_blocklog + XFS_NBBYLOG;
> >  	mp->m_blkbb_log = sbp->sb_blocklog - BBSHIFT;
> >  	mp->m_sectbb_log = sbp->sb_sectlog - BBSHIFT;
> >  	mp->m_agno_log = xfs_highbit32(sbp->sb_agcount - 1) + 1;
> > -	mp->m_agino_log = sbp->sb_inopblog + sbp->sb_agblklog;
> > +	mp->m_ino_geo.ig_agino_log = sbp->sb_inopblog + sbp->sb_agblklog;
> 
> igeo.
> 
> >  
> > @@ -307,7 +308,8 @@ xfs_calc_iunlink_remove_reservation(
> >  	struct xfs_mount        *mp)
> >  {
> >  	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> > -	       2 * max_t(uint, XFS_FSB_TO_B(mp, 1), mp->m_inode_cluster_size);
> > +	       2 * max_t(uint, XFS_FSB_TO_B(mp, 1),
> > +			 mp->m_ino_geo.ig_min_cluster_size);
> >  }
> 
> I'm starting to think a M_IGEO(mp)-like macro is in order here....

Already done.

> >  
> > diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
> > index 9b47117180cb..fa7386bf76e9 100644
> > --- a/fs/xfs/scrub/ialloc.c
> > +++ b/fs/xfs/scrub/ialloc.c
> > @@ -230,7 +230,7 @@ xchk_iallocbt_check_cluster(
> >  	int				error = 0;
> >  
> >  	nr_inodes = min_t(unsigned int, XFS_INODES_PER_CHUNK,
> > -			mp->m_inodes_per_cluster);
> > +			mp->m_ino_geo.ig_inodes_per_cluster);
> 
> igeo.... (many uses in this function)
> 
> > @@ -355,6 +356,7 @@ xchk_iallocbt_rec_alignment(
> >  {
> >  	struct xfs_mount		*mp = bs->sc->mp;
> >  	struct xchk_iallocbt		*iabt = bs->private;
> > +	struct xfs_ino_geometry		*ig = &mp->m_ino_geo;
> 
> igeo, for consistency with the rest of the code.
> 
> > @@ -2567,7 +2568,8 @@ xfs_ifree_cluster(
> >  		 * to mark all the active inodes on the buffer stale.
> >  		 */
> >  		bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
> > -					mp->m_bsize * mp->m_blocks_per_cluster,
> > +					mp->m_bsize *
> > +						igeo->ig_blocks_per_cluster,
> >  					XBF_UNMAPPED);
> 
> Back off the indent, don't use another line :)

Ok.

> > @@ -3476,19 +3478,20 @@ xfs_iflush_cluster(
> >  	int			cilist_size;
> >  	struct xfs_inode	**cilist;
> >  	struct xfs_inode	*cip;
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> >  	int			nr_found;
> >  	int			clcount = 0;
> >  	int			i;
> >  
> >  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> >  
> > -	inodes_per_cluster = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
> > +	inodes_per_cluster = igeo->ig_min_cluster_size >> mp->m_sb.sb_inodelog;
> 
> that's igeo->inodes_per_cluster again, right?

Yep.  I think the fact that we don't adjust m_inode_cluster_size to
match what xfs_icluster_size_fsb spits out means that the usage here is
incorrect, but we can fix in a subsequent patch.

> >  	cilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
> >  	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
> >  	if (!cilist)
> >  		goto out_put;
> >  
> > -	mask = ~(((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
> > +	mask = ~(((igeo->ig_min_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
> 
> Isn't that:
> 
> 	mask = ~(inodes_per_cluster - 1);

Yep.

> >  	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> >  	rcu_read_lock();
> >  	/* really need a gang lookup range call here */
> > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > index 1e1a0af1dd34..cff28ee73deb 100644
> > --- a/fs/xfs/xfs_itable.c
> > +++ b/fs/xfs/xfs_itable.c
> > @@ -167,6 +167,7 @@ xfs_bulkstat_ichunk_ra(
> >  	xfs_agnumber_t			agno,
> >  	struct xfs_inobt_rec_incore	*irec)
> >  {
> > +	struct xfs_ino_geometry		*igeo = &mp->m_ino_geo;
> >  	xfs_agblock_t			agbno;
> >  	struct blk_plug			plug;
> >  	int				i;	/* inode chunk index */
> > @@ -174,12 +175,14 @@ xfs_bulkstat_ichunk_ra(
> >  	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
> >  
> >  	blk_start_plug(&plug);
> > -	for (i = 0; i < XFS_INODES_PER_CHUNK;
> > -	     i += mp->m_inodes_per_cluster, agbno += mp->m_blocks_per_cluster) {
> > -		if (xfs_inobt_maskn(i, mp->m_inodes_per_cluster) &
> > +	for (i = 0;
> > +	     i < XFS_INODES_PER_CHUNK;
> > +	     i += igeo->ig_inodes_per_cluster,
> > +			agbno += igeo->ig_blocks_per_cluster) {
> > +		if (xfs_inobt_maskn(i, igeo->ig_inodes_per_cluster) &
> >  		    ~irec->ir_free) {
> >  			xfs_btree_reada_bufs(mp, agno, agbno,
> > -					mp->m_blocks_per_cluster,
> > +					igeo->ig_blocks_per_cluster,
> >  					&xfs_inode_buf_ops);
> >  		}
> 
> That's a mess :(
> 
> 	for (i = 0; i < XFS_INODES_PER_CHUNK; i += igeo->inodes_per_cluster) {
> 		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
> 							~irec->ir_free) {
> 			xfs_btree_reada_bufs(mp, agno, agbno,
> 					igeo->ig_blocks_per_cluster,
> 					&xfs_inode_buf_ops);
> 		}
> 		agbno += igeo->blocks_per_cluster;

<nod> I'll clean it up some more.

> 	}
> 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 9329f5adbfbe..15118e531184 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2882,19 +2882,19 @@ xlog_recover_buffer_pass2(
> >  	 *
> >  	 * Also make sure that only inode buffers with good sizes stay in
> >  	 * the buffer cache.  The kernel moves inodes in buffers of 1 block
> > -	 * or mp->m_inode_cluster_size bytes, whichever is bigger.  The inode
> > +	 * or ig_min_cluster_size bytes, whichever is bigger.  The inode
> >  	 * buffers in the log can be a different size if the log was generated
> >  	 * by an older kernel using unclustered inode buffers or a newer kernel
> >  	 * running with a different inode cluster size.  Regardless, if the
> > -	 * the inode buffer size isn't max(blocksize, mp->m_inode_cluster_size)
> > -	 * for *our* value of mp->m_inode_cluster_size, then we need to keep
> > +	 * the inode buffer size isn't max(blocksize, ig_min_cluster_size)
> > +	 * for *our* value of ig_min_cluster_size, then we need to keep
> >  	 * the buffer out of the buffer cache so that the buffer won't
> >  	 * overlap with future reads of those inodes.
> >  	 */
> >  	if (XFS_DINODE_MAGIC ==
> >  	    be16_to_cpu(*((__be16 *)xfs_buf_offset(bp, 0))) &&
> >  	    (BBTOB(bp->b_io_length) != max(log->l_mp->m_sb.sb_blocksize,
> > -			(uint32_t)log->l_mp->m_inode_cluster_size))) {
> > +			(uint32_t)log->l_mp->m_ino_geo.ig_min_cluster_size))) {
> 
> cluster size is already an unsigned int so the cast ca go.

Ok.

> >  		xfs_buf_stale(bp);
> >  		error = xfs_bwrite(bp);
> >  	} else {
> > @@ -3849,6 +3849,7 @@ xlog_recover_do_icreate_pass2(
> >  {
> >  	struct xfs_mount	*mp = log->l_mp;
> >  	struct xfs_icreate_log	*icl;
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> >  	xfs_agnumber_t		agno;
> >  	xfs_agblock_t		agbno;
> >  	unsigned int		count;
> > @@ -3898,10 +3899,10 @@ xlog_recover_do_icreate_pass2(
> >  
> >  	/*
> >  	 * The inode chunk is either full or sparse and we only support
> > -	 * m_ialloc_min_blks sized sparse allocations at this time.
> > +	 * m_ino_geo.ig_ialloc_min_blks sized sparse allocations at this time.
> >  	 */
> > -	if (length != mp->m_ialloc_blks &&
> > -	    length != mp->m_ialloc_min_blks) {
> > +	if (length != igeo->ig_ialloc_blks &&
> > +	    length != igeo->ig_ialloc_min_blks) {
> >  		xfs_warn(log->l_mp,
> >  			 "%s: unsupported chunk length", __FUNCTION__);
> >  		return -EINVAL;
> > @@ -3921,13 +3922,13 @@ xlog_recover_do_icreate_pass2(
> >  	 * buffers for cancellation so we don't overwrite anything written after
> >  	 * a cancellation.
> >  	 */
> > -	bb_per_cluster = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
> > -	nbufs = length / mp->m_blocks_per_cluster;
> > +	bb_per_cluster = XFS_FSB_TO_BB(mp, igeo->ig_blocks_per_cluster);
> > +	nbufs = length / igeo->ig_blocks_per_cluster;
> >  	for (i = 0, cancel_count = 0; i < nbufs; i++) {
> >  		xfs_daddr_t	daddr;
> >  
> > -		daddr = XFS_AGB_TO_DADDR(mp, agno,
> > -					 agbno + i * mp->m_blocks_per_cluster);
> > +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno +
> > +				i * igeo->ig_blocks_per_cluster);
> 
> makes no sense to change the line break location.
> 
> 		daddr = XFS_AGB_TO_DADDR(mp, agno,
> 				agbno + i * igeo->ig_blocks_per_cluster);

Ok.

> 
> 
> >   */
> >  STATIC void
> > -xfs_set_maxicount(xfs_mount_t *mp)
> > +xfs_set_maxicount(
> > +	struct xfs_mount	*mp)
> >  {
> > -	xfs_sb_t	*sbp = &(mp->m_sb);
> > -	uint64_t	icount;
> > +	struct xfs_sb		*sbp = &(mp->m_sb);
> 
> kill the ().
> 
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> > +	uint64_t		icount;
> >  
> >  	if (sbp->sb_imax_pct) {
> >  		/*
> > @@ -445,11 +447,11 @@ xfs_set_maxicount(xfs_mount_t *mp)
> >  		 */
> >  		icount = sbp->sb_dblocks * sbp->sb_imax_pct;
> >  		do_div(icount, 100);
> > -		do_div(icount, mp->m_ialloc_blks);
> > -		mp->m_maxicount = (icount * mp->m_ialloc_blks)  <<
> > -				   sbp->sb_inopblog;
> > +		do_div(icount, igeo->ig_ialloc_blks);
> > +		igeo->ig_maxicount = XFS_FSB_TO_INO(mp,
> > +				icount * igeo->ig_ialloc_blks);
> >  	} else {
> > -		mp->m_maxicount = 0;
> > +		igeo->ig_maxicount = 0;
> >  	}
> >  }
> >  
> > @@ -518,18 +520,18 @@ xfs_set_inoalignment(xfs_mount_t *mp)
> >  {
> >  	if (xfs_sb_version_hasalign(&mp->m_sb) &&
> >  		mp->m_sb.sb_inoalignmt >= xfs_icluster_size_fsb(mp))
> > -		mp->m_inoalign_mask = mp->m_sb.sb_inoalignmt - 1;
> > +		mp->m_ino_geo.ig_inoalign_mask = mp->m_sb.sb_inoalignmt - 1;
> >  	else
> > -		mp->m_inoalign_mask = 0;
> > +		mp->m_ino_geo.ig_inoalign_mask = 0;
> >  	/*
> >  	 * If we are using stripe alignment, check whether
> >  	 * the stripe unit is a multiple of the inode alignment
> >  	 */
> > -	if (mp->m_dalign && mp->m_inoalign_mask &&
> > -	    !(mp->m_dalign & mp->m_inoalign_mask))
> > -		mp->m_sinoalign = mp->m_dalign;
> > +	if (mp->m_dalign && mp->m_ino_geo.ig_inoalign_mask &&
> > +	    !(mp->m_dalign & mp->m_ino_geo.ig_inoalign_mask))
> > +		mp->m_ino_geo.ig_sinoalign = mp->m_dalign;
> >  	else
> > -		mp->m_sinoalign = 0;
> > +		mp->m_ino_geo.ig_sinoalign = 0;
> 
> should pass in igeo to this function....

Ok.

> >  }
> >  
> >  /*
> > @@ -683,6 +685,7 @@ xfs_mountfs(
> >  {
> >  	struct xfs_sb		*sbp = &(mp->m_sb);
> >  	struct xfs_inode	*rip;
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> >  	uint64_t		resblks;
> >  	uint			quotamount = 0;
> >  	uint			quotaflags = 0;
> > @@ -797,18 +800,20 @@ xfs_mountfs(
> >  	 * has set the inode alignment value appropriately for larger cluster
> >  	 * sizes.
> >  	 */
> > -	mp->m_inode_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
> > +	igeo->ig_min_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
> >  	if (xfs_sb_version_hascrc(&mp->m_sb)) {
> > -		int	new_size = mp->m_inode_cluster_size;
> > +		int	new_size = igeo->ig_min_cluster_size;
> >  
> >  		new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
> >  		if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, new_size))
> > -			mp->m_inode_cluster_size = new_size;
> > +			igeo->ig_min_cluster_size = new_size;
> >  	}
> > -	mp->m_blocks_per_cluster = xfs_icluster_size_fsb(mp);
> > -	mp->m_inodes_per_cluster = XFS_FSB_TO_INO(mp, mp->m_blocks_per_cluster);
> > -	mp->m_cluster_align = xfs_ialloc_cluster_alignment(mp);
> > -	mp->m_cluster_align_inodes = XFS_FSB_TO_INO(mp, mp->m_cluster_align);
> > +	igeo->ig_blocks_per_cluster = xfs_icluster_size_fsb(mp);
> > +	igeo->ig_inodes_per_cluster = XFS_FSB_TO_INO(mp,
> > +			igeo->ig_blocks_per_cluster);
> > +	igeo->ig_cluster_align = xfs_ialloc_cluster_alignment(mp);
> > +	igeo->ig_cluster_align_inodes = XFS_FSB_TO_INO(mp,
> > +			igeo->ig_cluster_align);
> 
> Can we separate out all the igeo initialsation into a single init
> function rather than being spread out over several functions?

I'll try it out.  The current spaghetti is pretty gross.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx