On Fri, Nov 08, 2013 at 12:21:20PM -0600, Eric Sandeen wrote:
> On 10/31/13, 11:27 PM, Dave Chinner wrote:
> > So, this patch removes most of the performance and CPU usage
> > differential between v4 and v5 filesystems on traversal related
> > workloads.
>
> We already have some issues w/ larger inode clusters & freespace
> fragmentation resulting in the inability to create new clusters,
> right?

Yes, we do.

> How might this impact that problem? I understand that it may be
> a tradeoff.

It will make the problem slightly worse for workloads that already
suffer from this problem. For those that don't, it should have no
noticeable effect.

> > @@ -718,8 +719,22 @@ xfs_mountfs(
> >  	 * Set the inode cluster size.
> >  	 * This may still be overridden by the file system
> >  	 * block size if it is larger than the chosen cluster size.
> > +	 *
> > +	 * For v5 filesystems, scale the cluster size with the inode size to
> > +	 * keep a constant ratio of inode per cluster buffer, but only if mkfs
> > +	 * has set the inode alignment value appropriately for larger cluster
> > +	 * sizes.
> >  	 */
> >  	mp->m_inode_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
>
> Just thinking out loud here: So this is runtime only; nothing on disk sets
> the cluster size explicitly (granted, it never did).
>
> So moving back and forth across newer/older kernels will create clusters
> of different sizes on the same filesystem, right?

No - inodes are allocated in chunks, not clusters. Inode clusters
are the unit of IO we read and write inodes in.

> (In the very distant past, this same change could have happened if
> the amount of memory in a box changed (!) - see commit
> 425f9ddd534573f58df8e7b633a534fcfc16d44d; prior to that we set
> m_inode_cluster_size on the fly as well).

Right, I think I've already pointed that out.

> But sb_inoalignmt is a mkfs-set, on-disk feature. So we might start with
> i.e.
> this, where A1 are 8k alignment points, and 512 byte inodes, in clusters
> of size 8k / 16 inodes:
>
> A1           A1           A1           A1
> [ 16 inodes ][ 16 inodes ]             [ 16 inodes ]

Ok, here's where you go wrong. Inode chunks are always 64 inodes,
and so what you have on disk after any inode allocation is:

A1           A1           A1           A1
[ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ]

and sb_inoalign determines where A1 lands in terms of filesystem
blocks. With sb_inoalign = 2 and a 4k filesystem block size, you can
only align inode *chunks* to even filesystem blocks like so:

ODD   EVEN   ODD   EVEN   ODD   EVEN   ODD   EVEN   ODD   EVEN
      A1           A1           A1           A1           A1
      [ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ]

If we have 1kb filesystem blocks, then the equivalent sb_inoalign
value to give this same inode *chunk* layout is 8:

# mkfs.xfs -f -b size=1024 -i size=512 /dev/vdb
.....
# xfs_db -c "sb 0" -c "p inoalignmt" /dev/vdb
inoalignmt = 8
#

i.e:

45 6 70 1 2 345 6 70 1 2 345 6 70 1 2 345 6 70 1 2 345 6 70 1 2 ...
      A1           A1           A1           A1           A1
      [ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ]

And with the larger inode cluster sizes:

# mkfs.xfs -f -b size=1024 -i size=512 -m crc=1 /dev/vdb
.....
# xfs_db -c "sb 0" -c "p inoalignmt" /dev/vdb
inoalignmt = 16
#

And, yes, that's not actually out of the range we commonly test -
512 byte block size with 256 byte inodes:

# mkfs.xfs -f -b size=512 /dev/vdb
.....
# xfs_db -c "sb 0" -c "p inoalignmt" /dev/vdb
inoalignmt = 16
#

So we definitely already handle and test these inode alignment
configurations all the time....
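
The arithmetic behind the inoalignmt values above can be sketched in
a few lines of C. This is only an illustration: the helper names are
hypothetical and are not kernel or xfsprogs functions. The only
inputs assumed are the ones stated in this mail (64 inodes per chunk,
sb_inoalignmt expressed in filesystem blocks).

```c
#include <assert.h>
#include <stdint.h>

/* From the discussion above: inodes are always allocated 64 at a time. */
#define INODES_PER_CHUNK 64u

/*
 * Hypothetical helper: byte alignment of inode chunks implied by
 * sb_inoalignmt, which is expressed in filesystem blocks.
 */
uint32_t chunk_align_bytes(uint32_t inoalignmt, uint32_t block_size)
{
	return inoalignmt * block_size;
}

/* Hypothetical helper: number of inodes covered by one cluster buffer. */
uint32_t inodes_per_cluster(uint32_t cluster_size, uint32_t inode_size)
{
	/* The cluster buffer must hold a whole number of inodes. */
	assert(inode_size != 0 && cluster_size % inode_size == 0);
	return cluster_size / inode_size;
}

/* Hypothetical helper: cluster buffers needed to cover one inode chunk. */
uint32_t clusters_per_chunk(uint32_t cluster_size, uint32_t inode_size)
{
	uint32_t ipc = inodes_per_cluster(cluster_size, inode_size);

	/* A chunk must split evenly into cluster buffers. */
	assert(ipc != 0 && INODES_PER_CHUNK % ipc == 0);
	return INODES_PER_CHUNK / ipc;
}
```

With these helpers, chunk_align_bytes(2, 4096) and
chunk_align_bytes(8, 1024) both give an 8k chunk alignment, which is
why the two layouts drawn above are the same, and
chunk_align_bytes(16, 1024) gives the 16k alignment of the crc=1
case. For 512 byte inodes, clusters_per_chunk(8192, 512) is 4 and
clusters_per_chunk(16384, 512) is 2.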

> and in this case we couldn't bump up m_inode_cluster_size, lest we
> allocate a larger cluster on the smaller alignment & overlap:
>
> A1           A1           A1           A1
> [ 16 inodes ][ 16 inodes ]             [ 16 inodes ] <--- existing
>                           [       32 inodes        ] <--- new

To be able to bump up the inode cluster size, what we have to
guarantee is that the inode chunks align to the larger cluster
size like so:

A2                        A2
A1           A1           A1           A1
[ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ] <--- existing
[       32 inodes        ][       32 inodes        ] <--- new

i.e. inode chunk allocation needs to be aligned to A2, not A1 for
the correct alignment of the larger clusters. If we align to A1,
then this will happen:

A2                        A2                        A2
A1           A1           A1           A1           A1
             [ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ]
             [       32 inodes        ][       32 inodes        ] <--- new

And that is clearly broken. Hence, to ensure we can use larger inode
clusters, we have to ensure that the inode chunks are aligned
appropriately for those cluster sizes. If the chunks are
appropriately aligned for larger inode clusters (e.g. sb_inoalign =
4), then they are also appropriately aligned for inode cluster sizes
older kernels support.

> So the only other thing I wonder about is when we are handling
> pre-existing, smaller-than m_inode_cluster_size clusters.
>
> i.e. xfs_ifree_cluster() figures out the number of blocks &
> number of inodes in a cluster, based on the (now not
> constant) m_inode_cluster_size.
>
> What stops us from going off the end of a smaller cluster?

The fact that we calculate the number of inodes to process per
cluster based on the size of the cluster buffer (in blocks)
multiplied by the number of inodes per block. If the code didn't
work, we'd have found out a long time ago ;)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs