[Please use david@xxxxxxxxxxxxx if you want my attention for upstream stuff]

On Thu, Feb 23, 2017 at 11:51:12AM +0530, Chandan Rajendra wrote:
> Hi Dave,
>
> Last week, during a discussion on #xfs you stated that,
>
> "inode alignment is necessary to avoid needing to do an inobt lookup to find
> the location of the inode every time we map an inode to disk".
>
> Also, the following statement appears at
> http://xfs.org/index.php/Improving_inode_Caching,
>
> "The main problem we have is that XFS uses inode chunk size and alignment to
> optimise inode number to disk location conversion. That is, the conversion
> becomes a single set of shifts and masks instead of an AGI btree lookup. This
> optimisation substantially reduces the CPU and I/O overhead of inode lookups,
> but it does limit our flexibility. If we break the alignment restriction,
> every lookup has to go back to a btree search. Hence we really want to avoid
> breaking chunk alignment and size rules."
>
> For the 4k block size scenario, I noticed that we have inode alignment of 4
> blocks.

For some filesystem configurations. Not all.

> I did go through macros such as XFS_INO_TO_AGNO(),
> XFS_INO_TO_AGBNO() and XFS_INO_TO_OFFSET() which deal with extracting the
> components from an inode number.
>
> However, I still don't understand how non-aligned inode clusters (64 inodes
> in a single cluster) would break the inode number to disk location
> arithmetic calculations. Could you please explain it?

First, you need to understand the correct terminology: "inode cluster"
vs "inode chunk".

inode chunk: the unit of inode allocation, always 64 contiguous inodes
referenced by a single inobt record.

inode cluster: a buffer used to do inode IO, whose size depends on
superblock flags and field values.

The inobt record records the AGBNO of the inode chunk, and then
internally indexes the inodes in the chunk from 0 to 63. Inode
clusters are independent of the inobt record. The number of inodes in
a cluster buffer depends on the inode size and the inode cluster
buffer size. The size of the inode cluster buffer depends on the
filesystem block size and the inode alignment.

Take, for example, a v4 filesystem, 4k filesystem block size, 256 byte
inodes, with everything indexed to zero to show that the maths to
derive everything from chunk_agbno + ino# is simple:

chunk @ agbno
        +-------------------------------------------+
ino#     0          16         32         48
block    0          1          2          3
        +----------+----------+----------+----------+
agbno   +0         +1         +2         +3

cluster  0                     1
        +---------------------+---------------------+
agbno   +0                    +2

All nice and simple, yes? So we can clearly see that an inode number
of (AGBNO | INO#) can be mapped to the physical block:

	chunk_agbno + (INO# / inodes per block)

And the cluster buffer physical location is:

	chunk_agbno + (INO# / inodes per cluster)
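As a purely illustrative aside (my own sketch, not the kernel code),
here is what that arithmetic looks like for the example layout above:
16 inodes per 4k block, 32 inodes per 8k cluster buffer. The constant
and function names are made up for illustration; the kernel derives
the geometry from the superblock and uses the XFS_INO_TO_*() macros:

/*
 * Minimal sketch of the inode number -> disk location arithmetic for
 * the example layout above (v4, 4k blocks, 256 byte inodes, 8k inode
 * cluster buffers). Illustration only - not the XFS implementation.
 */
#include <stdio.h>

#define INODES_PER_BLOCK	16	/* 4096 / 256 */
#define INODES_PER_CLUSTER	32	/* 8192 / 256 */
#define BLOCKS_PER_CLUSTER	(INODES_PER_CLUSTER / INODES_PER_BLOCK)

/* physical block holding the inode: chunk_agbno + (INO# / inodes per block) */
static unsigned long ino_to_block_agbno(unsigned long chunk_agbno, unsigned int ino)
{
	return chunk_agbno + ino / INODES_PER_BLOCK;
}

/*
 * Cluster buffer location: (INO# / inodes per cluster) gives the
 * cluster index; converting that to filesystem blocks gives an AGBNO,
 * which is why cluster 1 in the diagram sits at agbno +2.
 */
static unsigned long ino_to_cluster_agbno(unsigned long chunk_agbno, unsigned int ino)
{
	return chunk_agbno + (ino / INODES_PER_CLUSTER) * BLOCKS_PER_CLUSTER;
}

int main(void)
{
	unsigned long chunk_agbno = 0;	/* indexed to zero, as in the diagram */
	unsigned int inos[] = { 0, 15, 16, 48, 63 };

	for (unsigned int i = 0; i < sizeof(inos) / sizeof(inos[0]); i++)
		printf("ino# %2u -> block agbno +%lu, cluster agbno +%lu\n",
		       inos[i],
		       ino_to_block_agbno(chunk_agbno, inos[i]),
		       ino_to_cluster_agbno(chunk_agbno, inos[i]));
	return 0;
}

No btree lookups anywhere - just divides (shifts and masks in the real
code, since the geometry is a power of two).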
Ok, so the math is simple (as you've noticed), but it doesn't explain
the alignment constraints. The question is this: what assumption does
this math make about the relationship between the inode number and
the physical location of the inode?

....

....

That's right, it assumes that chunk_agbno + INO# can only map to a
single physical location and so never overlaps with another inode
chunk. i.e. this cannot happen as the result of an inode free/alloc
operation pair:

Free:

chunk @ agbno
        +-------------------------------------------+
ino#     0          16         32         48

Alloc:

chunk @ agbno+3
                                 +-------------------------------------------+
ino#                              0          16         32         48

If we have overlapping chunk allocation ranges like this, then we can
have multiple inode numbers that map to the same physical location.
In the above case, both of the inode numbers (agbno | 48) and
(agbno+3 | 0) map to the same physical location, but they have
different cluster buffer addresses (i.e. agbno+2 vs agbno+3).

So, when you get an inode number, how do you know it is valid and you
haven't raced with an unlink that just removed the underlying inode
chunk? You can do a buffer lookup to see if it's stale, but that has
all sorts of problems, in that a key constraint is that we must not
have overlapping buffers in the cache. How do we know what buffers we
need to look up (and how do we do it in a race free manner) to ensure
that all the original inode cluster buffers have been invalidated and
their transactions committed during an allocation?

IOWs, without jumping through all sorts of cluster buffer coherence
validation hoops, we end up with free vs allocation and free vs
lookup races on the cluster buffers if we just use inode number
conversions for physical buffer mapping. That's complex, costly and
extremely error prone, so we essentially have to treat all inode
numbers as untrusted because of these races.

There are two ways to solve this problem:

1) always look up the inobt record for an inode number to get the
   chunk_agbno from the inobt, as locking the AGI for the lookup
   guarantees no alloc/lookup/free races can occur; or

2) ensure that inode chunks never overlap by physically aligning them
   at allocation time, hence ensuring that every physical address
   maps to exactly one inode number and cluster buffer address.

XFS implemented 1) back in 1994 when inode cluster buffers were
introduced. The issue with this is that doing an inobt lookup every
time we want to map an inode number is excitingly expensive. If we
know the inode number is correct (i.e. it comes from other internal
metadata that we've already validated), then this is overhead we can
avoid if we constrain the disk format via method 2). That was done
more than 20 years ago:

commit 07d3e5d3764a8cf02d2e40397da0018c5c60f70a
Author: Doug Doucette <doucette@xxxxxxxxxxxx>
Date:   Tue Jun 4 19:08:18 1996 +0000

    Support for aligned inode allocation (bug 385316).
    Support for superblock versioning (bug 385292).
    Some cleanup.

And we've used aligned inodes ever since....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx