On Friday, February 24, 2017 11:08:52 AM Dave Chinner wrote:
> > However I still don't understand how non-aligned inode clusters (64 inodes in
> > a single cluster) would break the inode number to disk location
> > arithmetic calculations. Could you please explain it?
>
> First, you need to understand the correct terminology: "inode
> cluster" vs "inode chunk"
>
> inode chunk: unit of inode allocation, always 64 contiguous inodes
>         referenced by a single inobt record
>
> inode cluster: a buffer used to do inode IO, whose size is dependent
>         on superblock flags and field values
>
> The inobt record records the AGBNO of the inode chunk, and
> internally then indexes the inodes in the chunk from 0 to 63.
>
> Inode clusters are independent of the inobt record. The number of
> inodes in a cluster buffer is dependent on the inode size and the
> inode cluster buffer size. The size of the inode cluster buffer is
> dependent on filesystem block size and inode alignment.
>
> Take, for example, v4 filesystem, 4k fsb, 256 byte inodes, with
> everything indexed to zero to show that the maths to derive
> everything from chunk_agbno + ino # is simple:
>
> chunk @ agbno
>         +-------------------------------------------+
> ino#    0          16         32         48
>
> block        0          1          2          3
>         +----------+----------+----------+----------+
> agbno   +0         +1         +2         +3
>
> cluster           0                     1
>         +---------------------+---------------------+
> agbno   +0                    +2
>
> All nice and simple, yes? So we can clearly see that an inode number
> of AGBNO | INO# can be mapped to the physical block
>
>         chunk_agbno + (INO# / inodes per block)
>
> And the cluster buffer physical location is:
>
>         chunk_agbno + (INO# / inodes per cluster)
>
> Ok, so the math is simple (as you've noticed), but it doesn't
> explain the alignment constraints. The question is this:
> what assumption does this math make about the relationship
> between the inode number and the physical location of the inode?
>
> ....
>
> ....
>
> That's right, it assumes that chunk_agbno + INO# can only map to a
> single physical location and so never overlaps with another inode
> chunk. i.e. this cannot happen as a result of an inode free/alloc
> operation pair:
>
> Free:
> chunk @ agbno
>         +-------------------------------------------+
> ino#    0          16         32         48
>
> Alloc:
>                                  chunk @ agbno+3
>                                          +-------------------------------------------+
> ino#                                     0          16         32         48
>
> If we have the overlapping chunk allocation ranges like this, then
> we can have multiple inode numbers that map to the same physical
> location. In the above case, both of the inode numbers (agbno | 48)
> and (agbno + 3 | 0) map to the same physical location but they have
> different cluster buffer addresses (i.e. agbno+2 vs agbno+3).
>
> So, when you get an inode number, how do you know it is valid and
> you haven't raced with an unlink that just removed the underlying
> inode chunk? You can do a buffer lookup to see if it's stale, but
> that has all sorts of problems in that a key constraint is that we
> must not have overlapping buffers in the cache. How do we know what
> buffers we need to look up (and how do we do it in a race free
> manner) to ensure that all the original inode cluster buffers have
> been invalidated and their transactions committed during an
> allocation?
>
> IOWs, without jumping through all sorts of cluster buffer coherence
> validation hoops we end up with free vs allocation and free vs
> lookup races on the cluster buffers if we just use inode number
> conversions for physical buffer mapping. That's complex, costly and
> extremely error prone, so we essentially have to treat all inode
> numbers as untrusted because of these races.
>
> There are two ways to solve this problem.
>
>         1) always look up the inobt record for an inode number to
>         get the chunk_agbno from the inobt as locking the AGI for
>         lookup guarantees no alloc/lookup/free races can occur; or
>
>         2) Ensure that inode chunks never overlap by physically
>         aligning them at allocation time, hence ensuring that every
>         physical address maps to exactly one inode number and
>         cluster buffer address.
>
> XFS implemented 1) back in 1994 when inode cluster buffers were
> introduced. The issue with this is that doing an inobt lookup every
> time we want to map an inode number is excitingly expensive. If
> we know the inode number is correct (i.e. comes from other internal
> metadata that we've already validated), then this is overhead we can
> avoid if we constrain the disk format via method 2).
>
> That was done more than 20 years ago:
>
> commit 07d3e5d3764a8cf02d2e40397da0018c5c60f70a
> Author: Doug Doucette <doucette@xxxxxxxxxxxx>
> Date:   Tue Jun 4 19:08:18 1996 +0000
>
>     Support for aligned inode allocation (bug 385316). Support for
>     superblock versioning (bug 385292). Some cleanup.
>
> And we've used aligned inodes ever since....
>

Dave,

Thanks a lot for describing the decisions behind the requirement of inode
alignment.

--
chandan
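
The arithmetic and the overlap case discussed in this thread can be
reproduced with a small sketch. This is illustrative Python, not XFS
code: the function names, the constants, the sample agbno of 100, and
the choice to align chunks to the chunk size in blocks are all
assumptions made to match the 4k block, 256 byte inode example. To get
the cluster addresses shown in the diagram (agbno+0 and agbno+2), the
sketch scales the cluster index by the number of blocks per cluster.

```python
# Illustrative sketch only: hypothetical names, not the XFS kernel code.
INODES_PER_CHUNK = 64      # from the thread: a chunk is always 64 inodes
BLOCK_SIZE = 4096          # 4k filesystem block (example in the thread)
INODE_SIZE = 256           # 256 byte inodes (example in the thread)
CLUSTER_SIZE = 8192        # assumed: the 2-block cluster buffer in the diagram

INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE              # 16
INODES_PER_CLUSTER = CLUSTER_SIZE // INODE_SIZE          # 32
BLOCKS_PER_CLUSTER = CLUSTER_SIZE // BLOCK_SIZE          # 2
BLOCKS_PER_CHUNK = INODES_PER_CHUNK // INODES_PER_BLOCK  # 4


def inode_block(chunk_agbno, ino):
    """Block holding inode index `ino` (0..63) of the chunk at chunk_agbno."""
    return chunk_agbno + ino // INODES_PER_BLOCK


def cluster_block(chunk_agbno, ino):
    """First block of the cluster buffer that holds that inode."""
    return chunk_agbno + (ino // INODES_PER_CLUSTER) * BLOCKS_PER_CLUSTER


def chunks_overlap(agbno_a, agbno_b):
    """True if two chunks starting at these blocks share any block."""
    return abs(agbno_a - agbno_b) < BLOCKS_PER_CHUNK


def align_up(agbno, alignment=BLOCKS_PER_CHUNK):
    """Round a block number up to the next multiple of `alignment`."""
    return -(-agbno // alignment) * alignment


if __name__ == "__main__":
    freed, realloc = 100, 103  # the "agbno" and "agbno+3" chunks from the thread

    # Unaligned: (100 | 48) and (103 | 0) land in the same block, but the
    # cluster buffers that cover that block start at different addresses.
    print(inode_block(freed, 48), inode_block(realloc, 0))
    print(cluster_block(freed, 48), cluster_block(realloc, 0))
    print(chunks_overlap(freed, realloc))

    # Aligned: the new chunk is pushed to the next chunk-sized boundary,
    # so the two chunks can no longer share a block.
    aligned = align_up(realloc)
    print(aligned, chunks_overlap(freed, aligned))
```

With the chunk start aligned, (100 | 48) and (104 | 0) no longer share a
block, which is the property option 2) relies on: every physical address
maps back to exactly one inode number and one cluster buffer address.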