On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote: > We have had a large number of recent reports about cloud filesystems > with "corrupt" inode records recently. They are all the same, and > feature a filesystem that has been grown from a small size to a > larger size (to 30G or 50G). In all cases, they have a very small > runt AG at the end of the filesystem. In the case of the 30GB > filesystems, this is 1031 blocks long. > > These filesystems start issuing corruption warnings when trying to > allocate an in a sparse cluster at block 1024 of the runt AG. At > this point, there shouldn't be a sparse inode cluster because there > isn't space to fit an entire inode chunk (8 blocks) at block 1024. > i.e. it is only 7 blocks from the end of the AG. > > Hence the first bug is that we allowed allocation of a sparse inode > cluster in this location when it should not have occurred. The first > patch in the series addresses this. > > However, there is actually nothing corrupt in the on-disk sparse > inode record or inode cluster at agbno 1024. It is a 32 inode > cluster, which means it is 4 blocks in length, so sits entirely > within the AG and every inode in the record is addressable and > accessible. The only thing we can't do is make the sparse inode > record whole - the inode allocation code cannot allocate another 4 > blocks that span beyond the end of the AG. Hence this inode record > and cluster remain sparse until all the inodes in it are freed and > the cluster removed from disk. > > The second bug is that we don't consider inodes beyond inode cluster > alignment at the end of an AG as being valid. When sparse inode > alignment is in use, we set the in-memory inode cluster alignment to > match the inode chunk alignment, and so the maximum valid inode > number is inode chunk aligned, not inode cluster aligned. Hence when > we have an inode cluster at the end of the AG - so the max inode > number is cluster aligned - we reject that entire cluster as being > invalid. > > As stated above, there is nothing corrupt about the sparse inode > cluster at the end of the AG, it just doesn't match an arbitrary > alignment validation restriction for inodes at the end of the AG. > Given we have production filesystems out there with sparse inode > clusters allocated with cluster alignment at the end of the AG, we > need to consider these inodes as valid and not error out with a > corruption report. The second patch addresses this. > > The third issue I found is that we never validate the > sb->sb_spino_align valid when we mount the filesystem. It could have > any value and we just blindly use it when calculating inode > allocation geometry. The third patch adds sb->sb_spino_align range > validation to the superblock verifier. > > There is one question that needs to be resolved in this patchset: if > we take patch 2 to allow sparse inodes at the end of the AG, why > would we need the change in patch 1? Indeed, at this point I have to > ask why we even need the min/max agbno guidelines to the inode chunk > allocation as we end up allowing any aligned location in the AG to > be used by sparse inodes. i.e. if we take patch 2, then patch 1 is > unnecessary and now we can remove a bunch of code (min/max_agbno > constraints) from the allocator paths... Yeah, I think we can only adjust the codebase to allow chunks that cross EOAG as long as the clusters don't. > I'd prefer that we take the latter path: ignore the first patch. > This results in more flexible behaviour, allows existing filesystems > with this issue to work without needing xfs_repair to fix them, and > we get to remove complexity from the code. Do xfs_repair/scrub trip over these sparse chunks that cross EOAG, or are they ok? --D > Thoughts? >