Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 24 Oct 2024 09:38:04 -0700

On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> We have had a large number of recent reports about cloud filesystems
> with "corrupt" inode records recently. They are all the same, and
> feature a filesystem that has been grown from a small size to a
> larger size (to 30G or 50G). In all cases, they have a very small
> runt AG at the end of the filesystem.  In the case of the 30GB
> filesystems, this is 1031 blocks long.
> 
> These filesystems start issuing corruption warnings when trying to
> allocate an in a sparse cluster at block 1024 of the runt AG. At
> this point, there shouldn't be a sparse inode cluster because there
> isn't space to fit an entire inode chunk (8 blocks) at block 1024.
> i.e. it is only 7 blocks from the end of the AG.
> 
> Hence the first bug is that we allowed allocation of a sparse inode
> cluster in this location when it should not have occurred. The first
> patch in the series addresses this.
> 
> However, there is actually nothing corrupt in the on-disk sparse
> inode record or inode cluster at agbno 1024. It is a 32 inode
> cluster, which means it is 4 blocks in length, so sits entirely
> within the AG and every inode in the record is addressable and
> accessible. The only thing we can't do is make the sparse inode
> record whole - the inode allocation code cannot allocate another 4
> blocks that span beyond the end of the AG. Hence this inode record
> and cluster remain sparse until all the inodes in it are freed and
> the cluster removed from disk.
> 
> The second bug is that we don't consider inodes beyond inode cluster
> alignment at the end of an AG as being valid. When sparse inode
> alignment is in use, we set the in-memory inode cluster alignment to
> match the inode chunk alignment, and so the maximum valid inode
> number is inode chunk aligned, not inode cluster aligned. Hence when
> we have an inode cluster at the end of the AG - so the max inode
> number is cluster aligned - we reject that entire cluster as being
> invalid. 
> 
> As stated above, there is nothing corrupt about the sparse inode
> cluster at the end of the AG, it just doesn't match an arbitrary
> alignment validation restriction for inodes at the end of the AG.
> Given we have production filesystems out there with sparse inode
> clusters allocated with cluster alignment at the end of the AG, we
> need to consider these inodes as valid and not error out with a
> corruption report.  The second patch addresses this.
> 
> The third issue I found is that we never validate the
> sb->sb_spino_align valid when we mount the filesystem. It could have
> any value and we just blindly use it when calculating inode
> allocation geometry. The third patch adds sb->sb_spino_align range
> validation to the superblock verifier.
> 
> There is one question that needs to be resolved in this patchset: if
> we take patch 2 to allow sparse inodes at the end of the AG, why
> would we need the change in patch 1? Indeed, at this point I have to
> ask why we even need the min/max agbno guidelines to the inode chunk
> allocation as we end up allowing any aligned location in the AG to
> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> unnecessary and now we can remove a bunch of code (min/max_agbno
> constraints) from the allocator paths...

Yeah, I think we can only adjust the codebase to allow chunks that cross
EOAG as long as the clusters don't.

> I'd prefer that we take the latter path: ignore the first patch.
> This results in more flexible behaviour, allows existing filesystems
> with this issue to work without needing xfs_repair to fix them, and
> we get to remove complexity from the code.

Do xfs_repair/scrub trip over these sparse chunks that cross EOAG,
or are they ok?

--D

> Thoughts?
>