Re: [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs

Dave Chinner <david@xxxxxxxxxxxxx> · Sun, 27 Oct 2024 08:47:34 +1100

On Fri, Oct 25, 2024 at 03:19:19PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 25, 2024 at 05:43:41PM +1100, Dave Chinner wrote:
> > On Thu, Oct 24, 2024 at 10:00:38AM -0700, Darrick J. Wong wrote:
> > > On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > 
> > > > Due to the failure to correctly limit sparse inode chunk allocation
> > > > in runt AGs, we now have many production filesystems with sparse
> > > > inode chunks allocated across the end of the runt AG. xfs_repair
> > > > or a growfs is needed to fix this situation, neither of which are
> > > > particularly appealing.
> > > > 
> > > > The on disk layout from the metadump shows AG 12 as a runt that is
> > > > 1031 blocks in length and the last inode chunk allocated on disk at
> > > > agino 8192.
> > > 
> > > Does this problem also happen on non-runt AGs?
> > 
> > No. The highest agbno an inode chunk can be allocated at in a full
> > size AG is aligned by rounding down from sb_agblocks.  Hence
> > sb_agblocks can be unaligned and nothing will go wrong. The problem
> > is purely that the runt AG being shorter than sb_agblocks and so
> > this highest agbno allocation guard is set beyond the end of the
> > AG...
> 
> Ah, right, and we don't want sparse inode chunks to cross EOAG because
> then you'd have a chunk whose clusters would cross into the next AG, at
> least in the linear LBA space.  That's why (for sparse inode fses) it
> makes sense that we want to round last_agino down by the chunk for
> non-last AGs, and round it down by only the cluster for the last AG.
> 
> Waitaminute, what if the last AG is less than a chunk but more than a
> cluster's worth of blocks short of sb_agblocks?  Or what if sb_agblocks
> doesn't align with a chunk boundary?  I think the new code:
> 
> 	if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
> 		end_align = mp->m_sb.sb_spino_align;
> 	else
> 		end_align = M_IGEO(mp)->cluster_align;
> 	bno = round_down(eoag, end_align);
> 	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
> 
> will allow a sparse chunk that (erroneously) crosses sb_agblocks, right?
> Let's say sb_spino_align == 4, sb_inoalignmt == 8, sb_agcount == 2,
> sb_agblocks == 100,007, and sb_dblocks == 200,014.
> 
> For AG 0, eoag is 100007, end_align == cluster_align == 8, so bno is
> rounded down to 100000.  *last is thus set to the inode at the end of
> block 99999.
> 
> For AG 1, eoag is also 100007, but now end_align == 4.  bno is rounded
> down to 100,004.  *last is set to the inode at the end of block 100003,
> not 99999.
> 
> But now let's say we growfs another 100007 blocks onto the filesystem.
> Now we have 3x AGs, each with 100007 blocks.  But now *last for AG 1
> becomes 99999 even though we might've allocated an inode in block
> 100000 before the growfs.  That will cause a corruption error too,
> right?

Yes, I overlooked that case. Good catch.

> IOWs, don't we want something more like this?
> 
> 	/*
> 	 * The preferred inode cluster allocation size cannot ever cross
> 	 * sb_agblocks.  cluster_align is one of the following:
> 	 *
> 	 * - For sparse inodes, this is an inode chunk.
> 	 * - For aligned non-sparse inodes, this is an inode cluster.
> 	 */
> 	bno = round_down(sb_agblocks, cluster_align);
> 	if (xfs_has_sparseinodes(mp) &&
> 	    agno == mp->m_sb.sb_agcount - 1) {
> 		/*
> 		 * For a filesystem with sparse inodes, an inode chunk
> 		 * still cannot cross sb_agblocks, but it can cross eoag
> 		 * if eoag < agblocks.  Inode clusters cannot cross eoag.
> 		 */
> 		last_clus_bno = round_down(eoag, sb_spino_align);
> 		bno = min(bno, last_clus_bno);
> 	}
> 	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;

Yes, something like that is needed.

> > > If the only free space
> > > that could be turned into a sparse cluster is unaligned space at the
> > > end of AG 0, would you still get the same corruption error?
> > 
> > It will only happen if AG 0 is a runt AG, and then the same error
> > would occur. We don't currently allow single AG filesystems, nor
> > when they are set up  do we create them as a runt - the are always
> > full size. So current single AG filesystems made by mkfs won't have
> > this problem.
> 
> Hmm, do you have a quick means to simulate this last-AG unaligned
> icluster situation?

No, I haven't been able to reproduce it on demand - nothing I've
tried has specifically landed a sparse inode cluster in exactly the
right position to trigger this. I typically get ENOSPC when I think
it should trigger and it's not immediately obvious what I'm missing
in way of pre-conditions to trigger it. I've been able to test the
fixes on a metadump that has the sparse chunk already on disk
(which came from one of the production systems hitting this).

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx