Re: [PATCH 2/5] mkfs: don't let internal logs consume more than 95% of an AG

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 31 Mar 2022 09:20:53 -0700

On Wed, Mar 30, 2022 at 07:21:36PM -1000, Eric Sandeen wrote:
> On 3/16/22 8:50 AM, Eric Sandeen wrote:
> > On 3/15/22 6:23 PM, Darrick J. Wong wrote:
> >> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> >>
> >> Currently, we don't let an internal log consume every last block in an
> >> AG.  According to the comment, we're doing this to avoid tripping AGF
> >> verifiers if freeblks==0, but on a modern filesystem this isn't
> >> sufficient to avoid problems.  First, the per-AG reservations for
> >> reflink and rmap claim up to about 1.7% of each AG for btree expansion,
> > 
> > Hm, will that be a factor if the log consumes every last block in that
> > AG? Or is the problem that if we consume "most" blocks, that leaves the
> > possibility of reflink/rmap btree expansion subsequently failing because
> > we do have a little room for new allocations in that AG?
> > 
> > Or is it a problem right out of the gate because the per-ag reservations
> > collide with a maximal log before the filesystem is even in use?
> 
> Darrick, any comment on this? What did you actually run into that prompted
> this change?

Oops, sorry, forgot to reply to this.

The per-AG reservations shouldn't be a problem on a modern kernel
because we assume an internal log never moves or grows, so we only need
to account for rmap/refcount btree expansions to cover space used
elsewhere in the AG.  That said, *one* block is cutting it really close.

Also, earlier kernels (4.9 era) weren't smart about that, though 4.9 is
dead according to gregkh and any kernel that says rmap/reflink are
experimental should not be used.

The problem that I saw is that you could trick the default calculations
into making the log to consume so much space that either (a) the root
directory gets allocated in the next AG, or if you were <cough>
formatting a single AG then mkfs just exits with a cryptic message about
ENOSPC.

It's fairly difficult to trigger this for a plain old block device, but
if mkfs thinks the disk has a gigantic stripe unit, it'll go around
putting the root directory at a much higher block number than is usual.
For the most part the "max_ag_blocks - 1" was actually fine, but not so
much for the case where the AG size is close to 64MB.

So in the end the 5% figure is still rather handwavy wormcanning.

> Still bugs me a little that a manually-sized log escapes this limit, and if
> it's needed for proper functioning, we should probably enforce it everywhere.

<nod> I guess the same limits ought to be applied to explicit logsize to
improve the "Well if it hurts don't do that!" experience. ;)

> I do understand that the existing code only validates auto-sized logs. But
> I don't want to sweep this under the rug, even if we choose to not fix it all
> right now.
> 
> Mostly looking for clarification on the what fails and how, with the current
> code.

<nod> I'll try to add a sample mkfs failure to the commit message.

--D

> 
> Thanks,
> -Eric