Re: Question: reserve log space at IO time for recover

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 24 Aug 2023 17:28:23 +1000

On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote:
> On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote:
> 
> Since xfs_efi_item_recover is only performing one step of what could be
> a chain of deferred updates, it never rolls the transaction that it
> creates.  It therefore only requires the amount of grant space that
> you'd get with tr_logcount == 1.  It is therefore a bit silly that we
> ask for more than that, and in bad cases like this, hang log recovery
> needlessly.

But this doesn't fix the whatever problem lead to the recovery not
having the same full tr_itruncate reservation available as was held
by the transaction that logged the EFI and was running the extent
free at the time the system crashed. There should -always- be enough
transaction reservation space in the journal to reserve space for an
intent replay if the intent recovery reservation uses the same
reservation type as runtime.

Hence I think this is still just a band-aid over whatever went wrong
at runtime that lead to the log not having enough space for a
reservation that was clearly held at runtime and hadn't yet used.

> Which is exactly what you theorized above.  Ok, I'm starting to be
> convinced... :)

I'm not. :/

> I wonder, if you add this to the variable declarations in
> xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about
> UEK5 here):
> 
> 	struct xfs_trans_resv	resv = M_RES(mp)->tr_itruncate;
> 
> and then change the xfs_trans_alloc call to:
> 
> 	resv.tr_logcount = 1;
> 	error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp);
> 
> Does that solve the problem?

It might fix this specific case given the state of the log at the
time the system crashed. However, it doesn't help the general
case where whatever went wrong at runtime results in there being
less than a single tr_itruncate reservation unit available in the
log.

One of the recent RH custom cases I looked at had much less than a
single tr_itruncate unit reservation when it came to recovering the
EFI, so I'm not just making this up.

I really don't think this problem is not solvable at recovery time.
if the runtime hang/crash leaves the EFI pinning the tail of the log
and something has allowed the log to be overfull (i.e. log space
used plus granted EFI reservation > physical log space), then we may
not have any space at all for any sort of reservation during
recovery.

I suspect that the cause of this recovery issue is the we require an
overcommit of the log space accounting before we throttle incoming
transaction reservations (i.e. the new reservation has to overrun
before we throttle). I think that the reservation accounting overrun
detection can race to the first item being placed on the wait list,
which might allow transactions that should be throttled to escape
resulting in a temporary log space overcommit. IF we crash at jsut
the wrong time and an intent-in-progress pins the tail of the log,
the overcommit situation can lead to recovery not having enough
space for intent recovery reservations...

This subtle overcommit is one of the issues I have been trying to
correct with the byte-based grant head accounting patches (which I'm
-finally- getting back to). I know that replaying the log from the
metadump that repoduces the hang with the byte-based grant head
accounting patchset allowed log recovery to succeed. It has a
different concept of where the head and tail are and hence how much
log space is actually available at any given time, and that
difference was just enough to allow a tr_itruncate reservation to
succced. It also has different reservation grant overrun detection
logic, but I'm not 100% sure if it solves this underlying runtime
overcommit problem or not yet...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx