On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > Since xfs_efi_item_recover is only performing one step of what could be > a chain of deferred updates, it never rolls the transaction that it > creates. It therefore only requires the amount of grant space that > you'd get with tr_logcount == 1. It is therefore a bit silly that we > ask for more than that, and in bad cases like this, hang log recovery > needlessly. But this doesn't fix the whatever problem lead to the recovery not having the same full tr_itruncate reservation available as was held by the transaction that logged the EFI and was running the extent free at the time the system crashed. There should -always- be enough transaction reservation space in the journal to reserve space for an intent replay if the intent recovery reservation uses the same reservation type as runtime. Hence I think this is still just a band-aid over whatever went wrong at runtime that lead to the log not having enough space for a reservation that was clearly held at runtime and hadn't yet used. > Which is exactly what you theorized above. Ok, I'm starting to be > convinced... :) I'm not. :/ > I wonder, if you add this to the variable declarations in > xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about > UEK5 here): > > struct xfs_trans_resv resv = M_RES(mp)->tr_itruncate; > > and then change the xfs_trans_alloc call to: > > resv.tr_logcount = 1; > error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); > > Does that solve the problem? It might fix this specific case given the state of the log at the time the system crashed. However, it doesn't help the general case where whatever went wrong at runtime results in there being less than a single tr_itruncate reservation unit available in the log. One of the recent RH custom cases I looked at had much less than a single tr_itruncate unit reservation when it came to recovering the EFI, so I'm not just making this up. I really don't think this problem is not solvable at recovery time. if the runtime hang/crash leaves the EFI pinning the tail of the log and something has allowed the log to be overfull (i.e. log space used plus granted EFI reservation > physical log space), then we may not have any space at all for any sort of reservation during recovery. I suspect that the cause of this recovery issue is the we require an overcommit of the log space accounting before we throttle incoming transaction reservations (i.e. the new reservation has to overrun before we throttle). I think that the reservation accounting overrun detection can race to the first item being placed on the wait list, which might allow transactions that should be throttled to escape resulting in a temporary log space overcommit. IF we crash at jsut the wrong time and an intent-in-progress pins the tail of the log, the overcommit situation can lead to recovery not having enough space for intent recovery reservations... This subtle overcommit is one of the issues I have been trying to correct with the byte-based grant head accounting patches (which I'm -finally- getting back to). I know that replaying the log from the metadump that repoduces the hang with the byte-based grant head accounting patchset allowed log recovery to succeed. It has a different concept of where the head and tail are and hence how much log space is actually available at any given time, and that difference was just enough to allow a tr_itruncate reservation to succced. It also has different reservation grant overrun detection logic, but I'm not 100% sure if it solves this underlying runtime overcommit problem or not yet... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx