Re: Question: reserve log space at IO time for recover

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Fri, 21 Jul 2023 19:36:03 +0000

FYI:

I am able reproduce the XFS mount hang issue with hacked kernels based on
both 4.14.35 kernel or 6.4.0 kernel.

The hacking patch is like this (based on 6.4.0 kernel):

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f9e36b810663..e04e8738c564 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -491,6 +491,12 @@ xfs_extent_free_finish_item(
    struct xfs_extent_free_item *xefi;
    int             error;

+   if (strcmp(tp->t_mountp->m_super->s_id,"loop0") == 0) {
+       pr_err("loop0 free extent enter sleeping 5 hours\n");
+       msleep(3600*1000*5);
+       pr_err("loop0 free extent end sleeping 5 hours\n");
+   }
+
    xefi = container_of(item, struct xfs_extent_free_item, xefi_list);

    error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d4109af193e..b571fe6966af 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -557,6 +557,13 @@ xfsaild_push(
    xfs_trans_ail_cursor_done(&cur);
    spin_unlock(&ailp->ail_lock);

+   if (strcmp(ailp->ail_log->l_mp->m_super->s_id,"loop0") == 0) {
+       pr_err("loop0 aild enter sleeping 5 hours\n");
+       msleep(3600*1000*5); // sleep 5 hours
+       pr_err("loop0 aild end sleeping 5 hours\n");
+       pr_err("loop0 start issuing buffer IO\n");
+   }
+
    if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
        ailp->ail_log_flush++;

The hacking patch does two things:
1. sleeps 5 hours in the ext freeing deferred op before reserving log space.
   That simulates very slow extent free processing
2. sleeps 5 hours before flushing metadata buffers. That simulate very slow
   AIL processing.
Though the hacking patch sleeps 5 hours, it actually needs only 10 minutes
to reproduce this issue.

Reproduce steps:

1. create a XFS with 10MiB log size (small so easier to reproduce). The following
   steps all aim at this XFS volume.

2. with un-hacked kernel, create 100K files, 1 block each file, under a dedicated
   directory. We will overwrite those files causing inode timestamp logging.
   A timestamp logging (tr_fsyncts) requires/resevers 2508 bytes, which is less
   than the required size for 360416 (tr_itrunc) for 4.14.35 kernel. Not able to get
   the exact number for 6.4.0, but should be similar. With those timestamp updating,
    we are able to make the free log size less than 360416.

3. With hacked kernel, create a new file, allocating block(s) to it and remove it.
   Because of the hacking patch, EFI would be logged but the EFD will not in 5 hours.
   We will panic the kernel in 5 hours so that with next mount, EFI recover would
   happen.

4. with another console, overwrite serially that 100K files we created in step2.
   This get stuck after overwritten 57K+ files. That's because we used up (almost)
   all the free log space and we pinned log tail position with the hacking patch.
   Checking vmcore, It is stuck waiting for 2508 log bytes (1552 free bytes available).
   The numbers are exactly same with 4.14.35 kernel and 6.4.0 kernel.

5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35
   kernel and 6.4.0 kernel. It’s not same as the free log space as seen in vmcore,
   1552 bytes. I am not able to figure oute why they are different yet. The delta aims
   to make log recover safe?

6. panic the kernel manually and reboot to un-hacked kernel. Verify the on disk
   free log space, it's the same as seen in step 5.

7. with un-hacked kernel, mount that xfs volume, it hang with both 4.14.35 kernel
   and 6.4.0 kernel. Checking vmcore, now the free log bytes now is the same as on
   disk free log bytes we found in step 6. That's for 4.14.25 kernel. I am not able
   to check the numbers for 6.4.0 kernel, but think it's same as seen in 4.14.35
   kernel. So reproduced!!

Checking back to the hacking patch, I am simulating very slow buffer flushing and
extent freeing. I'd think its possible that the same things can get slow somehow
in real production cases. So the hacking patch is reasonable.

When log free bytes runs out and a kernel panic or manual panic/reboot, like
  echo c/b >/proc/sysrq-trigger
happened, it's possible that even 6.4.0 kernel can run into the same sitiation as
seen in 4.14.35 kernel. So 6.4.0 kernel needs fix too.

I will look more on this.

thanks,
wengang

> On Jul 18, 2023, at 11:25 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Tue, Jul 18, 2023 at 06:44:13PM -0700, Darrick J. Wong wrote:
>> On Wed, Jul 19, 2023 at 10:11:03AM +1000, Dave Chinner wrote:
>>> On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote:
>>>> Hi,
>>>> 
>>>> I have a XFS metadump (was running with 4.14.35 plussing some back ported patches),
>>>> mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk
>>>> free journal space, while the transaction needs to reserve 360416 bytes to start the recovery.
>>>> Thus the mount hangs for ever.
>>> 
>>> Most likely something went wrong at runtime on the 4.14.35 kernel
>>> prior to the crash, leaving the on-disk state in an impossible to
>>> recover state. Likely an accounting leak in a transaction
>>> reservation somewhere, likely in passing the space used from the
>>> transaction to the CIL. We've had bugs in this area before, they
>>> eventually manifest in log hangs like this either at runtime or
>>> during recovery...
>>> 
>>>> That happens with 4.14.35 kernel and also upstream
>>>> kernel (6.4.0).
>>> 
>>> Upgrading the kernel won't fix recovery - it is likely that the
>>> journal state on disk is invalid and so the mount cannot complete 
>> 
>> Hmm.  It'd be nice to know what the kernel thought it was doing when it
>> went down.
>> 
>> /me wonders if this has anything to do with the EFI recovery creating a
>> transaction with tr_itruncate reservation because the log itself doesn't
>> record the reservations of the active transactions.
> 
> Possibly - it's been that way since 1994 but I don't recall it ever
> causing any issues in the past. That's not to say it's correct - I
> think it's wrong, but I think the whole transaction reservation
> calculation infrastructure needs a complete overhaul....
> 
>> 
>> <begin handwaving>
>> 
>> Let's say you have a 1000K log, a tr_write reservation is 100k, and a
>> tr_itruncate reservations are 300k.  In this case, you could
>> theoretically have 10x tr_write transactions running concurrently; or
>> you could have 3x tr_itruncate transactions running concurrently.
>> 
>> Now let's say that someone fires up 10 programs that try to fpunch 10
>> separate files.  Those ten threads will consume all the log grant space,
>> unmap a block, and log an EFI. I think in reality tr_logcount means
>> that 5 threads each consume (2*100k) grant space, but the point here is
>> that we've used up all the log grant space.
>> 
>> Then crash the system, having committed the first transaction of the
>> two-transaction chain.
>> 
>> Upon recovery, we'll find the 10x unfinished EFIs and pass them to EFI
>> recovery.  However, recovery creates a separate tr_itruncate transaction
>> to finish each EFI.  Now do we have a problem because the required log
>> grant space is 300k * 10 = 3000k?
> 
> Hmmmm. That smells wrong. Can't put my finger on it .....
> 
> 
> .... ah. Yeah. That.
> 
> We only run one transaction at a time, and we commit the transaction
> after logging new intents and capturing the work that remains. So we
> return the unused part of the reservation (most of it) back to the
> log before we try to recover the next intent in the AIL.
> 
> Hence we don't need (300k * 10) in the log to recover these EFIs as
> we don't hold all ten reservations at the same time (as we would
> have at runtime) - we need (log space used by recovery of intents +
> one reservation) to recover them all.
> 
> Once we've replayed all the intents from the AIL and converted them
> into newly captured intents, they are removed from the AIL and that
> moves the tail of the log forwards. This frees up the entire of the
> log, and we then run the captured intents that still need to be
> processed. We run them one at a time to completion, committing them
> as we go, so again we only need space in the log for a single
> transaction reservation to complete recovery of the intent chaings.
> 
> IOWs, because recovery of intents is single threaded, we only need
> to preserve space in the log for a single reservation to make
> forwards progress.
> 
>> It's late and I don't remember how recovery for non-intent items works
>> quite well enough to think that scenario adds up.  Maybe it was the case
>> that before the system went down, the log had used 800K of the grant
>> space for logged buffers and 100K for a single EFI logged in a tr_write
>> transaction.  Then we crashed, reloaded the 800K of stuff, and now we're
>> trying to allocate 300K for a tr_itruncate to restart the EFI, but
>> there's not enough log grant space?
> 
> Possible, if the EFI pins the tail of the log and the rest of the
> log is full. I've never seen that happen, and we've been using
> itrunc reservations in recovery since 1994, but that doesn't mean it
> can't happen.
> 
> FWIW, I don't think this is EFI specific. BUI recovery use itrunc
> reservations, but what if it was an unlink operation using a remove
> reservation to free a directory block that logs a BUI? Same problem,
> different vector, right?
> 
> I suspect what we really need is for all the intent processing to be
> restartable. We already have those for RUIs to relog/restart them
> there isn't enough reservation space available to complete
> processing the current RUI. We just made EFIs restartable to avoid
> busy extent deadlocks, we can easily extend that the same "enough
> reservation available" as we use for RUIs, etc. Do the same for BUIs
> and RUIs, and then...
> 
> ... we can set up a reservation calculation for each intent type,
> and the reservation needed for a given chain of operations is the
> max of all the steps in the chain. Hence if we get part way through
> a chain and run out of reservation, we can restart the chain the
> the reservation we know is large enough to complete the remaining
> part of the chain.
> 
> The new reservation may be smaller than the reservation that was
> held when we start the intent processing (because it's a rolling
> chain with an inherited log ticket), but this guarantees that we can
> reduce the reservation to the minimum required at any point in
> time if we are running low on log space....
> 
> This also gets around the problem of having to reserve enough space
> for N operations (e.g. 4 extents in an EFI) when the vast majority
> only use and need space for 1 operation. If we get an EFI with N
> extents in it, we can try a reservation for (N * efi_res) and if we
> can't get that we could just use efi_res and work through the EFI
> one extent at a time....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx