Re: Question: reserve log space at IO time for recover

> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
>> FYI:
>> 
>> I am able to reproduce the XFS mount hang issue with hacked kernels based on
>> both the 4.14.35 and 6.4.0 kernels.
>> 
>> The hacking patch is like this (based on 6.4.0 kernel):
>> 
>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>> index f9e36b810663..e04e8738c564 100644
>> --- a/fs/xfs/xfs_extfree_item.c
>> +++ b/fs/xfs/xfs_extfree_item.c
>> @@ -491,6 +491,12 @@ xfs_extent_free_finish_item(
>>    struct xfs_extent_free_item *xefi;
>>    int             error;
>> 
>> +   if (strcmp(tp->t_mountp->m_super->s_id,"loop0") == 0) {
>> +       pr_err("loop0 free extent enter sleeping 5 hours\n");
>> +       msleep(3600*1000*5);
>> +       pr_err("loop0 free extent end sleeping 5 hours\n");
>> +   }
>>    xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
>> 
>>    error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
>> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
>> index 7d4109af193e..b571fe6966af 100644
>> --- a/fs/xfs/xfs_trans_ail.c
>> +++ b/fs/xfs/xfs_trans_ail.c
>> @@ -557,6 +557,13 @@ xfsaild_push(
>>    xfs_trans_ail_cursor_done(&cur);
>>    spin_unlock(&ailp->ail_lock);
>> 
>> +   if (strcmp(ailp->ail_log->l_mp->m_super->s_id,"loop0") == 0) {
>> +       pr_err("loop0 aild enter sleeping 5 hours\n");
>> +       msleep(3600*1000*5); // sleep 5 hours
>> +       pr_err("loop0 aild end sleeping 5 hours\n");
>> +       pr_err("loop0 start issuing buffer IO\n");
>> +   }
>> +
>>    if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
>>        ailp->ail_log_flush++;
>> 
>> The hacking patch does two things:
>> 1. sleeps 5 hours in the ext freeing deferred op before reserving log space.
> 
> No it doesn't. The delay you added above occurs *after* the
> transaction has already guaranteed it has full log reservation for
> the extent freeing operation. You even use the transaction pointer
> to get the mount id (loop0) before stalling it....

Right, that is what I wanted to do, but I got the patch wrong. The delay should
go into the transaction roll, after the original transaction is committed and
before resources are reserved for the new one.
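
Something like this is what I intended (a rough, untested sketch paraphrasing
xfs_trans_roll() in fs/xfs/xfs_trans.c as of 6.4, not an exact diff):

	/* in xfs_trans_roll(), after duplicating the transaction: */
	*tpp = xfs_trans_dup(trans);

	/* commit the original transaction */
	error = __xfs_trans_commit(trans, true);
	if (error)
		return error;

	/*
	 * HACK: stall here, between committing the old transaction and
	 * reserving log space for the duplicated one, which is the window
	 * I actually meant to test.
	 */
	if (strcmp((*tpp)->t_mountp->m_super->s_id, "loop0") == 0)
		msleep(3600 * 1000 * 5);	/* 5 hours, needs <linux/delay.h> */

	/* reserve log space for the new (duplicated) transaction */
	tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
	return xfs_trans_reserve(*tpp, &tres, 0, 0);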
 
> 
>>   That simulates very slow extent free processing
> 
>> 2. sleeps 5 hours before flushing metadata buffers. That simulates very slow
>>    AIL processing.
> 
> Well, it actually simulates *completely stalled* AIL processing.
> If it was simulating slow processing, then the tail of the log would
> always still be moving forwards slowly. Stalling metadata writeback
> for 5 hours is effectively hanging the journal for 5 hours, not
> making it "go slow". Those are very different situations that result
> in very different behaviours.
> 
>> Though the hacking patch sleeps 5 hours, it actually needs only 10 minutes
>> to reproduce this issue.
> 
> You should be able to fully stall the log and filesystem in under a
> second with this patch.
> 

The script allocates a transaction, reserves its resources, commits it, and
then starts the next round. I didn’t want any transactions left pending
(resources reserved but not yet committed) when the free log bytes run out;
if enough pending transactions sit in memory, the on-disk free log space
could still be enough for EFI recovery. That’s why the script doesn’t burn
through (almost) all of the free log space that quickly. The script can
certainly be improved, but it already demonstrates the problem.
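
For reference, each overwrite in step 4 boils down to a short
reserve/log/commit cycle in the kernel, roughly the xfs_vn_update_time()
pattern (a simplified sketch from my reading of the code; details differ
between 4.14.35 and 6.4.0):

	/* reserve log space (the 2508 bytes mentioned below) for a timestamp update */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
	if (error)
		return error;

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	/* ... update the in-core timestamps ... */
	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
	xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);

	/* committed immediately, so nothing is left pending between two files */
	return xfs_trans_commit(tp);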
 

>> Reproduce steps:
>> 
>> 1. create an XFS filesystem with a 10MiB log (small, so it's easier to reproduce).
>>   The following steps all operate on this XFS volume.
> 
> Actually, make that a few milliseconds.... :)
> 

:)

> mkfs/xfs_info output would be appreciated.

sure,
# xfs_info 20GB.bk2
meta-data=20GB.bk2               isize=256    agcount=4, agsize=1310720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


> 
>> 2. with the un-hacked kernel, create 100K files, 1 block each, under a dedicated
>>   directory. We will overwrite those files, causing inode timestamp logging.
> 
> Ok, that's step 4, right? You're not doing that here?

Yes, that’s step 4.

> 
>>   A timestamp logging transaction (tr_fsyncts) requires/reserves 2508 bytes, which
>>   is less than the 360416 bytes required for tr_itrunc on the 4.14.35 kernel. I was
>>   not able to get the exact number for 6.4.0, but it should be similar. With those
>>   timestamp updates we are able to push the free log space below 360416 bytes.
>> 
>> 3. With the hacked kernel, create a new file, allocate block(s) to it, and remove it.
>>   Because of the hacking patch, the EFI is logged but the EFD will not be for 5 hours.
>>   We then panic the kernel within those 5 hours so that EFI recovery happens on the
>>   next mount.
> 
> Ok, so one EFI at the tail of the log, with a full reserve grant
> head and write grant head reservation for the extent free
> transaction that has been blocked.

Yes, with the current patch (though I didn’t want the reservation).

> 
>> 4. with another console, serially overwrite the 100K files we created in step 2.
>>   This gets stuck after overwriting 57K+ files. That's because we used up (almost)
>>   all the free log space and pinned the log tail position with the hacking patch.
>> 
>>   Checking the vmcore, it is stuck waiting for 2508 log bytes (1552 free bytes
>>   available). The numbers are exactly the same on the 4.14.35 and 6.4.0 kernels.
> 
> Yup, and the reserve grant head should also include at least one
> full tr_itrunc reservation for the EFI in progress.

Yes (though I didn’t want that tr_itrunc reservation).

> 
>> 
>> 5. Checking the free log space left on disk, it’s 181760 bytes for both the
>>   4.14.35 and 6.4.0 kernels.
> 
> Which is clearly wrong. It should be at least 360416 bytes (i.e
> tr_itrunc), because that's what the EFI being processed that pins
> the tail of the log is supposed to have reserved when it was
> stalled.

Yep, exactly.

> So where has the ~180kB of leaked space come from?
> 
> Have you traced the grant head reservations to find out
> what the runtime log space and grant head reservations actually are?

I have the numbers from the vmcore (ignore the WARNs):

  l_reserve_head
    head position:   cycle: 8 bytes: 2978800
    tail position:   cycle: 7 bytes: 2980352
    lastSync pos :   cycle: 8 bytes: 2797568
    available_size with head: 1552 bytes
    1 log space reservers waiting
DEBUG: *(struct xlog_ticket *)0xffff9ba296dc68a0 = {
        .t_queue = (struct list_head){
                .next = (struct list_head *)0xffff9ba2cfaabd88,
                .prev = (struct list_head *)0xffff9ba2cfaabd88,
        },
        .t_task = (struct task_struct *)0xffff9ba280fc4180,
        .t_tid = (xlog_tid_t)1461406111,
        .t_ref = (atomic_t){
                .counter = (int)1,
        },
        .t_curr_res = (int)2508,
        .t_unit_res = (int)2508,
        .t_ocnt = (char)0,
        .t_cnt = (char)0,
        .t_flags = (uint16_t)0,
        .t_iclog_hdrs = (int)1,
}
      process 126306 needs: 2508 bytes
    previous waiters need: 2508 bytes
WARN: Pending transactions not checked (to save time)
    0 log bytes reserved for active transactions (this FS)
    Reserved and to be reserved less than log size, seems xlog.l_tail_lsn not updated some how
  l_write_head
    head position:   cycle: 8 bytes: 2978800
    tail position:   cycle: 7 bytes: 2980352
    lastSync pos :   cycle: 8 bytes: 2797568
    available_size with head: 1552 bytes
    0 log space reservers waiting
WARN: Pending transactions not checked (to save time)
    0 log bytes reserved for active transactions (this FS)
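
If I read xlog_space_left() correctly, the 1552 bytes above follows directly
from those head/tail positions when the tail is one cycle behind the grant
head (a simplified sketch of the arithmetic, not the exact kernel code):

	/* numbers taken from the l_reserve_head dump above */
	int	log_size   = 2560 * 4096;		/* 10 MiB internal log */
	int	head_cycle = 8, head_bytes = 2978800;	/* reserve grant head */
	int	tail_cycle = 7, tail_bytes = 2980352;	/* log tail */
	int	free_bytes;

	if (tail_cycle == head_cycle)
		free_bytes = log_size - (head_bytes - tail_bytes);
	else	/* tail exactly one cycle behind the head */
		free_bytes = tail_bytes - head_bytes;	/* 2980352 - 2978800 = 1552 */

So the grant head has wrapped and is sitting just 1552 bytes short of the
pinned tail.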

> 
> i.e. we have full tracing of the log reservation accounting via
> tracepoints in the kernel. If there is a leak occurring, you need to
> capture a trace of all the reservation accounting operations and
> post process the output to find out what operation is leaking
> reserved space. e.g.
> 
> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
> ....
> # trace-cmd report > s.t
> # head -3 s.t
> cpus=16
>          touch-289000 [008] 430907.633820: xfs_log_reserve:      dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
>          touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> 
> #
> 
> So this tells us the transaction reservation unit size, the count of
> reservations, the current reserve and grant head locations, and the
> current head and tail of the log at the time the transaction
> reservation is started and then after it completes.

Will do that and report back. Do you want the full log or only some
representative entries? The full log would be big; how shall I share it?

> 
> Hence we can tell if the reservation moved the grant head correctly,
> whether there was space available prior to the reservation being
> made, whether the grant heads contain in-memory reservations or
> whether they are empty (i.e. match the log tail), etc. 
> 
> All cases where we change the reservation and the grant heads are
> also captured by the above trace-cmd event filter, so a full picture
> of the grant head reservations vs log head/tail can be determined
> from post-processing the events....
> 
> This is how I've found log space leaks in the past. It's entirely
> possible that we are leaking a couple of bytes somewhere in the
> accounting, and that's the cause of the issue. It's happened before,
> and the accounting is complex enough it will most likely happen
> again...
> 
>>   It’s not the same as the free log space seen in the vmcore, 1552 bytes.
>>   I am not able to figure out why they are different yet. Does the delta
>>   aim to make log recovery safe?
> 
> Because different runtime conditions lead to different free log space
> state.
> 
>> 6. panic the kernel manually and reboot into the un-hacked kernel. Verify the
>>   on-disk free log space; it's the same as seen in step 5.
> 
> Right, so something has gone wrong with the runtime accounting, and
> log recovery is just the messenger. We need to find out what went
> wrong at runtime, not hack about in recovery trying to handle a
> situation that should not be happening in the first place...
> 

Yes, agreed. (I was not, and am not, trying to fix this in recovery.)

thanks,
wengang




