Re: Question: reserve log space at IO time for recover

Here is what things look to be:

For file deletion, log bytes are reserved based on xfs_mount->tr_itruncate, which is:
{
    tr_logres = 175488,
    tr_logcount = 2,
    tr_logflags = XFS_TRANS_PERM_LOG_RES,
}  
As you can see, it's a permanent log reservation with a log count of two (two transactions in rolling mode).
After calculation (xlog_calc_unit_res(), which adds space for various log headers), the
log space needed per transaction grows from 175488 to 180208 bytes, so the total
log space needed is 360416 bytes (180208 * 2).
That log space (360416 bytes) needs to be reserved both for run-time inode removal
(xfs_inactive_truncate()) and for EFI recovery (xfs_efi_item_recover()).
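In other words, the arithmetic is roughly (a sketch only; the exact header overhead
comes from xlog_calc_unit_res(), whose details vary by kernel version):

        /* per-transaction unit reservation: tr_logres plus log/op header overhead */
        unit_res  = 175488 + 4720;    /* -> 180208 bytes, per xlog_calc_unit_res() */
        /* a permanent reservation grants unit_res for each of the tr_logcount counts */
        total_res = unit_res * 2;     /* tr_logcount == 2 -> 360416 bytes */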

RUNTIME
========

For run-time inode removal, the first transaction is mainly used for inode field changes.
The second transaction is used for the intents, including extent freeing.
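Roughly, the run-time path looks like this (paraphrased from xfs_inactive_truncate();
not verbatim, and details differ across kernel versions):

        /* takes the full tr_itruncate (permanent, log count 2) reservation */
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
        xfs_ilock(ip, XFS_ILOCK_EXCL);
        xfs_trans_ijoin(tp, ip, 0);
        /* logs inode changes and queues extent-free (EFI) intents as deferred ops */
        error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK, 0);
        /* deferred ops are finished by rolling into the follow-on transaction(s) */
        error = xfs_trans_commit(tp);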

The first transaction has 180208 reserved log bytes (with another 180208 bytes reserved
for the coming transaction).
When the first transaction is committed, it writes some bytes to the log and releases the remaining reserved bytes.

Now we have the second transaction, which also has 180208 log bytes reserved. The second
transaction is supposed to process the intents, including extent freeing. With my hacking patch,
I blocked extent freeing for 5 hours. So during those 5 hours, 180208 (NOT 360416) log bytes are reserved.

With my test case, other transactions (timestamp updates) then happen. As my hacking patch
pins the journal tail, those timestamp-updating transactions eventually use up (almost) all the
remaining available log space (both in memory and on disk). Finally, the on-disk (and in-memory)
available log space drops to close to 180208 bytes; those 180208 bytes are the ones reserved by
the second (extent-free) transaction above.

Then I panic the kernel and remount the XFS volume.

LOG RECOVERY
============

During log recovery, EFI recovery uses tr_itruncate again to reserve two transactions, which needs
360416 log bytes. Reserving 360416 bytes fails (blocks) because only about 180208 bytes are now available.
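For reference, the reservation in EFI recovery looks roughly like this (paraphrased from
xfs_efi_item_recover() in fs/xfs/xfs_extfree_item.c; exact code differs across kernel versions):

        /* reserves the full two-count, permanent tr_itruncate reservation */
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
        if (error)
                return error;
        /* ... then frees each extent recorded in the EFI and commits */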

THINKING
========
Actually, during EFI recovery we only need one transaction to free the extents, just like the 2nd
transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have
(a bit) more than 180208 available log bytes on disk, so the reservation succeeds and the recovery proceeds.
That is to say, we can fix the log recovery part to fix the issue. We could introduce a new xfs_trans_res,
xfs_mount->tr_ext_free:
{
    tr_logres = 175488,
    tr_logcount = 0,
    tr_logflags = 0,
}
and use tr_ext_free instead of tr_itruncate in EFI recovery. (I didn't try it yet.)
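A minimal sketch of that idea (untested; tr_ext_free is a made-up name, the field would also
need adding to struct xfs_trans_resv, and the sizing just reuses the existing itruncate
calculation so it matches the 175488 figure above):

        /* in xfs_trans_resv_calc(): a single, non-permanent reservation */
        resp->tr_ext_free.tr_logres = xfs_calc_itruncate_reservation(mp);
        /* tr_logcount and tr_logflags stay 0, like other non-rolling reservations */

        /* in xfs_efi_item_recover(), instead of tr_itruncate */
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ext_free, 0, 0, 0, &tp);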

thanks,
wengang

> On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
> 
> 
> 
>> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> 
>> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote:
>>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
>>>>> FYI:
>>>>> 
>>>>> I am able to reproduce the XFS mount hang issue with hacked kernels based on
>>>>> both 4.14.35 kernel or 6.4.0 kernel.
>>>>> Reproduce steps:
>>>>> 
>>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following
>>>>> steps all aim at this XFS volume.
>>>> 
>>>> Actually, make that a few milliseconds.... :)
>>> 
>>> :)
>>> 
>>>> mkfs/xfs_info output would be appreciated.
>>> 
>>> sure,
>>> # xfs_info 20GB.bk2
>>> meta-data=20GB.bk2               isize=256    agcount=4, agsize=1310720 blks
>>>        =                       sectsz=512   attr=2, projid32bit=1
>>>        =                       crc=0        finobt=0, sparse=0, rmapbt=0
>>>        =                       reflink=0
>> 
>> Hmmmm. Why are you only testing v4 filesystems? They are deprecated
>> and support is largely due to be dropped from upstream in 2025...
>> 
> 
> Ha, it just happened to be so.
> I was trying to reproduce it in the same environment as customer has —
> that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format
> v4 filesystems.  I created the xfs image in a file on OL7 and copied the image
> file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here.
> 
>> Does the same problem occur with a v5 filesystems?
> 
> Will test and report back.
> 
>> 
>>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35
>>>>> kernel and 6.4.0 kernel.
>>>> 
>>>> Which is clearly wrong. It should be at least 360416 bytes (i.e.
>>>> tr_itrunc), because that's what the EFI being processed that pins
>>>> the tail of the log is supposed to have reserved when it was
>>>> stalled.
>>> 
>>> Yep, exactly.
>>> 
>>>> So where has the ~180kB of leaked space come from?
>>>> 
>>>> Have you traced the grant head reservations to find out
>>>> what the runtime log space and grant head reservations actually are?
>>> I have the numbers in vmcore (ignore the WARNs),
>> 
>> That's not what I'm asking. You've dumped the values at the time of
>> the hang, not traced the runtime reservations that have been made.
>> 
>>>> i.e. we have full tracing of the log reservation accounting via
>>>> tracepoints in the kernel. If there is a leak occurring, you need to
>>>> capture a trace of all the reservation accounting operations and
>>>> post process the output to find out what operation is leaking
>>>> reserved space. e.g.
>>>> 
>>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
>>>> ....
>>>> # trace-cmd report > s.t
>>>> # head -3 s.t
>>>> cpus=16
>>>>        touch-289000 [008] 430907.633820: xfs_log_reserve:      dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
>>>>        touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
>>>> 
>>>> #
>>>> 
>>>> So this tells us the transaction reservation unit size, the count of
>>>> reservations, the current reserve and grant head locations, and the
>>>> current head and tail of the log at the time the transaction
>>>> reservation is started and then after it completes.
>>> 
>>> Will do that and report back. You want full log or only some typical
>>> ones? Full log would be big, how shall I share? 
>> 
>> I don't want to see the log. It'll be huge - I regularly generate
>> traces containing gigabytes of log accounting traces like this from
>> a single workload.
>> 
>> What I'm asking you to do is run the tracing and then post process
>> the values from the trace to determine what operation is using more
>> space than is being freed back to the log.
>> 
>> I generally do this with grep, awk and sed. some people use python
>> or perl. But either way it's a *lot* of work - in the past I have
>> spent _weeks_ on trace analysis to find a 4 byte leak in the log
>> space accounting. Doing things like graphing the head, tail and grant
>> spaces over time tend to show if this is a gradual leak versus a
>> sudden step change. If it's a sudden step change, then you can
>> isolate it in the trace and work out what happened. If it's a
>> gradual change, then you need to start looking for accounting
>> discrepancies...
>> 
>> e.g. a transaction records 32 bytes used in the item, so it releases
>> t_unit - 32 bytes at commit. However, the CIL may then only track 28
>> bytes of space for the item in the journal and we leak 4 bytes of
>> reservation on every one of those items committed.
>> 
>> These sorts of leaks typically only add up to being something
>> significant in situations where the log is flooded with tiny inode
>> timestamp changes - 4 bytes per item doesn't really matter when you
>> only have a few thousand items in the log, but when you have
>> hundreds of thousands of tiny items in the log...
>> 
> 
> OK. will work more on this.
> # I am going to start a two-week vacation, and will then continue on this when back.
> 
> thanks,
> wengang
