Re: Question: reserve log space at IO time for recover

> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote:
>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
>>>> FYI:
>>>> 
>>>> I am able to reproduce the XFS mount hang issue with hacked kernels based on
>>>> both the 4.14.35 kernel and the 6.4.0 kernel.
>>>> Reproduce steps:
>>>> 
>>>> 1. create an XFS filesystem with a 10MiB log size (small, so it's easier to reproduce). The following
>>>>  steps all aim at this XFS volume.
>>> 
>>> Actually, make that a few milliseconds.... :)
>> 
>> :)
>> 
>>> mkfs/xfs_info output would be appreciated.
>> 
>> sure,
>> # xfs_info 20GB.bk2
>> meta-data=20GB.bk2               isize=256    agcount=4, agsize=1310720 blks
>>         =                       sectsz=512   attr=2, projid32bit=1
>>         =                       crc=0        finobt=0, sparse=0, rmapbt=0
>>         =                       reflink=0
> 
> Hmmmm. Why are you only testing v4 filesystems? They are deprecated
> and support is largely due to be dropped from upstream in 2025...
> 

Ha, it just happened to be so.
I was trying to reproduce it in the same environment the customer has,
which is Oracle Linux 7. The default mkfs.xfs behaviour on OL7 is to format
v4 filesystems. I created the XFS image in a file on OL7 and copied the image
file to a machine running a 6.4.0 kernel. That's why you see a v4 filesystem here.
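
For reference, the image can be recreated with something along these lines (a
sketch from memory, so the exact options behind the xfs_info output above may
differ slightly):

# truncate -s 20G 20GB.bk2                    # sparse 20GiB backing file
# mkfs.xfs -f -m crc=0 -l size=10m 20GB.bk2   # v4 (crc=0) with a 10MiB log

The crc=0 is only there to mirror the OL7 default.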

> Does the same problem occur with a v5 filesystem?

Will test and report back.
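
Assuming nothing else about the setup needs to change, the plan is simply to
rebuild the same image as v5, e.g. something like:

# mkfs.xfs -f -m crc=1 -l size=10m 20GB.bk2

and then rerun the same reproduction steps against it.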

> 
>>>> 5. Checking the on-disk free log space left, it's 181760 bytes on both the 4.14.35
>>>>  kernel and the 6.4.0 kernel.
>>> 
>>> Which is clearly wrong. It should be at least 360416 bytes (i.e.
>>> tr_itrunc), because that's what the EFI being processed that pins
>>> the tail of the log is supposed to have reserved when it was
>>> stalled.
>> 
>> Yep, exactly.
>> 
>>> So where has the ~180kB of leaked space come from?
>>> 
>>> Have you traced the grant head reservations to find out
>>> what the runtime log space and grant head reservations actually are?
>> I have the numbers in vmcore (ignore the WARNs),
> 
> That's not what I'm asking. You've dumped the values at the time of
> the hang, not traced the runtime reservations that have been made.
> 
>>> i.e. we have full tracing of the log reservation accounting via
>>> tracepoints in the kernel. If there is a leak occurring, you need to
>>> capture a trace of all the reservation accounting operations and
>>> post process the output to find out what operation is leaking
>>> reserved space. e.g.
>>> 
>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
>>> ....
>>> # trace-cmd report > s.t
>>> # head -3 s.t
>>> cpus=16
>>>         touch-289000 [008] 430907.633820: xfs_log_reserve:      dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
>>>         touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
>>> 
>>> #
>>> 
>>> So this tells us the transaction reservation unit size, the count of
>>> reservations, the current reserve and grant head locations, and the
>>> current head and tail of the log at the time the transaction
>>> reservation is started and then after it completes.
>> 
>> Will do that and report back. Do you want the full log or only some typical
>> entries? The full log would be big; how shall I share it?
> 
> I don't want to see the log. It'll be huge - I regularly generate
> traces containing gigabytes of log accounting traces like this from
> a single workload.
> 
> What I'm asking you to do is run the tracing and then post process
> the values from the trace to determine what operation is using more
> space than is being freed back to the log.
> 
> I generally do this with grep, awk and sed. Some people use python
> or perl. But either way it's a *lot* of work - in the past I have
> spent _weeks_ on trace analysis to find a 4 byte leak in the log
> space accounting. Doing things like graphing the head, tail and grant
> spaces over time tend to show if this is a gradual leak versus a
> sudden step change. If it's a sudden step change, then you can
> isolate it in the trace and work out what happened. If it's a
> gradual change, then you need to start looking for accounting
> discrepancies...
> 
> e.g. a transaction records 32 bytes used in the item, so it releases
> t_unit - 32 bytes at commit. However, the CIL may then only track 28
> bytes of space for the item in the journal and we leak 4 bytes of
> reservation on every one of those items committed.
> 
> These sorts of leaks typically only add up to being something
> significant in situations where the log is flooded with tiny inode
> timestamp changes - 4 bytes per item doesn't really matter when you
> only have a few thousand items in the log, but when you have
> hundreds of thousands of tiny items in the log...
> 

OK, I will work more on this.
# I am going to start a two-week vacation, and will continue on this when I'm back.
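
As a starting point for the post-processing, I'm thinking of something like the
awk sketch below (just a sketch; the field positions are taken from the sample
xfs_log_reserve_exit lines you showed above, and the output file name is made up).
It pulls the timestamp, grant_reserve_bytes and tail_block out of every
xfs_log_reserve_exit record, so the reserve grant head and the log tail can be
graphed over time to see whether the leak is gradual or a sudden step change:

# awk '
    /xfs_log_reserve_exit/ {
        # look the fields up by name instead of by fixed column
        for (i = 1; i <= NF; i++) {
            if ($i == "grant_reserve_bytes") grb  = $(i + 1);
            if ($i == "tail_block")          tail = $(i + 1);
        }
        ts = $3; sub(/:$/, "", ts);   # 3rd field is the "430907.633829:" timestamp
        print ts, grb, tail;
    }' s.t > grant_vs_time.dat

If that shows a sudden jump I'll zoom into the trace around it; if it creeps up
gradually I'll start comparing the reserve/exit pairs against the space actually
handed back at commit, as you describe.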

thanks,
wengang




