Re: Question: reserve log space at IO time for recover

On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote:
> > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
> >> FYI:
> >> 
> >> I am able to reproduce the XFS mount hang issue with hacked kernels based
> >> on both the 4.14.35 and 6.4.0 kernels.
> >> Reproduce steps:
> >> 
> >> 1. create an XFS filesystem with a 10MiB log (small, so it's easier to
> >>    reproduce). The following steps all target this XFS volume.
> > 
> > Actually, make that a few milliseconds.... :)
> 
> :)
> 
> > mkfs/xfs_info output would be appreciated.
> 
> sure,
> # xfs_info 20GB.bk2
> meta-data=20GB.bk2               isize=256    agcount=4, agsize=1310720 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0

Hmmmm. Why are you only testing v4 filesystems? They are deprecated
and support is largely due to be dropped from upstream in 2025...

Does the same problem occur with a v5 filesystem?
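
If it is easy to re-test, a v5 filesystem with the same tiny log can
be created with something like this (just a sketch - current mkfs.xfs
defaults to v5/crc=1, and the image name and size here are only
placeholders for whatever backing file you are using):

# truncate -s 20g /tmp/xfs-v5.img
# mkfs.xfs -f -l size=10m /tmp/xfs-v5.img

10MiB is right at the minimum log size, so mkfs should accept it
without complaint.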

> >> 5. Checking the on-disk free log space left, it's 181760 bytes for both the
> >>    4.14.35 and 6.4.0 kernels.
> > 
> > Which is clearly wrong. It should be at least 360416 bytes (i.e.
> > tr_itrunc), because that's what the EFI being processed that pins
> > the tail of the log is supposed to have reserved when it was
> > stalled.
> 
> Yep, exactly.
> 
> > So where has the ~180kB of leaked space come from?
> > 
> > Have you traced the grant head reservations to find out
> > what the runtime log space and grant head reservations actually are?
> I have the numbers in vmcore (ignore the WARNs),

That's not what I'm asking. You've dumped the values at the time of
the hang, not traced the runtime reservations that have been made.

> > i.e. we have full tracing of the log reservation accounting via
> > tracepoints in the kernel. If there is a leak occurring, you need to
> > capture a trace of all the reservation accounting operations and
> > post process the output to find out what operation is leaking
> > reserved space. e.g.
> > 
> > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
> > ....
> > # trace-cmd report > s.t
> > # head -3 s.t
> > cpus=16
> >          touch-289000 [008] 430907.633820: xfs_log_reserve:      dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> >          touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > 
> > #
> > 
> > So this tells us the transaction reservation unit size, the count of
> > reservations, the current reserve and grant head locations, and the
> > current head and tail of the log at the time the transaction
> > reservation is started and then after it completes.
> 
> Will do that and report back. Do you want the full log or only some
> typical entries? The full log would be big; how shall I share it?

I don't want to see the log. It'll be huge - I regularly generate
gigabytes of log accounting traces like this from a single workload.

What I'm asking you to do is run the tracing and then post-process
the values from the trace to determine which operation is using more
space than is being freed back to the log.
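
As a starting point, something along these lines will do the
per-event accounting (very much a sketch against the trace-cmd report
output quoted above - check the field positions and names against
your kernel's tracepoints, and note it ignores grant_reserve_cycle
wrapping):

# awk '/grant_reserve_bytes/ {
        # event name is the 4th field, e.g. "xfs_log_reserve:"
        event = $4; sub(/:$/, "", event);
        # pull the reserve grant head position in bytes
        for (i = 1; i <= NF; i++)
                if ($i == "grant_reserve_bytes") bytes = $(i + 1);
        # attribute the head movement since the previous event
        if (seen) delta[event] += bytes - prev;
        prev = bytes; seen = 1;
  }
  END { for (e in delta) printf "%-28s %12d\n", e, delta[e] }' s.t

Event types whose accumulated movement never nets back out against
the matching exit/ungrant events are the place to start digging.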

I generally do this with grep, awk and sed; some people use Python
or Perl. Either way it's a *lot* of work - in the past I have spent
_weeks_ on trace analysis to find a 4 byte leak in the log space
accounting. Doing things like graphing the head, tail and grant
spaces over time tends to show whether this is a gradual leak or a
sudden step change. If it's a sudden step change, then you can
isolate it in the trace and work out what happened. If it's a
gradual change, then you need to start looking for accounting
discrepancies...
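
For the graphing side, dumping a time series out of the same report
and throwing it at gnuplot or a spreadsheet is usually enough (again
a sketch, with the same caveats about field layout as above; note the
grant head is in bytes while curr/tail_block are in basic blocks, so
plot them on separate axes):

# awk '/grant_reserve_bytes/ {
        ts = $3; sub(/:$/, "", ts);
        for (i = 1; i <= NF; i++) {
                if ($i == "grant_reserve_bytes") grant = $(i + 1);
                if ($i == "curr_block")          head  = $(i + 1);
                if ($i == "tail_block")          tail  = $(i + 1);
        }
        print ts "," grant "," head "," tail;
  }' s.t > heads.csv

Plotting those columns against time makes it pretty obvious whether
the grant head walks away from the tail gradually or in a single
step.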

As an example of such a discrepancy: a transaction records 32 bytes
used in the item, so it releases t_unit - 32 bytes at commit.
However, the CIL may then only track 28 bytes of space for the item
in the journal, and we leak 4 bytes of reservation on every one of
those items committed.

These sorts of leaks typically only add up to something significant
in situations where the log is flooded with tiny inode timestamp
changes - 4 bytes per item doesn't really matter when you only have
a few thousand items in the log, but when you have hundreds of
thousands of tiny items in the log...
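
To put a number on it for this reproducer: at 4 bytes leaked per
item, roughly 250,000 items lose you about 1MB of reservation -
around a tenth of the 10MiB log above, which is well into the range
where it matters.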

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


