On Wed, Jul 26, 2023 at 02:08:05PM +1000, Dave Chinner wrote:
> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote:
> > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
> > >> FYI:
> > >>
> > >> I am able to reproduce the XFS mount hang issue with hacked kernels
> > >> based on both the 4.14.35 kernel and the 6.4.0 kernel.
> > >> Reproduce steps:
> > >>
> > >> 1. create an XFS with a 10MiB log size (small, so easier to
> > >>    reproduce). The following steps all aim at this XFS volume.
> > >
> > > Actually, make that a few milliseconds.... :)
> >
> > :)
> >
> > > mkfs/xfs_info output would be appreciated.
> >
> > sure,
> > # xfs_info 20GB.bk2
> > meta-data=20GB.bk2               isize=256    agcount=4, agsize=1310720 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=0        finobt=0, sparse=0, rmapbt=0
> >          =                       reflink=0
>
> Hmmmm. Why are you only testing v4 filesystems? They are deprecated
> and support is largely due to be dropped from upstream in 2025...
>
> Does the same problem occur with a v5 filesystem?
>
> > >> 5. Checking the on-disk free log space left, it’s 181760 bytes for
> > >>    both the 4.14.35 kernel and the 6.4.0 kernel.
> > >
> > > Which is clearly wrong. It should be at least 360416 bytes (i.e.
> > > tr_itrunc), because that's what the EFI being processed that pins
> > > the tail of the log is supposed to have reserved when it was
> > > stalled.
> >
> > Yep, exactly.
> >
> > > So where has the ~180kB of leaked space come from?
> > >
> > > Have you traced the grant head reservations to find out
> > > what the runtime log space and grant head reservations actually are?
> >
> > I have the numbers in vmcore (ignore the WARNs),
>
> That's not what I'm asking. You've dumped the values at the time of
> the hang, not traced the runtime reservations that have been made.
>
> > > i.e. we have full tracing of the log reservation accounting via
> > > tracepoints in the kernel. If there is a leak occurring, you need to
> > > capture a trace of all the reservation accounting operations and
> > > post process the output to find out what operation is leaking
> > > reserved space. e.g.
> > >
> > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
> > > ....
> > > # trace-cmd report > s.t
> > > # head -3 s.t
> > > cpus=16
> > > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > >
> > > #
> > >
> > > So this tells us the transaction reservation unit size, the count of
> > > reservations, the current reserve and grant head locations, and the
> > > current head and tail of the log at the time the transaction
> > > reservation is started and then after it completes.
> >
> > Will do that and report back. Do you want the full log or only some
> > typical entries? The full log would be big; how shall I share it?
>
> I don't want to see the log.
> It'll be huge - I regularly generate traces containing gigabytes of
> log accounting traces like this from a single workload.
>
> What I'm asking you to do is run the tracing and then post process
> the values from the trace to determine what operation is using more
> space than is being freed back to the log.
>
> I generally do this with grep, awk and sed. Some people use python
> or perl. But either way it's a *lot* of work - in the past I have
> spent _weeks_ on trace analysis to find a 4 byte leak in the log
> space accounting. Doing things like graphing the head, tail and grant
> spaces over time tends to show whether this is a gradual leak versus
> a sudden step change. If it's a sudden step change, then you can
> isolate it in the trace and work out what happened. If it's a
> gradual change, then you need to start looking for accounting
> discrepancies...

Any chance you'd be willing to share that pipeline? It'd be useful to
stash that kind of debugging program in xfsprogs/tools to save time
and eliminate an entire class of "Hey Dave, did I capture this
correctly?" questions.

(At least until someone changes the tracepoints :P)

--D

> e.g. a transaction records 32 bytes used in the item, so it releases
> t_unit - 32 bytes at commit. However, the CIL may then only track 28
> bytes of space for the item in the journal and we leak 4 bytes of
> reservation on every one of those items committed.
>
> These sorts of leaks typically only add up to being something
> significant in situations where the log is flooded with tiny inode
> timestamp changes - 4 bytes per item doesn't really matter when you
> only have a few thousand items in the log, but when you have
> hundreds of thousands of tiny items in the log...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
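P.S. In the meantime, here's roughly the shape of thing I was imagining
for xfsprogs/tools. To be clear, this is an untested sketch of my own,
not Dave's actual pipeline - it only knows about the
xfs_log_reserve/xfs_log_reserve_exit events and the field names quoted
above, and simply flattens "trace-cmd report" output into CSV so the
reserve grant head and the log head/tail can be graphed over time:

#!/usr/bin/env python3
#
# Untested sketch: turn "trace-cmd report" output (stdin) into a CSV of
# the reserve grant head position and the log head/tail over time, so
# all three can be plotted.  Only the xfs_log_reserve and
# xfs_log_reserve_exit events and the fields quoted earlier in this
# thread are assumed; everything else here is illustrative.
import re
import sys

EVENTS = ("xfs_log_reserve:", "xfs_log_reserve_exit:")
COLS = ("grant_reserve_cycle", "grant_reserve_bytes",
        "curr_cycle", "curr_block", "tail_cycle", "tail_block")
# "key value" pairs like "grant_reserve_bytes 482800"
KV = re.compile(r"(\w+) (-?\d+)")

print("timestamp,event,comm," + ",".join(COLS))
for line in sys.stdin:
    event = next((e for e in EVENTS if e in line), None)
    if event is None:
        continue
    # e.g. "touch-289000 [008] 430907.633820: xfs_log_reserve: dev ..."
    head, _, rest = line.partition(event)
    fields = dict(KV.findall(rest))
    comm = head.split()[0]               # process-pid
    ts = head.split()[-1].rstrip(":")    # timestamp
    print(",".join([ts, event.rstrip(":"), comm] +
                   [fields.get(c, "") for c in COLS]))

Pipe "trace-cmd report" into it, load the CSV into whatever plotting
tool you like, and a grant_reserve_bytes that keeps creeping up while
the distance between curr_block and tail_block stays flat would point
at the kind of gradual leak Dave describes. Cycle wraps, the ungrant
side of the accounting, and the rest of the xfs_log\*/xlog\* events are
left as an exercise for the real tool.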