On Wed, Jul 26, 2023 at 08:23:20AM -0700, Darrick J. Wong wrote:
> On Wed, Jul 26, 2023 at 02:08:05PM +1000, Dave Chinner wrote:
> > On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote:
> > > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
> > > > i.e. we have full tracing of the log reservation accounting via
> > > > tracepoints in the kernel. If there is a leak occurring, you need to
> > > > capture a trace of all the reservation accounting operations and
> > > > post process the output to find out what operation is leaking
> > > > reserved space. e.g.
> > > >
> > > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
> > > > ....
> > > > # trace-cmd report > s.t
> > > > # head -3 s.t
> > > > cpus=16
> > > >      touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > > >      touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > > >
> > > > #
> > > >
> > > > So this tells us the transaction reservation unit size, the count of
> > > > reservations, the current reserve and grant head locations, and the
> > > > current head and tail of the log at the time the transaction
> > > > reservation is started and then after it completes.
> > >
> > > Will do that and report back. You want full log or only some typical
> > > ones? Full log would be big, how shall I share?
> >
> > I don't want to see the log. It'll be huge - I regularly generate
> > traces containing gigabytes of log accounting traces like this from
> > a single workload.
> >
> > What I'm asking you to do is run the tracing and then post process
> > the values from the trace to determine what operation is using more
> > space than is being freed back to the log.
> >
> > I generally do this with grep, awk and sed. Some people use python
> > or perl. But either way it's a *lot* of work - in the past I have
> > spent _weeks_ on trace analysis to find a 4 byte leak in the log
> > space accounting. Doing things like graphing the head, tail and grant
> > spaces over time tends to show if this is a gradual leak versus a
> > sudden step change. If it's a sudden step change, then you can
> > isolate it in the trace and work out what happened. If it's a
> > gradual change, then you need to start looking for accounting
> > discrepancies...
>
> Any chance you'd be willing to share that pipeline?

It's all just one-off scripts I tend to write from scratch each time.
I'm always looking for different things, and I never know what I'm
looking for ahead of time, so the parser is always different....

> It'd be useful to
> stash that kind of debugging program in xfsprogs/tools to save time and
> eliminate an entire class of "Hey Dave, did I capture this correctly?"
> questions.
>
> (At least until someone changes the tracepoints :P)

That happens frequently enough... :/

Regardless, it's the process of writing filter scripts that helps me
find the problem I'm looking for, not the actual filter script itself.
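To be concrete about the sort of throwaway filter I mean - this is
only a sketch, not a canned tool, and it assumes the 'trace-cmd report'
line layout and the xfs_log_reserve/xfs_log_reserve_exit field names
shown in the quoted output above, which can and do change between
kernels - the starting point is usually no more than something like:

#!/usr/bin/env python3
# Throwaway starting point: flatten "trace-cmd report" output into
# columns that can be graphed to watch the reserve grant head and the
# log tail over time. Field names are taken from the
# xfs_log_reserve/xfs_log_reserve_exit lines quoted above and may not
# match other kernel versions.
import re
import sys

# e.g. "touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 ..."
event_re = re.compile(r'\s*(?P<task>\S+)\s+\[\d+\]\s+(?P<time>[\d.]+):\s+'
                      r'(?P<event>xfs_log\w*|xlog\w*):\s+(?P<fields>.*)')

def fields_to_dict(s):
    # "grant_reserve_bytes 482800 ... tail_block 2" -> {name: value}
    toks = s.split()
    return {toks[i]: toks[i + 1] for i in range(0, len(toks) - 1, 2)}

print("time event grant_reserve_cycle grant_reserve_bytes tail_cycle tail_block")
for line in sys.stdin:
    m = event_re.match(line)
    if not m:
        continue
    f = fields_to_dict(m.group('fields'))
    if 'grant_reserve_bytes' not in f:
        # not every xfs_log*/xlog* event carries the grant/tail fields
        continue
    print(m.group('time'), m.group('event'),
          f.get('grant_reserve_cycle', '-'), f['grant_reserve_bytes'],
          f.get('tail_cycle', '-'), f.get('tail_block', '-'))

Feed 'trace-cmd report' output into it on stdin and graph the columns
over time; whether the reserve grant head walks away from the tail
gradually or in a single step change is usually obvious at a glance.
The real work is in rewriting the filter for whatever question the
graph raises next.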
e.g. I've been working on a similar "EFI reservation deadlocks on log
space during recovery" issue over the past couple of days from a
customer metadump. The filesystem had a 2GB log, and it was full. Log
recovery took several minutes to run, there were so many items to
recover. The (incomplete) logprint of all the op headers was over a
million lines (somewhere around 400k individual log items) and I know
that logprint output skipped at least two checkpoints that had over
65,000 ophdrs in them each. So log recovery was recovering roughly
half a million log items from the log, and only then did it hang on
the first EFI that needed to be recovered.

The mount trace was somewhere around 20GB of event data at the point
of the hang, and the logprint output indicated about 1.3 million
ophdrs in the log that were being processed by recovery. I worked that
out to be around half a million individual log items in the log that
needed recovery.

In setting up a regex to filter the trace down to something
manageable, I noticed that the id of the EFI that hung recovery
matched an EFI in the very first checkpoint being recovered from the
log (from the logprint output).

Logprint told me that this checkpoint was almost entirely EFI/EFDs
from an unlinked inode being inactivated. The last two log items in
that checkpoint were the inode item and a single EFI. A quick grep
failed to find that inode anywhere else in the logprint, and the EFI
id didn't show up in any EFD item, either. Pulling them out of the
mount trace confirmed that they only appeared in recovery of the
initial checkpoint.

IOWs, the EFI - and the inode that was locked across the processing of
the EFI at runtime - had pinned the tail of the log, and that's why
recovery hung waiting on log space.

It turns out that the hang in log recovery followed a system reboot
due to a hung task timeout. So right now, it looks to me like the
runtime code had deadlocked the journal: a transaction reservation
during a roll couldn't make progress because the inode held locked
across the transaction roll pinned the tail of the log....

How that happened, I have no idea, but it was the _process_ of writing
filters to analyse the logprint data and log reservation traces that
allowed me to find this, not because I had a set of canned scripts
that exposed that explicit information. Even if I write a canned
script to find unmatched EFIs in a logprint, the logprint output is
sufficiently unreliable that I couldn't rely on the output of the
filter script anyway.....

Fundamentally, I don't have the time to run this trace+filter process
for every report of similar problems - it takes a long time just to
work out what I need to look for in the traces. Hence it's important
that more people learn the process - how to read and filter the traces
to isolate something that might be going wrong. I don't have canned
scripts because the scripts I write are guided by what I think I need
to search for, and I regularly get to a dead end and have to start the
filtering process all over again to look for something else entirely.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx