On Wed, Jul 26, 2023 at 08:23:20AM -0700, Darrick J. Wong wrote:
> On Wed, Jul 26, 2023 at 02:08:05PM +1000, Dave Chinner wrote:
> > On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote:
> > > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
> > > > i.e. we have full tracing of the log reservation accounting via
> > > > tracepoints in the kernel. If there is a leak occurring, you need to
> > > > capture a trace of all the reservation accounting operations and
> > > > post process the output to find out what operation is leaking
> > > > reserved space. e.g.
> > > >
> > > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
> > > > ....
> > > > # trace-cmd report > s.t
> > > > # head -3 s.t
> > > > cpus=16
> > > >      touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > > >      touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > > >
> > > > #
> > > >
> > > > So this tells us the transaction reservation unit size, the count of
> > > > reservations, the current reserve and grant head locations, and the
> > > > current head and tail of the log at the time the transaction
> > > > reservation is started and then after it completes.
> > >
> > > Will do that and report back. You want full log or only some typical
> > > ones? Full log would be big, how shall I share?
> >
> > I don't want to see the log. It'll be huge - I regularly generate
> > traces containing gigabytes of log accounting traces like this from
> > a single workload.
> >
> > What I'm asking you to do is run the tracing and then post process
> > the values from the trace to determine what operation is using more
> > space than is being freed back to the log.
> >
> > I generally do this with grep, awk and sed. Some people use python
> > or perl. But either way it's a *lot* of work - in the past I have
> > spent _weeks_ on trace analysis to find a 4 byte leak in the log
> > space accounting. Doing things like graphing the head, tail and grant
> > spaces over time tends to show if this is a gradual leak versus a
> > sudden step change. If it's a sudden step change, then you can
> > isolate it in the trace and work out what happened. If it's a
> > gradual change, then you need to start looking for accounting
> > discrepancies...
>
> Any chance you'd be willing to share that pipeline?

It's all just one-off scripts I tend to write from scratch each time.
I'm always looking for different things, and I never know what I'm
looking for ahead of time, so the parser is always different....

> It'd be useful to
> stash that kind of debugging program in xfsprogs/tools to save time and
> eliminate an entire class of "Hey Dave, did I capture this correctly?"
> questions.
>
> (At least until someone changes the tracepoints :P)

That happens frequently enough... :/

Regardless, it's the process of writing filter scripts that helps me
find the problem I'm looking for, not the actual filter script itself.
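To be concrete about the sort of throwaway filter I mean - this is
only a sketch, not a canned tool, and it assumes the 'trace-cmd report'
line layout and the xfs_log_reserve/xfs_log_reserve_exit field names
shown in the quoted output above, which can and do change between
kernels - the starting point is usually no more than something like:

#!/usr/bin/env python3
# Throwaway starting point: flatten "trace-cmd report" output into
# columns that can be graphed to watch the reserve grant head and the
# log tail over time. Field names are taken from the
# xfs_log_reserve/xfs_log_reserve_exit lines quoted above and may not
# match other kernel versions.
import re
import sys

# e.g. "touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 ..."
event_re = re.compile(r'\s*(?P<task>\S+)\s+\[\d+\]\s+(?P<time>[\d.]+):\s+'
                      r'(?P<event>xfs_log\w*|xlog\w*):\s+(?P<fields>.*)')

def fields_to_dict(s):
    # "grant_reserve_bytes 482800 ... tail_block 2" -> {name: value}
    toks = s.split()
    return {toks[i]: toks[i + 1] for i in range(0, len(toks) - 1, 2)}

print("time event grant_reserve_cycle grant_reserve_bytes tail_cycle tail_block")
for line in sys.stdin:
    m = event_re.match(line)
    if not m:
        continue
    f = fields_to_dict(m.group('fields'))
    if 'grant_reserve_bytes' not in f:
        # not every xfs_log*/xlog* event carries the grant/tail fields
        continue
    print(m.group('time'), m.group('event'),
          f.get('grant_reserve_cycle', '-'), f['grant_reserve_bytes'],
          f.get('tail_cycle', '-'), f.get('tail_block', '-'))

Feed 'trace-cmd report' output into it on stdin and graph the columns
over time; whether the reserve grant head walks away from the tail
gradually or in a single step change is usually obvious at a glance.
The real work is in rewriting the filter for whatever question the
graph raises next.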
e.g. I've been working on a similar "EFI reservation deadlocks on log
space during recovery" issue over the past couple of days from a
customer metadump. The filesystem had a 2GB log, and it was full. Log
recovery took several minutes to run, there were so many items to
recover. The (incomplete) logprint of all the op headers was over a
million lines (somewhere around 400k individual log items) and I know
that logprint output skipped at least two checkpoints that had over
65,000 ophdrs in them each. So log recovery was recovering roughly
half a million log items from the log, and only then did it hang on
the first EFI that needed to be recovered.

The mount trace was somewhere around 20GB of event data at the point
of the hang, and the logprint output indicated about 1.3 million
ophdrs in the log that were being processed by recovery. I worked that
out to be around half a million individual log items in the log that
needed recovery.

In setting up a regex to filter the trace down to something
manageable, I noticed that the id of the EFI that hung recovery
matched an EFI in the very first checkpoint being recovered from the
log (from the logprint output).

Logprint told me that this checkpoint was almost entirely EFI/EFDs
from an unlinked inode being inactivated. The last two log items in
that checkpoint were the inode item and a single EFI. A quick grep
failed to find that inode anywhere else in the logprint, and the EFI
id didn't show up in any EFD item, either. Pulling them out of the
mount trace confirmed that they only appeared in recovery of the
initial checkpoint.

IOWs, the EFI - and the inode that was locked across the processing of
the EFI at runtime - had pinned the tail of the log, and that's why
recovery hung waiting on log space.

It turns out that the hang in log recovery followed a system reboot
due to a hung task timeout. So right now, it looks to me like the
runtime code had deadlocked the journal: a transaction reservation
during a roll couldn't make progress because the inode held locked
across the transaction roll pinned the tail of the log....

How that happened, I have no idea, but it was the _process_ of writing
filters to analyse the logprint data and log reservation traces that
allowed me to find this, not because I had a set of canned scripts
that exposed that explicit information. Even if I write a canned
script to find unmatched EFIs in a logprint, the logprint output is
sufficiently unreliable that I couldn't rely on the output of the
filter script anyway.....

Fundamentally, I don't have the time to run this trace+filter process
for every report of similar problems - it takes a long time just to
work out what I need to look for in the traces. Hence it's important
that more people learn the process - how to read and filter the traces
to isolate something that might be going wrong. I don't have canned
scripts because the scripts I write are guided by what I think I need
to search for, and I regularly get to a dead end and have to start the
filtering process all over again to look for something else entirely.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx