Re: [xfstests generic/648] 64k directory block size (-n size=65536) crash on _xfs_buf_ioapply

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 22 Jan 2024 22:21:07 +1100

On Mon, Jan 22, 2024 at 03:23:12PM +0800, Zorro Lang wrote:
> On Sun, Jan 21, 2024 at 10:58:49AM +1100, Dave Chinner wrote:
> > On Sat, Jan 20, 2024 at 07:26:00PM +0800, Zorro Lang wrote:
> > > On Fri, Jan 19, 2024 at 06:17:24PM +1100, Dave Chinner wrote:
> > > > Perhaps a bisect from 6.7 to 6.7+linux-xfs/for-next to identify what
> > > > fixed it? Nothing in the for-next branch really looks relevant to
> > > > the problem to me....
> > > 
> > > Hi Dave,
> > > 
> > > Finally, I got a chance to reproduce this issue on latest upstream mainline
> > > linux (HEAD=9d64bf433c53) (and linux-xfs) again.
> > > 
> > > Looks like some userspace updates hide the issue, but I haven't found out what
> > > change does that, due to it's a big change about a whole system version. I
> > > reproduced this issue again by using an old RHEL distro (but the kernel is the newest).
> > > (I'll try to find out what changes cause that later if it's necessary)
> > > 
> > > Anyway, I enabled the "CONFIG_XFS_ASSERT_FATAL=y" and "CONFIG_XFS_DEBUG=y" as
> > > you suggested. And got the xfs metadump file after it crashed [1] and rebooted.
> > > 
> > > Due to g/648 tests on a loopimg in SCRATCH_MNT, so I didn't dump the SCRATCH_DEV,
> > > but dumped the $SCRATCH_MNT/testfs file, you can get the metadump file from:
> > > 
> > > https://drive.google.com/file/d/14q7iRl7vFyrEKvv_Wqqwlue6vHGdIFO1/view?usp=sharing
> > 
> > Ok, I forgot the log on s390 is in big endian format. I don't have a
> > bigendian machine here, so I can't replay the log to trace it or
> > find out what disk address the buffer belongs. I can't even use
> > xfs_logprint to dump the log.
> > 
> > Can you take that metadump, restore it on the s390 machine, and
> > trace a mount attempt? i.e in one shell run 'trace-cmd record -e
> > xfs\*' and then in another shell run 'mount testfs.img /mnt/test'
> 
> The 'mount testfs.img /mnt/test' will crash the kernel and reboot
> the system directly ...

Turn off panic-on-oops. Something like 'echo 0 >
/proc/sys/kernel/panic_on_oops' will do that, I think.

> > and then after the assert fail terminate the tracing and run
> > 'trace-cmd report > testfs.trace.txt'?
> 
> ... Can I still get the trace report after rebooting?

Not that I know of. But, then again, I don't reboot test machines
when an oops or assert fail occurs - I like to have a warm corpse
left behind that I can poke around in with various blunt instruments
to see what went wrong....
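
For reference, the sequence discussed in this thread could be sketched
as a single script along the following lines. This is an illustration
only, not a tested recipe: the DRY_RUN switch, the run() helper and the
IMG/MNT/OUT variable names are made up for the example, and it hands the
mount command to 'trace-cmd record' directly instead of using two
shells. It assumes that with panic-on-oops disabled the failed assert
only kills the mount process, so that trace-cmd can still write out its
trace.dat afterwards.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) just prints the commands;
# set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
IMG="${IMG:-testfs.img}"          # restored metadump image
MNT="${MNT:-/mnt/test}"           # mount point used in the thread
OUT="${OUT:-testfs.trace.txt}"    # text report to send back

# Hypothetical helper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Do not panic (and reboot) when the oops/assert failure fires.
run sysctl -w kernel.panic_on_oops=0

# 2. Record all xfs tracepoints while attempting the mount.
#    trace-cmd record accepts a command to run under tracing.
run trace-cmd record -e 'xfs*' mount "$IMG" "$MNT"

# 3. Convert the recorded trace.dat into a text report.
if [ "$DRY_RUN" = 1 ]; then
    echo "+ trace-cmd report > $OUT"
else
    trace-cmd report > "$OUT"
fi
```

Running it with the default DRY_RUN=1 only prints the three commands, so
they can be checked before running for real with DRY_RUN=0 as root.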

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx