On Sun, Nov 22, 2020 at 06:38:28PM +0000, Nick Alcock wrote:
> So I just tried to reboot my x86 server box from 5.9.6 to 5.9.10 and my

Sorry about that, there was a bad patch in -rc4 that got sucked into
5.9.9 because it had a fixes tag.  The revert is already upstream:

https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?id=eb8409071a1d47e3593cfe077107ac46853182ab

--D

> system oopsed with an xfs fs corruption message when I kicked up
> Chromium on another machine which mounted $HOME from the server box (it
> panicked without logging anything, because the corruption was detected
> on the rootfs, and it is also the loghost). A subsequent reboot died
> instantly as soon as it tried to mount root, but the next one got all
> the way to starting Chromium before dying again the same way.
>
> Rebooting back into 5.9.6 causes everything to work fine again, no
> reports of corruption and starting Chromium works.
>
> This fs has rmapbt and reflinks enabled, on a filesystem originally
> created by xfsprogs 4.10.0, but I have never knowingly used them under
> the Chromium config dirs (or, actually, under that user's $HOME at all).
> I've used them extensively elsewhere on the fs though. The FS is sitting
> above a libata -> md-raid6 -> bcache stack. (It is barely possible that
> bcache is at fault, but bcache has seen no changes since 5.9.6 so I
> doubt it.)
>
> The relevant bits of the log I could capture -- no console scrollback
> these days, of course :( and it was a panic anyway so the top is just
> lost -- are in a photo here:
>
> <http://www.esperi.org.uk/~nix/temporary/xfs-crash.jpg>
>
> The mkfs line used to create this fs was:
>
> mkfs.xfs -m rmapbt=1,reflink=1 -d agcount=17,sunit=$((128*8)),swidth=$((384*8)) -l logdev=/dev/sde3,size=521728b -i sparse=1,maxpct=25 /dev/main/root
>
> (/dev/sde3 is an SSD which also hosts the bcache and RAID journal,
> though this RAID device is not journalled, and is operating fine.)
>
> I am not using a realtime device.
>
> I have *not* yet run xfs_repair, but just rebooted back into the old
> kernel, since everything worked there: I'll run xfs_repair over the fs
> if you think it wise to do so, but right now I have a state which
> crashes on one kernel and works on another one, which seems useful to
> not try to fix in case you have some use for it.
>
> Since everything is working fine in 5.9.6 and there were XFS changes
> after that, I'm hypothesising that this is probably a bug in the
> post-5.9.6 changes rather than anything xfs_repair should be trying to
> fix. But I really don't know :)
>
> (I can't help but notice that all these post-5.9.6 XFS changes were
> sucked in by Sasha's magic regression-hunting stable-tree AI, which I
> thought wasn't meant to happen -- but I've not been watching closely,
> and if you changed your minds after the LWN article went in I won't have
> seen it.)