Hi there,

we’re running a Ceph cluster which had a very rough outage not long ago [1]. When updating our kernels from 4.1.16 (Gentoo) to 4.4.27 (Gentoo) we encountered the following problem in our production environment (but not in staging or development):

- Properly shut down and reboot a machine running Ceph OSDs on XFS with kernel 4.1.16.
- Boot with 4.4.27, let the machine mount the filesystems and start the OSDs.
- Have everything run for 20-30 minutes.
- The Ceph OSDs start crashing. The kernel shows the messages attached in kern.log.
- Panic. Breathe.
- The RAID controllers (LSI) did not exhibit any sign of disk problems at all.
- Trying to interact with the crashed filesystems, e.g. through xfs_repair, caused infinitely hanging syscalls. A clean reboot was no longer possible at that point.

After some experimentation, the way to clean things up with negligible residual harm was (a rough sketch of the commands follows below):

- reboot into the 4.1 kernel
- run xfs_repair, forcing the journal to be zeroed with -L in some instances
- ensure a second xfs_repair comes back clean, also after a mount/umount cycle
- reboot into the 4.4 kernel
- run xfs_repair again, ensure it eventually becomes clean and stays that way after mount/umount as well as a reboot cycle

An interesting error we saw during repair was this (I can’t remember or reconstruct whether it occurred on the 4.1 or the 4.4 kernel):

    bad agbno 4294967295 in agfl, agno 12
    freeblk count 7 != flcount 6 in ag 12
    sb_fdblocks 82969993, counted 82969994
    bad agbno 4294967295 in agfl, agno 13
    freeblk count 7 != flcount 6 in ag 13
    sb_fdblocks 98156324, counted 98156325

Note that the agbno is repeatedly 2**32-1 and that sb_fdblocks is off by one. I personally don’t have enough internal XFS knowledge, but to me this smells “interesting”.

Also interesting: the broken filesystems and xfs_repair behaved completely differently depending on whether they were talked to from a 4.1 or a 4.4 kernel, hence the pattern of first running xfs_repair on 4.1 and then again on 4.4.
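To make the recovery procedure concrete, this is roughly what we ran per affected filesystem; the device and mount point names are placeholders, not our actual layout:

    DEV=/dev/sdb1          # placeholder for one affected OSD filesystem
    MNT=/srv/ceph/osd.0    # placeholder mount point

    # --- booted into the 4.1 kernel ---
    xfs_repair "$DEV"      # in some instances the dirty log had to be zeroed:
    # xfs_repair -L "$DEV"
    mount "$DEV" "$MNT" && umount "$MNT"
    xfs_repair -n "$DEV"   # second pass in check-only mode should come back clean

    # --- after rebooting into the 4.4 kernel ---
    xfs_repair "$DEV"
    mount "$DEV" "$MNT" && umount "$MNT"
    xfs_repair -n "$DEV"   # repeat until it stays clean across mount/umount and a reboot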
Attachment: kern.log.gz (GNU Zip compressed data)
This looks similar to [2] and may be related to the already-fixed bug referenced by Dave in [3], but in our case there was no 32/64-bit migration involved. I’d love it if someone could check whether this is a new bug - I reviewed the kernel changelogs since the old kernel we were running but could not find anything that I can pinpoint to our situation.

Unfortunately, my notes aren’t as complete as I would have liked them to be - let me know if you need anything specific and I’ll do my best to dig it up.

Cheers and thanks in advance,
Christian

[1] http://status.flyingcircus.io/incidents/h37gk5v81nz5
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1576599
[3] https://plus.google.com/u/0/+FlorianHaas/posts/LNYMKQF7rgU

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick