Hi there,

we’re running a Ceph cluster which had a very rough outage not long ago [1]. When updating our kernels from 4.1.16 (Gentoo) to 4.4.27 (Gentoo) we encountered the following problem in our production environment (but not in staging or development):

- Properly shut down and reboot a machine running Ceph OSDs on XFS with kernel 4.1.16.
- Boot with 4.4.27, let the machine mount the filesystems and start the OSDs.
- Have everything run for 20-30 minutes.
- The Ceph OSDs start crashing. The kernel shows the messages attached in kern.log.
- Panic. Breathe.
- The RAID controllers (LSI) did not exhibit any sign of disk problems at all.
- Trying to interact with the crashed filesystems, e.g. through xfs_repair, caused infinitely hanging syscalls. A clean reboot was no longer possible at that point.

After some experimentation, the way to clean things up with negligible residual harm was (a rough sketch of the commands follows below):

- reboot into the 4.1 kernel
- run xfs_repair, forcing the journal to be zeroed with -L in some instances
- ensure a second xfs_repair comes back clean, also after a mount/umount cycle
- reboot into the 4.4 kernel
- run xfs_repair again, ensure it eventually becomes clean and stays that way after mount/umount as well as a reboot cycle

An interesting error we saw during repair was this (I can’t remember or reconstruct whether it occurred on the 4.1 or the 4.4 kernel):

    bad agbno 4294967295 in agfl, agno 12
    freeblk count 7 != flcount 6 in ag 12
    sb_fdblocks 82969993, counted 82969994
    bad agbno 4294967295 in agfl, agno 13
    freeblk count 7 != flcount 6 in ag 13
    sb_fdblocks 98156324, counted 98156325

Note that the agbno is repeatedly 2**32-1 and that sb_fdblocks is off by one. I personally don’t have enough internal XFS knowledge, but to me this smells “interesting”.

Also interesting: the broken filesystems and xfs_repair behaved completely differently depending on whether they were talked to from a 4.1 or a 4.4 kernel, hence the pattern of first running xfs_repair on 4.1 and then again on 4.4.
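To make the recovery procedure concrete, this is roughly what we ran per affected filesystem; the device and mount point names are placeholders, not our actual layout:

    DEV=/dev/sdb1          # placeholder for one affected OSD filesystem
    MNT=/srv/ceph/osd.0    # placeholder mount point

    # --- booted into the 4.1 kernel ---
    xfs_repair "$DEV"      # in some instances the dirty log had to be zeroed:
    # xfs_repair -L "$DEV"
    mount "$DEV" "$MNT" && umount "$MNT"
    xfs_repair -n "$DEV"   # second pass in check-only mode should come back clean

    # --- after rebooting into the 4.4 kernel ---
    xfs_repair "$DEV"
    mount "$DEV" "$MNT" && umount "$MNT"
    xfs_repair -n "$DEV"   # repeat until it stays clean across mount/umount and a reboot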
Attachment: kern.log.gz (GNU Zip compressed data)
This looks similar to [2] and may be related to the already-fixed bug referenced by Dave in [3], but in our case there was no 32/64-bit migration involved. I’d love it if someone could check whether this is a new bug - I reviewed the kernel changelogs since the old kernel we were running but could not find anything that I can pinpoint to our situation.

Unfortunately, my notes aren’t as complete as I would have liked them to be - let me know if you need anything specific and I’ll do my best to dig it up.

Cheers and thanks in advance,
Christian

[1] http://status.flyingcircus.io/incidents/h37gk5v81nz5
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1576599
[3] https://plus.google.com/u/0/+FlorianHaas/posts/LNYMKQF7rgU

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick