On Wed, Nov 30, 2016 at 02:07:39PM +0100, Christian Theune wrote:
> Hi there,
>
> we’re running a Ceph cluster which had a very rough outage not
> long ago[1].
>
> When updating our previous kernels from 4.1.16 (Gentoo) to 4.4.27
> (Gentoo) we encountered the following problem in our production
> environment (but not in staging or development):

Hi Christian - thanks for persevering and getting this report to
the list. :P

> - Properly shut down and reboot the machine running Ceph OSDs on XFS w/ kernel 4.1.16.
> - Boot with 4.4.27, let the machine mount the FS’ and start OSDs
> - Have everything run 20-30 minutes
> - Ceph OSDs start crashing. Kernel shows messages attached in kern.log

Which shouldn't happen.

I'm pretty sure it's the AGFL packing change that has caused the
problem here, but I'm still paging all that back into memory and
clearing out all the other little things I need to before digging
back into this.

I have a couple of ideas about how this could occur:

> An interesting error we saw during repair was this (I can’t remember or reconstruct whether this was on the 4.1 or 4.4 kernel):
>
> bad agbno 4294967295 in agfl, agno 12
> freeblk count 7 != flcount 6 in ag 12
> sb_fdblocks 82969993, counted 82969994

Because this:

> Note, that the agbno is 2**32-1 repeatedly

is NULLAGBNO, which is what the AGFL is initialised to by mkfs, and
indicates we're accessing a slot that hasn't been filled correctly.

> Also interesting: the broken filesystems and xfs_repair behaved
> completely differently whether talked to from a 4.1 or 4.4 kernel,
> thus the pattern of first running xfs_repair on 4.1 and then again
> on 4.4.

Yup, I'd expect that given that xfs_repair has the same AGFL packing
issue and what it ends up with is dependent on whether the packing
matches the kernel being run or not...

> This looks similar to [2] and may be related to the already fixed
> bug referenced by Dave in [3], but in our case there was no 32/64
> bit migration involved.

That was the initial discovery vector, but looking into this again
I suspect the issue is that packing changes the slot indexing.

I do have a patchset where I started trying to fix all this up
automatically, and so I need to go back to that and sort out where
I was up to and see if I was addressing this index offset problem
at all. This is where I previously got up to:

https://www.spinics.net/lists/linux-xfs/msg00445.html

More tomorrow once I've dug in further...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
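
To make the two points above concrete (the 4294967295 value and
why packing can shift the slot count), here is a minimal Python
sketch. It is not taken from the kernel or xfsprogs sources. The
header layout (magic, seqno, uuid, lsn, crc), the 8-byte alignment
of the lsn field, and the helper names header_size and agfl_slots
are all assumptions made for illustration.

```python
# Illustrative sketch only; not kernel or xfsprogs code.
# NULLAGBNO is the all-ones 32-bit value, which xfs_repair printed
# as 4294967295 in the report above.
NULLAGBNO = 0xFFFFFFFF

# Assumed AGFL header fields, in order, with their sizes in bytes:
# magic (4), seqno (4), uuid (16), lsn (8), crc (4).
HEADER_FIELDS = [4, 4, 16, 8, 4]
LSN_ALIGN = 8  # assumed alignment of the 64-bit lsn field


def header_size(packed: bool) -> int:
    """Size of the assumed header, with or without tail padding."""
    raw = sum(HEADER_FIELDS)
    if packed:
        return raw
    # An unpacked struct is rounded up to its strictest alignment.
    return -(-raw // LSN_ALIGN) * LSN_ALIGN


def agfl_slots(sector_size: int, packed: bool) -> int:
    """How many 4-byte block-number slots fit after the header."""
    return (sector_size - header_size(packed)) // 4


if __name__ == "__main__":
    print("NULLAGBNO =", NULLAGBNO)
    for sector_size in (512, 4096):
        print(sector_size,
              "packed:", agfl_slots(sector_size, packed=True),
              "unpacked:", agfl_slots(sector_size, packed=False))
```

Under these assumptions the two layouts disagree by one slot, so
code built with one layout can index a slot that code built with
the other never filled, and that slot would still hold NULLAGBNO.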