Re: Failure growing xfs with linux 3.10.5

Eric Sandeen wrote:
> On 8/15/13 12:55 PM, Michael Maier wrote:
>> Eric Sandeen wrote:
>>> On 8/14/13 11:20 AM, Michael Maier wrote:
>>>> Dave Chinner wrote:
>>>
>>> ...
>>>
>>>>> If it makes you feel any better, the bug that caused this had been
>>>>> in the code for 15+ years and you are the first person I know of to
>>>>> have ever hit it....
>>>>
>>>> Probably the second one :-) See
>>>> http://thread.gmane.org/gmane.comp.file-systems.xfs.general/54428
>>>>
>>>>> xfs_repair doesn't appear to have any checks in it to detect this
>>>>> situation or repair it - there are some conditions for zeroing the
>>>>> unused parts of a superblock, but they are focused on detecting
>>>>> and correcting damage caused by a buggy Irix 6.5-beta mkfs from 15
>>>>> years ago.
>>>>
>>>> The _big problem_ is: xfs_repair not only fails to repair it, it
>>>> _causes data loss_ in some situations!
>>>>
>>>
>>> So as far as I can tell at this point, a few things have happened to
>>> result in this unfortunate situation.  Congratulations, you hit a
>>> perfect storm.  :(
>>
>> I can reassure you - since it "only" hit my backup device, and because
>> I noticed the problem before I really needed it, I didn't suffer any
>> data loss at all: the original data is OK, and I have now repeated the
>> backup onto the fixed FS!
>>
>>> 1) prior resize operations populated unused portions of backup sbs w/ junk
>>> 2) newer kernels fail to verify superblocks in this state
>>> 3) during your growfs under 3.10, that verification failure aborted
>>>    backup superblock updates, leaving many unmodified
>>> 4a) xfs_repair doesn't find or fix the junk in the backup sbs, and
>>> 4b) when run, it looks for the superblock geometry that matches the
>>>     most other superblocks on the disk, and takes that version as correct.
>>>
>>> So you had 16 superblocks (0-15) which were correct after the growfs.
>>> But sb 16 failed verification and the update was aborted, so nothing
>>> was updated after that.
>>> This means that 16 onward have the wrong number of AGs and disk blocks;
>>> i.e. they are the pre-growfs size, and there are 26 of them.
>>>
>>> Today, xfs_repair sees this 26-to-16 vote, and decides that the 26
>>> matching superblocks "win," rewrites the first superblock with this
>>> geometry, and uses that to verify the rest of the filesystem.  Hence
>>> anything post-growfs looks out of bounds, and gets nuked.
>>>
>>> So right now, I'm thinking that the "proper geometry" heuristic should
>>> be adjusted, but how to do that in general, I'm not sure.  Weighting
>>> sb 0 heavily, especially if it matches many subsequent superblocks,
>>> seems somewhat reasonable.
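
(Just to check my understanding of the heuristic and the proposed
weighting: roughly something like the sketch below, I guess. Purely
illustrative - not the actual repair code; geometry_t and all the
names here are made up:)

    /*
     * Illustrative sketch of a "most matching geometry wins" vote, as
     * I understand the description above.  Not xfs_repair code; the
     * types and helpers are hypothetical.
     */
    #include <stddef.h>
    #include <stdbool.h>

    typedef struct {
        unsigned long long dblocks;  /* filesystem size in blocks */
        unsigned int       agcount;  /* number of allocation groups */
    } geometry_t;

    static bool geo_equal(const geometry_t *a, const geometry_t *b)
    {
        return a->dblocks == b->dblocks && a->agcount == b->agcount;
    }

    /*
     * Each superblock votes for its own geometry; the geometry shared
     * by the most copies wins.  The extra weight for geometries that
     * agree with sb 0 is the tweak suggested above - without it, my
     * 26 stale pre-growfs copies outvote the 16 correct ones.
     */
    static size_t pick_geometry(const geometry_t *sb, size_t nsb)
    {
        size_t best = 0, best_votes = 0;

        for (size_t i = 0; i < nsb; i++) {
            size_t votes = 0;

            for (size_t j = 0; j < nsb; j++)
                if (geo_equal(&sb[i], &sb[j]))
                    votes++;
            if (geo_equal(&sb[i], &sb[0]))
                votes += nsb / 2;  /* arbitrary weight, illustration only */
            if (votes > best_votes) {
                best_votes = votes;
                best = i;
            }
        }
        return best;  /* index of the winning superblock */
    }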
>>
>> This would have been my next question! I repaired it w/ the git
>> xfs_repair on the FS after it had already been shrunk back to its
>> original size. I think, if I had done the same w/ the still-grown FS,
>> it most probably would have been reduced to its pre-growfs size.
>>
>> Wouldn't it be better not to grow at all if problems are detected?
>> That is: don't do the check after growing, but before? OK, I could
>> have done that myself ... . From now on, I will do it like this!
> 
> well, see the next couple patches I'm about to send to the list ... ;)

Cool!

> but a check prior wouldn't have helped you, because repair didn't detect
> the problem that growfs choked on.

The old xfs_repair! If I understood you correctly, your patched one
would have detected the problem.
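
(By "detected" I mean roughly the kind of check below - a minimal
sketch, assuming the verifier's job is to reject a superblock sector
whose unused tail isn't zeroed. The function and the 'used' offset are
my own invention, not the actual kernel or xfs_repair code:)

    #include <stddef.h>
    #include <stdbool.h>

    /*
     * Hypothetical verifier fragment: fail a superblock whose bytes
     * past the last used field contain junk instead of zeroes (the
     * junk the earlier resize operations left behind).
     */
    static bool sb_unused_tail_is_zero(const unsigned char *sector,
                                       size_t used, size_t sectsize)
    {
        for (size_t i = used; i < sectsize; i++)
            if (sector[i] != 0)
                return false;  /* junk found - verification fails */
        return true;
    }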

But generally speaking, you're right - it's impossible to get 100%
safety. Still, couldn't xfs_repair -n at least find other problems that
could then be repaired before growing the FS?
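
Something like this is what I'll do from now on (device and mount
point are placeholders for my backup disk):

    # dry run: report problems without modifying the filesystem
    xfs_repair -n /dev/sdX1

    # only if that comes back clean: mount and grow the data section
    mount /dev/sdX1 /backup
    xfs_growfs -d /backup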


Thanks,
regards,
Michael




