Re: Weird XFS Corruption Error

Hi Dave, 

thanks for your reply and I’m sorry for the delayed answer…

On 23.01.2014 at 00:31, Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> On Wed, Jan 22, 2014 at 05:09:10PM +0100, Sascha Askani wrote:
> 
> So, an inode extent map btree block failed verification for some
> reason. Hmmm - there should have been 4 lines of hexdump output
> there as well. Can you post that as well? Or have you modified
> /proc/sys/fs/xfs/error_level to have a value of 0 so it is not
> emitted?
> 

/proc/sys/fs/xfs/error_level is set to 3; sorry for not including this in my original post. The hexdump is pretty "boring" (or interesting, depending on your point of view):

[964197.435322] ffff881f8e59b000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[964197.862037] ffff881f8e59b010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[964198.288694] ffff881f8e59b020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[964198.712093] ffff881f8e59b030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
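
Just for completeness, this is roughly how the setting can be checked (the value shown is simply what is configured on this host):

$ cat /proc/sys/fs/xfs/error_level
3
$ sysctl fs.xfs.error_level
fs.xfs.error_level = 3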

> And not the disk address of the buffer? 0x1f0 - it's right near the
> start of the volume.
> 
> 
>> [964199.139324] XFS (dm-8): I/O Error Detected. Shutting down filesystem
>> [964199.139325] XFS (dm-8): Please umount the filesystem and rectify the problem(s)
>> [964212.367300] XFS (dm-8): xfs_log_force: error 5 returned.
>> [964242.477283] XFS (dm-8): xfs_log_force: error 5 returned.
>> ---------------------------------------------------------
>> 
>> After that, I tried the following (in order):
> 
> Do you have the output and log messages from these steps? That would
> be really helpful in confirming any diagnosis.

Unfortunately, the output got lost due to a reboot, but basically xfs_repair scanned the whole volume after failing to find the primary superblock, emitting millions of dots in the process.
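
Should this happen again, I'd capture the output along these lines (the device path is just a placeholder, not our actual LV):

# no-modify mode first, and keep a copy of everything xfs_repair prints
$ xfs_repair -n /dev/VG/LV 2>&1 | tee xfs_repair-n.log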

> 
>> 1. xfs_repair, which did not find the superblock and started scanning the LV; after finding the secondary superblock, it told me there was still something in the log, so I
> 
> Oh, wow. Ok, if the primary superblock is gone, along with metadata
> in the first few blocks of the filesystem, then something has
> overwritten the start of the block device the filesystem is on.
> 
>> 2. mounted the filesystem, which gave me a "Structure needs cleaning" after a couple of seconds
>> 3. tried mounting again for good measure, same error "Structure needs cleaning"
> 
> Right - the kernel can't read a valid superblock, either.

I just found these messages in the log, which were emitted when trying to mount the FS:

[964606.038733] XFS (dm-8): metadata I/O error: block 0x200 ("xlog_recover_do..(read#2)") error 117 numblks 16
[964606.515048] XFS (dm-8): log mount/recovery failed: error 117
[964606.515386] XFS (dm-8): log mount failed
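
For reference, error 117 is EUCLEAN, i.e. the same "Structure needs cleaning" that mount reported; on a typical Linux system it can be looked up like this (header path may vary by distribution):

$ grep EUCLEAN /usr/include/asm-generic/errno.h
#define EUCLEAN         117     /* Structure needs cleaning */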

> 
>> 4. xfs_repair -L, which repaired everything and effectively cleaned my filesystem in the process.
> 
> Recreating the primary superblock from the backup superblocks
> 
>> 5. mount the filesystem to find it empty.
> 
> Because the root inode was lost, along with AGI 0 and so all
> the inodes in the first AG were completely lost as all the redundant
> information that is used to find them was trashed.

Yes, and since at the time of the error there were only 2 files, 1 directory and 2 hardlinks on the fs, it's quite probable that everything is lost.

> 
>> Since then, I’m desperately trying to reproduce the problem,
>> but unfortunately to no avail. Can somebody give some insight on
>> the errors I encountered. I have previously operated 4,5PB worth
>> of XFS Filesystems for 3 years and never got an error similar to
>> this.
> 
> This doesn't look like an XFS problem. This looks like something
> overwrote the start of the block device underneath the XFS
> filesystem. I've seen this happen before with faulty SSDs, I've also
> seen it when someone issued a discard to the wrong location on a
> block device (you didn't run fstrim on the block device, did you?),
> and I've seen faulty RAID controllers cause similar issues. So right
> now I'd be looking at logs and so on for hardware/storage issues
> that occurred in the past couple of days as potential causes.

No, we did not perform any kind of trimming on the device, and there are no "discard" options set anywhere (mount options, lvm.conf, …). We have a pretty active MariaDB slave running on the same controller logical drive / LVM VG with no errors on the other filesystems so far; mylvmbackup does not seem to have any problems either.
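
This is roughly how we verified that (paths assume a standard setup; both checks came back clean here):

# any filesystem mounted with -o discard?
$ grep discard /proc/mounts
# LVM-level discards on lvremove/lvreduce?
$ grep issue_discards /etc/lvm/lvm.conf
        issue_discards = 0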

Thanks for your insights so far; if you need any more information, I'd be happy to provide it if possible.

Best regards,

Sascha 


