On Wed, Jan 22, 2014 at 05:09:10PM +0100, Sascha Askani wrote:
> Hi everybody,
>
> We experienced a weird XFS corruption yesterday and I'm desperately
> trying to find out what happened. First, the setup:
>
> * ProLiant DL380p Gen8
> * 256GB RAM
> * HP SmartArray P420i Controller
> ** 1 GB BBWC
> ** Firmware Version 4.68
> ** 20x MK0100GCTYU 100GB SSD Drives
> ** Raid 1+0
> * LVM
> * Ubuntu 12.10 LTS
> * Kernel 3.11.0-15-generic #23~precise1-Ubuntu
>
> fstab entry:
> /dev/vg00/opt_mysqlbackup /opt/mysqlbackup xfs nobarrier,noatime,nodiratime,logbufs=8,logbsize=256k 0 2
>
> We created a 120GB LV mounted on /opt/mysqlbackup which (obviously)
> temporarily hosts our MariaDB backups until they are transferred to
> tape. We use mylvmbackup (http://www.lenzg.net/mylvmbackup/) to create
> a tar.gz file (approx. 55GB) containing the dump. While testing, I
> created hardlinks for 2 files in a subdir („safe“) and forgot them for
> a day while the „original“ file was deleted and replaced by the next
> day's backup.
>
> When I tried cleaning up the no longer needed files, I encountered the
> following:
>
> ---------------------------------------------------------
> me@hsoi-gts3-de02:/opt/mysqlbackup$ sudo rm -rf safe/
> sudo rm -rf safe/
> [sudo] password for saskani:
> rm: cannot remove `safe/daily_snapshot.tar.gz.md5': Input/output error
> ---------------------------------------------------------
>
> dmesg told me:
> ---------------------------------------------------------
> [964199.138848] XFS (dm-8): Internal error xfs_bmbt_read_verify at line 789 of file /build/buildd/linux-lts-saucy-3.11.0/fs/xfs/xfs_bmap_btree.c. Caller 0xffffffffa0164495
> [964199.138848]
> [964199.138850] CPU: 1 PID: 3694 Comm: kworker/1:1H Tainted: GF 3.11.0-15-generic #23~precise1-Ubuntu
> [964199.138851] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 09/18/2013
> [964199.138874] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
> [964199.138876]  0000000000000001 ffff881c6be6fd18 ffffffff8173bc0e 0000000000004364
> [964199.138878]  ffff883f9061c000 ffff881c6be6fd38 ffffffffa016629f ffffffffa0164495
> [964199.138879]  0000000000000001 ffff881c6be6fd78 ffffffffa016630e ffff881c6be6fda8
> [964199.138880] Call Trace:
> [964199.138886]  [<ffffffff8173bc0e>] dump_stack+0x46/0x58
> [964199.138906]  [<ffffffffa016629f>] xfs_error_report+0x3f/0x50 [xfs]
> [964199.138913]  [<ffffffffa0164495>] ? xfs_buf_iodone_work+0x95/0xc0 [xfs]
> [964199.138921]  [<ffffffffa016630e>] xfs_corruption_error+0x5e/0x90 [xfs]
> [964199.138928]  [<ffffffffa0164495>] ? xfs_buf_iodone_work+0x95/0xc0 [xfs]
> [964199.138939]  [<ffffffffa01944d6>] xfs_bmbt_read_verify+0x76/0xf0 [xfs]
> [964199.138946]  [<ffffffffa0164495>] ? xfs_buf_iodone_work+0x95/0xc0 [xfs]
> [964199.138949]  [<ffffffff81095bb2>] ? finish_task_switch+0x52/0xf0
> [964199.138969]  [<ffffffffa0164495>] xfs_buf_iodone_work+0x95/0xc0 [xfs]
> [964199.138972]  [<ffffffff81081060>] process_one_work+0x170/0x4a0
> [964199.138973]  [<ffffffff81082121>] worker_thread+0x121/0x390
> [964199.138975]  [<ffffffff81082000>] ? manage_workers.isra.21+0x170/0x170
> [964199.138977]  [<ffffffff81088fe0>] kthread+0xc0/0xd0
> [964199.138979]  [<ffffffff81088f20>] ? flush_kthread_worker+0xb0/0xb0
> [964199.138981]  [<ffffffff817508ac>] ret_from_fork+0x7c/0xb0
> [964199.138983]  [<ffffffff81088f20>] ? flush_kthread_worker+0xb0/0xb0
> [964199.138984] XFS (dm-8): Corruption detected. Unmount and run xfs_repair
> [964199.139014] XFS (dm-8): metadata I/O error: block 0x1f0 ("xfs_trans_read_buf_map") error 117 numblks 8
> [964199.139016] XFS (dm-8): xfs_do_force_shutdown(0x1) called from line 367 of file /build/buildd/linux-lts-saucy-3.11.0/fs/xfs/xfs_trans_buf.c. Return address = 0xffffffffa01cadbc

So, an inode extent map btree block failed verification for some
reason.

Hmmm - there should have been 4 lines of hexdump output there as well.
Can you post that too? Or have you modified
/proc/sys/fs/xfs/error_level to have a value of 0 so it is not emitted?

And note the disk address of the buffer: 0x1f0 - it's right near the
start of the volume.

> [964199.139324] XFS (dm-8): I/O Error Detected. Shutting down filesystem
> [964199.139325] XFS (dm-8): Please umount the filesystem and rectify the problem(s)
> [964212.367300] XFS (dm-8): xfs_log_force: error 5 returned.
> [964242.477283] XFS (dm-8): xfs_log_force: error 5 returned.
> ---------------------------------------------------------
>
> After that, I tried the following (in order):

Do you have the output and log messages from these steps? That would
be really helpful in confirming any diagnosis.

> 1. xfs_repair, which did not find the superblock and started scanning
> the LV, after finding the secondary superblock, it told me there's
> still something in the log, so I

Oh, wow. Ok, if the primary superblock is gone, along with metadata in
the first few blocks of the filesystem, then something has overwritten
the start of the block device the filesystem is on.

> 2. mounted the filesystem, which gave me a „Structure needs cleaning“
> after a couple of seconds
> 3. tried mounting again for good measure, same error „Structure needs
> cleaning“

Right - the kernel can't read a valid superblock, either.

> 4. xfs_repair -L which repaired everything, and effectively cleaned my
> filesystem in the process.

xfs_repair recreated the primary superblock from the backup
superblocks.

> 5. mount the filesystem to find it empty.

Because the root inode was lost, along with AGI 0, and so all the
inodes in the first AG were completely lost - all the redundant
information used to find them was trashed.

> Since then, I'm desperately trying to reproduce the problem, but
> unfortunately to no avail. Can somebody give some insight into the
> errors I encountered? I have previously operated 4.5PB worth of XFS
> filesystems for 3 years and never got an error similar to this.

This doesn't look like an XFS problem. This looks like something
overwrote the start of the block device underneath the XFS filesystem.
I've seen this happen before with faulty SSDs, I've also seen it when
someone issued a discard to the wrong location on a block device (you
didn't run fstrim on the block device, did you?), and I've seen faulty
RAID controllers cause similar issues.

So right now I'd be looking at logs and so on for hardware/storage
issues that occurred in the past couple of days as potential causes.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
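
[Follow-up notes, not part of the original thread.]

On the missing hexdump Dave asks about: the verbose corruption dump is
controlled by the fs.xfs.error_level sysctl (documented default 3;
0 suppresses it). A quick way to check and restore it - a sketch,
assuming the stock sysctl paths:

    # show the current XFS error reporting level (default is 3)
    cat /proc/sys/fs/xfs/error_level

    # restore the default so future corruption reports include the
    # 4-line hexdump of the bad buffer
    echo 3 > /proc/sys/fs/xfs/error_level
    # or, equivalently:
    sysctl -w fs.xfs.error_level=3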
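
On the "start of the volume was overwritten" theory: a read-only look
at the primary superblock region can confirm it. A sketch, reusing the
LV path from the report above (run while the filesystem is unmounted):

    # a healthy XFS filesystem has the magic bytes "XFSB" at offset 0
    # of the device
    dd if=/dev/vg00/opt_mysqlbackup bs=512 count=1 2>/dev/null | hexdump -C | head -4

    # or ask xfs_db (in read-only mode) to decode the primary superblock
    xfs_db -r -c "sb 0" -c "p magicnum" /dev/vg00/opt_mysqlbackup

For reference, the failing buffer at block 0x1f0 is a 512-byte address,
i.e. roughly 0x1f0 * 512 = 248KiB into the device - which is why Dave
points out that it sits right near the start of the volume.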
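
And since Dave asks for the output from the repair steps: a
non-destructive way to capture that before reaching for -L is a
no-modify run of xfs_repair. A sketch, again assuming the paths from
the report:

    umount /opt/mysqlbackup
    # -n = no-modify mode: report what would be repaired without
    # writing anything; keep the log for the list
    xfs_repair -n /dev/vg00/opt_mysqlbackup 2>&1 | tee xfs_repair-dryrun.log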