> This applies
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

kernel version: 3.12.17
xfsprogs version: 3.1.7
number of CPUs: 12
contents of /proc/meminfo: 32 GiB RAM, 8 GiB swap; memory pressure on this server is generally very low
contents of /proc/mounts: /dev/sdb1 /RAIDS/RAID_1 xfs rw,noatime,attr2,inode64,logbsize=256k,sunit=512,swidth=2048,usrquota,grpquota 0 0
contents of /proc/partitions: 8 17 54690576384 sdb1
RAID layout: /dev/sdb is a 16-disk RAID-6 on a Broadcom MegaRAID 9361-series card
LVM configuration: none
type of disks you are using: WDC RE 4 TB SAS (WD4001FYYG-01SL3)
write cache status of drives: MegaRAID card has writeback enabled for this RAID
size of BBWC and mode it is running in: unknown
xfs_info output on the filesystem in question: no longer available
dmesg output showing all error messages and stack traces: no longer available

> Also, are the disk failures fixed? Is the RAID happy? I'm very
> skeptical of writing anything, including repairs, let alone rw
> mounting, a file system that's on a busted or questionably working
> storage stack. The storage stack needs to be in working order first.
> Is it?

This particular server is used for development purposes and the data
stored on it is replicated on other servers, so the integrity of the
data is not very important.

We have used XFS in our storage products for 15 years, mostly on
RAID-5 and RAID-6 arrays using LSI 3ware and Broadcom MegaRAID cards.
It is not uncommon for disks to fail and be replaced and for the RAID
to rebuild while the XFS filesystem is still in use, and we very
rarely experience XFS problems during or after a rebuild. In this
particular case, we suspected a malfunctioning RAID card and replaced
it, and we are replacing some faulty disks.

> OK why -L? Was there a previous mount attempt and if so, what kernel
> errors? Was there a previous repair attempt without -L? -L is a heavy
> hammer that shouldn't be needed unless the log is damaged and if the
> log is damaged or otherwise can't be replayed, you should get a kernel
> message about that.

Previously, mounting the XFS filesystem failed with a "structure must
be cleaned" error. That led to the first attempt at xfs_repair
without -L, which ended in an error complaining that the journal
needed to be replayed. But since I couldn't mount the filesystem,
replaying the log was impossible, so the second xfs_repair attempt
was with -L.

I needed to make this server functional again quickly, and since I
didn't care about losing the data, I simply reformatted the RAID
(`mkfs.xfs -f`), so I won't be able to reproduce the xfs_repair
error. In my eight years using XFS, I've never seen that error
before, so I thought it would be interesting to report it to the
list and see what I could learn about it.
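For reference, the sequence of commands was roughly the following;
the device name is taken from the /proc/mounts entry above, and since
the filesystem has been reformatted I can't reproduce the exact error
output:

    # Mount attempt; failed with the "structure must be cleaned" error:
    mount /dev/sdb1 /RAIDS/RAID_1

    # First repair attempt, without -L; it stopped with an error saying
    # the log needed to be replayed, which requires a mount I couldn't do:
    xfs_repair /dev/sdb1

    # Second attempt, zeroing the log since it couldn't be replayed;
    # this is the run that ended with the "report the bug" message:
    xfs_repair -L /dev/sdb1

    # Since the data was expendable, I then reformatted and moved on:
    mkfs.xfs -f /dev/sdb1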
Regards,

Rich Otero
EditShare
rotero@xxxxxxxxxxxxx
617-782-0479

On Wed, Jun 26, 2019 at 5:04 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Jun 26, 2019 at 2:32 PM Rich Otero <rotero@xxxxxxxxxxxxx> wrote:
> >
> > I have an XFS filesystem of approximately 56 TB on a RAID that has
> > been experiencing some disk failures. The disk problems seem to have
> > led to filesystem corruption, so I attempted to repair the filesystem
>
> This applies
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> Also, are the disk failures fixed? Is the RAID happy? I'm very
> skeptical of writing anything, including repairs, let alone rw
> mounting, a file system that's on a busted or questionably working
> storage stack. The storage stack needs to be in working order first.
> Is it?
>
> > with `xfs_repair -L <device>`. Xfs_repair finished with a message
> > stating that an error occurred and to report the bug.
>
> OK why -L? Was there a previous mount attempt and if so, what kernel
> errors? Was there a previous repair attempt without -L? -L is a heavy
> hammer that shouldn't be needed unless the log is damaged and if the
> log is damaged or otherwise can't be replayed, you should get a kernel
> message about that.
>
> --
> Chris Murphy
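P.S. On the question of whether the RAID is happy: I didn't capture
the controller state before reformatting, but for these MegaRAID
cards the array, disk, and BBU status can be checked before
attempting any repair. A rough sketch using Broadcom's storcli
utility, assuming the controller is /c0:

    # Controller summary, including virtual- and physical-drive status:
    storcli /c0 show

    # State of each virtual drive ("Optl" means optimal):
    storcli /c0/vall show

    # Per-disk state across all enclosures and slots:
    storcli /c0/eall/sall show

    # Battery/BBU status, relevant to the writeback cache setting above:
    storcli /c0/bbu show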