On Tue, Mar 5, 2013 at 12:23 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 04, 2013 at 10:03:29AM +0100, Ole Tange wrote:
>> On Fri, Mar 1, 2013 at 9:53 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> :
>> > What filesystem errors occurred when the drives went offline?
>>
>> See http://dna.ku.dk/~tange/tmp/syslog.3
>
> Your log is full of this:
>
> mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
>
> What's that mean?

We do not know, but it is something we are continually trying to find
out. We have 5 other systems using the same setup, and they log the
same messages. 1 of these 5 systems drops disks off the RAID, but the
rest work fine. In other words: we do not experience data corruption -
only disks dropping off the RAID. That leads me to believe it is some
kind of timeout error (a sketch of the check we have in mind is in the
P.S. below).

>> Feb 26 00:46:52 franklin kernel: [556238.429259] XFS (md5p1): metadata
>> I/O error: block 0x459b8 ("xfs_buf_iodone_callbacks") error 5 buf
>> count 4096
>
> So, the first IO errors appear at 23:00 on /dev/sdb, and the
> controller does a full reset and reprobe. Looks like a port failure
> of some kind. Notable:
>
> mpt2sas1: LSISAS2008: FWVersion(07.00.00.00), ChipRevision(0x03), BiosVersion(07.11.10.00)
>
> From a quick google, that firmware looks out of date (current
> LSISAS2008 firmwares are numbered 10 or 11, and bios versions are at
> 7.21).

We have tried updating the firmware using LSI's own tool. That fails,
as the LSI tool says the firmware is not signed correctly.

> /dev/md4 had a failure (/dev/sds) around 00:19, no rebuild started.

The rebuild of md4 is now complete.

> /dev/md1 had another failure (/dev/sdj) around 00:46, this time on a
> SYNCHRONISE CACHE command (i.e. log write). This IO failure caused
> the shutdown to occur. And this is the result:
>
> [556219.292225] end_request: I/O error, dev sdj, sector 10
> [556219.292275] md: super_written gets error=-5, uptodate=0
> [556219.292283] md/raid:md1: Disk failure on sdj, disabling device.
> [556219.292286] md/raid:md1: Operation continuing on 7 devices.
>
> At this point, /dev/md1 is reporting 7 working disks and has had an
> EIO on its superblock write, which means it's probably in an
> inconsistent state. Further, it's only got 8 disks associated with
> it, and as a rebuild is in progress it means that data loss has
> occurred with this failure. There's your problem.

Yep. What I would like to see from xfs_repair is that it salvages the
part that is not affected - which ought to be the primary part of the
100 TB. (The sequence I have in mind is sketched in the P.S.)

> Essentially, you need to fix your hardware before you do anything
> else. Get it all back fully online and fix whatever the problems are
> that are causing IO errors, then you can worry about recovering the
> filesystem and your data. Until the hardware is stable and not
> throwing errors, recovery is going to be unreliable (if not
> impossible).

As that has been an ongoing effort, it is unlikely to be solved within
a short timeframe.

/Ole
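
P.S. For the archive, a few command sketches related to the above.
They are illustrations under stated assumptions, not tested recipes.

First, one way to check whether the drives' internal error recovery
outlasts the kernel's SCSI command timer - a mismatch there is a
common reason for md kicking otherwise healthy disks out of an array
(/dev/sdb here is just an example device):

    # Drive side: how long the disk retries internally (SCT ERC):
    smartctl -l scterc /dev/sdb
    # Kernel side: seconds before the SCSI layer resets the device:
    cat /sys/block/sdb/device/timeout
    # If the drive lacks SCT ERC, raising the kernel timer gives it
    # time to finish its own error recovery instead of being dropped:
    echo 180 > /sys/block/sdb/device/timeout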
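
Second, assuming the LSI tool mentioned above is sas2flash (the
firmware and BIOS file names below are placeholders, not the actual
images):

    sas2flash -listall                          # controllers, FW and BIOS versions
    sas2flash -o -f 2118it.bin -b mptsas2.rom   # flash new firmware and BIOS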
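
Third, inspecting the state of the arrays and of a kicked member
before deciding what to do with it (md1 and sdj are the devices from
the log above):

    cat /proc/mdstat             # all arrays and rebuild progress
    mdadm --detail /dev/md1      # array state, active/failed members
    mdadm --examine /dev/sdj     # superblock and event count on the dropped disk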
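
Finally, the xfs_repair sequence I would try once the hardware is
stable, per Dave's advice. The filesystem must be unmounted first:

    xfs_repair -n /dev/md5p1     # dry run: report damage, change nothing
    xfs_repair /dev/md5p1        # the actual repair
    # If it refuses to run because of a dirty log that a mount/unmount
    # cycle cannot replay, zeroing the log is the last resort - it
    # discards whatever transactions were still in the log:
    xfs_repair -L /dev/md5p1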