On Tue, Mar 5, 2013 at 12:23 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 04, 2013 at 10:03:29AM +0100, Ole Tange wrote:
>> On Fri, Mar 1, 2013 at 9:53 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> :
>> > What filesystem errors occurred when the drives went offline?
>>
>> See http://dna.ku.dk/~tange/tmp/syslog.3
>
> Your log is full of this:
>
> mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
>
> What's that mean?

We do not know, but it is something we are continually trying to find
out. We have 5 other systems using the same setup, and they log the
same messages. 1 of these 5 systems drops disks off the RAID, but the
rest work fine. In other words: we do not experience data corruption -
only disks dropping off the RAID. That leads me to believe it is some
kind of timeout error (a sketch of the check we have in mind is in the
P.S. below).

>> Feb 26 00:46:52 franklin kernel: [556238.429259] XFS (md5p1): metadata
>> I/O error: block 0x459b8 ("xfs_buf_iodone_callbacks") error 5 buf
>> count 4096
>
> So, the first IO errors appear at 23:00 on /dev/sdb, and the
> controller does a full reset and reprobe. Looks like a port failure
> of some kind. Notable:
>
> mpt2sas1: LSISAS2008: FWVersion(07.00.00.00), ChipRevision(0x03), BiosVersion(07.11.10.00)
>
> From a quick google, that firmware looks out of date (current
> LSISAS2008 firmwares are numbered 10 or 11, and bios versions are at
> 7.21).

We have tried updating the firmware using LSI's own tool. That fails,
as the LSI tool says the firmware is not signed correctly.

> /dev/md4 had a failure (/dev/sds) around 00:19, no rebuild started.

The rebuild of md4 is now complete.

> /dev/md1 had another failure (/dev/sdj) around 00:46, this time on a
> SYNCHRONISE CACHE command (i.e. log write). This IO failure caused
> the shutdown to occur. And this is the result:
>
> [556219.292225] end_request: I/O error, dev sdj, sector 10
> [556219.292275] md: super_written gets error=-5, uptodate=0
> [556219.292283] md/raid:md1: Disk failure on sdj, disabling device.
> [556219.292286] md/raid:md1: Operation continuing on 7 devices.
>
> At this point, /dev/md1 is reporting 7 working disks and has had an
> EIO on its superblock write, which means it's probably in an
> inconsistent state. Further, it's only got 8 disks associated with
> it, and as a rebuild is in progress it means that data loss has
> occurred with this failure. There's your problem.

Yep. What I would like to see from xfs_repair is that it salvages the
part that is not affected - which ought to be the primary part of the
100 TB. (The sequence I have in mind is sketched in the P.S.)

> Essentially, you need to fix your hardware before you do anything
> else. Get it all back fully online and fix whatever the problems are
> that are causing IO errors, then you can worry about recovering the
> filesystem and your data. Until the hardware is stable and not
> throwing errors, recovery is going to be unreliable (if not
> impossible).

As that has been an ongoing effort, it is unlikely to be solved within
a short timeframe.

/Ole
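
P.S. For the archive, a few command sketches related to the above.
They are illustrations under stated assumptions, not tested recipes.

First, one way to check whether the drives' internal error recovery
outlasts the kernel's SCSI command timer - a mismatch there is a
common reason for md kicking otherwise healthy disks out of an array
(/dev/sdb here is just an example device):

    # Drive side: how long the disk retries internally (SCT ERC):
    smartctl -l scterc /dev/sdb
    # Kernel side: seconds before the SCSI layer resets the device:
    cat /sys/block/sdb/device/timeout
    # If the drive lacks SCT ERC, raising the kernel timer gives it
    # time to finish its own error recovery instead of being dropped:
    echo 180 > /sys/block/sdb/device/timeout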
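
Second, assuming the LSI tool mentioned above is sas2flash (the
firmware and BIOS file names below are placeholders, not the actual
images):

    sas2flash -listall                          # controllers, FW and BIOS versions
    sas2flash -o -f 2118it.bin -b mptsas2.rom   # flash new firmware and BIOS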
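
Third, inspecting the state of the arrays and of a kicked member
before deciding what to do with it (md1 and sdj are the devices from
the log above):

    cat /proc/mdstat             # all arrays and rebuild progress
    mdadm --detail /dev/md1      # array state, active/failed members
    mdadm --examine /dev/sdj     # superblock and event count on the dropped disk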
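
Finally, the xfs_repair sequence I would try once the hardware is
stable, per Dave's advice. The filesystem must be unmounted first:

    xfs_repair -n /dev/md5p1     # dry run: report damage, change nothing
    xfs_repair /dev/md5p1        # the actual repair
    # If it refuses to run because of a dirty log that a mount/unmount
    # cycle cannot replay, zeroing the log is the last resort - it
    # discards whatever transactions were still in the log:
    xfs_repair -L /dev/md5p1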