Recovering from two almost simultaneously failed devices in RAID1

Hi there

I fear one of our mainboards did not play nicely with our SSDs in a
RAID1 configuration:
mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri Jul 27 11:58:50 2012
     Raid Level : raid1
     Array Size : 250050533 (238.47 GiB 256.05 GB)
  Used Dev Size : 250050533 (238.47 GiB 256.05 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sat Aug 10 14:58:30 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       0        0        1      removed

       1       8       33        -      faulty spare   /dev/sdc1
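
As a sanity check on the sizes above: mdadm reports "Array Size" in KiB,
so the two figures in parentheses can be reproduced from the raw number.
This is only a small sketch; the helper name array_size is made up for
illustration.

```python
KIB = 1024  # mdadm prints Array Size in units of 1024 bytes


def array_size(kib):
    """Convert a size in KiB to (bytes, GiB, GB)."""
    size_bytes = kib * KIB
    return size_bytes, size_bytes / 2**30, size_bytes / 10**9


# 250050533 is the "Array Size" value from the mdadm output above.
size_bytes, gib, gb = array_size(250050533)
print(f"{size_bytes} bytes = {gib:.2f} GiB = {gb:.2f} GB")
```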


It seems both drives ran into problems at almost the same time: sdc was
taken offline first, but sdd then hit I/O errors as well (see the log at
the end of this email).
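
To put a number on "almost the same time", the kernel timestamps can be
compared directly. The sketch below uses four lines copied from the log
at the end of this email (rejoined where the mail client wrapped them).
That ata3 is sdc and ata4 is sdd is an assumption based on the order of
the messages in that log; the names LOG, EVENT_RE and first_events are
made up for illustration.

```python
import re
from decimal import Decimal

# Four lines taken from the kernel log quoted in this email.
LOG = """\
Aug 10 14:57:30 gitmaster kernel: [10731321.352291] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Aug 10 14:57:30 gitmaster kernel: [10731321.352528] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Aug 10 14:58:30 gitmaster kernel: [10731381.365585] ata3.00: disabled
Aug 10 14:58:30 gitmaster kernel: [10731381.453536] ata4.00: disabled
"""

EVENT_RE = re.compile(r"\[(\d+\.\d+)\] (ata\d+)\.\d+: (exception|disabled)")


def first_events(log_text):
    """Map port -> {event: kernel timestamp of its first occurrence}."""
    events = {}
    for match in EVENT_RE.finditer(log_text):
        stamp, port, kind = match.groups()
        # Decimal keeps the microsecond digits exact when subtracting.
        events.setdefault(port, {}).setdefault(kind, Decimal(stamp))
    return events


events = first_events(LOG)
for port in sorted(events):
    seen = events[port]
    elapsed = seen["disabled"] - seen["exception"]
    print(port, "first exception at", seen["exception"],
          "disabled at", seen["disabled"], "after", elapsed, "s")

gap = events["ata4"]["disabled"] - events["ata3"]["disabled"]
print("ata4 was disabled", gap, "s after ata3")
```

With these timestamps, both ports time out within a fraction of a
millisecond of each other and are disabled less than 0.1 s apart, about
a minute after the first exception.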

The ext4 filesystem on top of the array of course had no way of coping
with this other than remounting read-only.

The big questions, of course, are:

(a) how to retrieve as much data as possible from the disks
(b) how to revive the RAID array again

Any thoughts on what I should try first?

To tackle (a), I think I'll image the disks with ddrescue first, so that
I have a copy to fall back on if I make a mistake later on.
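
A rough sketch of what that imaging step could look like, following the
two-pass pattern from the GNU ddrescue manual: a first pass with -n that
skips the slow scraping phase, then a second pass with -d and -r that
retries the bad areas using direct disc access. The script only prints
the command lines and runs nothing. The image and mapfile paths under
/mnt/rescue are hypothetical, as is the helper name ddrescue_passes.

```python
import shlex


def ddrescue_passes(source, image, mapfile, retries=3):
    """Return the two ddrescue invocations as argv lists (nothing is executed)."""
    # Pass 1: copy everything that reads easily, skip the scraping phase.
    first = ["ddrescue", "-n", source, image, mapfile]
    # Pass 2: direct disc access, retry the remaining bad areas a few times.
    second = ["ddrescue", "-d", f"-r{retries}", source, image, mapfile]
    return [first, second]


# Hypothetical target paths; the source is the surviving mirror half.
for argv in ddrescue_passes("/dev/sdd1", "/mnt/rescue/sdd1.img",
                            "/mnt/rescue/sdd1.map"):
    print(shlex.join(argv))
```

Reusing the same mapfile for both passes lets the second run resume
where the first one left off instead of rereading good areas.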

Cheers

Carsten


Here's the start of the log:

Aug 10 14:57:30 gitmaster kernel: [10731321.352291] ata3.00: exception
Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Aug 10 14:57:30 gitmaster kernel: [10731321.352350] ata3.00: failed
command: WRITE FPDMA QUEUED
Aug 10 14:57:30 gitmaster kernel: [10731321.352380] ata3.00: cmd
61/02:00:47:00:00/00:00:00:00:00/40 tag 0 ncq 1024 out
Aug 10 14:57:30 gitmaster kernel: [10731321.352380]          res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 10 14:57:30 gitmaster kernel: [10731321.352469] ata3.00: status: {
DRDY }
Aug 10 14:57:30 gitmaster kernel: [10731321.352495] ata3: hard resetting
link
Aug 10 14:57:30 gitmaster kernel: [10731321.352528] ata4.00: exception
Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Aug 10 14:57:30 gitmaster kernel: [10731321.352574] ata4.00: failed
command: WRITE FPDMA QUEUED
Aug 10 14:57:30 gitmaster kernel: [10731321.352604] ata4.00: cmd
61/02:00:47:00:00/00:00:00:00:00/40 tag 0 ncq 1024 out
Aug 10 14:57:30 gitmaster kernel: [10731321.352605]          res
40/00:00:47:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Aug 10 14:57:30 gitmaster kernel: [10731321.352695] ata4.00: status: {
DRDY }
Aug 10 14:57:30 gitmaster kernel: [10731321.352721] ata4: hard resetting
link
Aug 10 14:57:35 gitmaster kernel: [10731326.709171] ata3: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:35 gitmaster kernel: [10731326.721137] ata4: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:40 gitmaster kernel: [10731331.354487] ata3: COMRESET
failed (errno=-16)
Aug 10 14:57:40 gitmaster kernel: [10731331.354518] ata3: hard resetting
link
Aug 10 14:57:40 gitmaster kernel: [10731331.370448] ata4: COMRESET
failed (errno=-16)
Aug 10 14:57:40 gitmaster kernel: [10731331.370480] ata4: hard resetting
link
Aug 10 14:57:45 gitmaster kernel: [10731336.715383] ata3: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:45 gitmaster kernel: [10731336.735346] ata4: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:50 gitmaster kernel: [10731341.360692] ata3: COMRESET
failed (errno=-16)
Aug 10 14:57:50 gitmaster kernel: [10731341.360723] ata3: hard resetting
link
Aug 10 14:57:50 gitmaster kernel: [10731341.388654] ata4: COMRESET
failed (errno=-16)
Aug 10 14:57:50 gitmaster kernel: [10731341.388686] ata4: hard resetting
link
Aug 10 14:57:55 gitmaster kernel: [10731346.721587] ata3: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:55 gitmaster kernel: [10731346.749571] ata4: link is slow
to respond, please be patient (ready=0)
Aug 10 14:58:01 gitmaster /USR/SBIN/CRON[10885]: (root) CMD (cd
/srv/gitorious && rake ultrasphinx:index RAILS_ENV=production >
/dev/null 2>&1)
Aug 10 14:58:25 gitmaster kernel: [10731376.344429] ata3: COMRESET
failed (errno=-16)
Aug 10 14:58:25 gitmaster kernel: [10731376.344464] ata3: limiting SATA
link speed to 1.5 Gbps
Aug 10 14:58:25 gitmaster kernel: [10731376.344497] ata3: hard resetting
link
Aug 10 14:58:25 gitmaster kernel: [10731376.424371] ata4: COMRESET
failed (errno=-16)
Aug 10 14:58:25 gitmaster kernel: [10731376.424403] ata4: limiting SATA
link speed to 1.5 Gbps
Aug 10 14:58:25 gitmaster kernel: [10731376.424436] ata4: hard resetting
link
Aug 10 14:58:30 gitmaster kernel: [10731381.365521] ata3: COMRESET
failed (errno=-16)
Aug 10 14:58:30 gitmaster kernel: [10731381.365554] ata3: reset failed,
giving up
Aug 10 14:58:30 gitmaster kernel: [10731381.365585] ata3.00: disabled
Aug 10 14:58:30 gitmaster kernel: [10731381.365610] ata3.00: device
reported invalid CHS sector 0
Aug 10 14:58:30 gitmaster kernel: [10731381.365643] ata3: EH complete
Aug 10 14:58:30 gitmaster kernel: [10731381.365675] sd 2:0:0:0: [sdc]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.365701] sd 2:0:0:0: [sdc]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.365748] sd 2:0:0:0: [sdc]
CDB: Write(10): 2a 00 00 00 00 47 00 00 02 00
Aug 10 14:58:30 gitmaster kernel: [10731381.365816] end_request: I/O
error, dev sdc, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.365844] end_request: I/O
error, dev sdc, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.365871] md: super_written
gets error=-5, uptodate=0
Aug 10 14:58:30 gitmaster kernel: [10731381.365900] md/raid1:md2: Disk
failure on sdc1, disabling device.
Aug 10 14:58:30 gitmaster kernel: [10731381.365900] md/raid1:md2:
Operation continuing on 1 devices.
Aug 10 14:58:30 gitmaster kernel: [10731381.453474] ata4: COMRESET
failed (errno=-16)
Aug 10 14:58:30 gitmaster kernel: [10731381.453505] ata4: reset failed,
giving up
Aug 10 14:58:30 gitmaster kernel: [10731381.453536] ata4.00: disabled
Aug 10 14:58:30 gitmaster kernel: [10731381.453565] ata4: EH complete
Aug 10 14:58:30 gitmaster kernel: [10731381.453596] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.453621] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.453669] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 00 00 00 47 00 00 02 00
Aug 10 14:58:30 gitmaster kernel: [10731381.453737] end_request: I/O
error, dev sdd, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.453765] end_request: I/O
error, dev sdd, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.453792] md: super_written
gets error=-5, uptodate=0
Aug 10 14:58:30 gitmaster kernel: [10731381.453867] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.453894] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.453941] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 00 00 00 47 00 00 02 00
Aug 10 14:58:30 gitmaster kernel: [10731381.454010] end_request: I/O
error, dev sdd, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.454036] end_request: I/O
error, dev sdd, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.454064] md: super_written
gets error=-5, uptodate=0
Aug 10 14:58:30 gitmaster kernel: [10731381.454136] RAID1 conf printout:
Aug 10 14:58:30 gitmaster kernel: [10731381.454140]  --- wd:1 rd:2
Aug 10 14:58:30 gitmaster kernel: [10731381.454143]  disk 0, wo:0, o:1,
dev:sdd1
Aug 10 14:58:30 gitmaster kernel: [10731381.454146]  disk 1, wo:1, o:0,
dev:sdc1
Aug 10 14:58:30 gitmaster kernel: [10731381.477438] RAID1 conf printout:
Aug 10 14:58:30 gitmaster kernel: [10731381.477442]  --- wd:1 rd:2
Aug 10 14:58:30 gitmaster kernel: [10731381.477446]  disk 0, wo:0, o:1,
dev:sdd1
Aug 10 14:58:30 gitmaster kernel: [10731381.477477] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.477514] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.477562] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 0e c7 da 6f 00 00 18 00
Aug 10 14:58:30 gitmaster kernel: [10731381.477630] end_request: I/O
error, dev sdd, sector 247978607
Aug 10 14:58:30 gitmaster kernel: [10731381.477728] Aborting journal on
device md2-8.
Aug 10 14:58:30 gitmaster kernel: [10731381.477774] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.477802] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.477851] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 0e c4 08 3f 00 00 08 00
Aug 10 14:58:30 gitmaster kernel: [10731381.477922] end_request: I/O
error, dev sdd, sector 247728191
Aug 10 14:58:30 gitmaster kernel: [10731381.477944] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.477945] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.477947] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 00 00 08 3f 00 00 08 00
Aug 10 14:58:30 gitmaster kernel: [10731381.477950] end_request: I/O
error, dev sdd, sector 2111
Aug 10 14:58:30 gitmaster kernel: [10731381.477982] Buffer I/O error on
device md2, logical block 0
Aug 10 14:58:30 gitmaster kernel: [10731381.477983] lost page write due
to I/O error on md2
Aug 10 14:58:30 gitmaster kernel: [10731381.478011] EXT4-fs error
(device md2): ext4_journal_start_sb:327: Detected aborted journal
Aug 10 14:58:30 gitmaster kernel: [10731381.478013] EXT4-fs (md2):
Remounting filesystem read-only
Aug 10 14:58:30 gitmaster kernel: [10731381.478014] EXT4-fs (md2):
previous I/O error to superblock detected
Aug 10 14:58:30 gitmaster kernel: [10731381.478052] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.478054] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.478055] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 00 00 08 3f 00 00 08 00
Aug 10 14:58:30 gitmaster kernel: [10731381.478059] end_request: I/O
error, dev sdd, sector 2111
Aug 10 14:58:30 gitmaster kernel: [10731381.478078] Buffer I/O error on
device md2, logical block 0
Aug 10 14:58:30 gitmaster kernel: [10731381.478079] lost page write due
to I/O error on md2
Aug 10 14:58:30 gitmaster kernel: [10731381.485182] Buffer I/O error on
device md2, logical block 30965760
Aug 10 14:58:30 gitmaster kernel: [10731381.485184] lost page write due
to I/O error on md2
Aug 10 14:58:30 gitmaster kernel: [10731381.485190] JBD2: I/O error
detected when updating journal superblock for md2-8.
Aug 10 14:58:30 gitmaster mdadm[1470]: Fail event detected on md device
/dev/md/2, component device /dev/sdc1
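
For reading the "CDB: Write(10)" lines in the log above: in a WRITE(10)
command, bytes 2-5 hold the big-endian logical block address and bytes
7-8 the transfer length in blocks. The following minimal sketch decodes
the CDBs quoted in the log; the function name decode_write10 is made up
for illustration.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


# CDBs copied verbatim from the log above.
for cdb_hex in (
    "2a 00 00 00 00 47 00 00 02 00",
    "2a 00 0e c7 da 6f 00 00 18 00",
    "2a 00 0e c4 08 3f 00 00 08 00",
    "2a 00 00 00 08 3f 00 00 08 00",
):
    lba, blocks = decode_write10(cdb_hex)
    print(f"{cdb_hex} -> sector {lba}, {blocks} blocks")
```

The decoded addresses (71, 247978607, 247728191 and 2111) match the
sector numbers in the corresponding "end_request: I/O error" lines.
Assuming 512-byte logical sectors, the 2-block write to sector 71 is
also consistent with the "ncq 1024 out" in the failed command.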



-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
phone/fax: +49 511 762-17185 / -17193
https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome
