> This doesn't make a lot of sense. It should not have been marked
> as a spare unless someone explicitly tried to "Add" it to the
> array.
>
> However your description of events suggests that this was automatic
> which is strange.

Yes, it was entirely automatic. The only commands I had running on the computer when it happened were:

# watch -n 0.1 'uptime; echo; cat /proc/mdstat|grep md13 -A 2; echo; dmesg|tac'

This gave me a simple display of how the rebuild was progressing, plus a view of dmesg in case any new kernel messages appeared.

> Can I get the complete kernel logs from when the rebuild started
> to when you finally gave up? It might help me understand.

Sure. Just to confirm: /dev/sd{a,b,c,d,e,f}1 are the partitions that contain my up-to-date data, and /dev/sd{i,j}1 contain data that is many days old.

Here is the entire dmesg output during the rebuild:

[ 4245.3] md: md13 switched to read-write mode.
[ 4260.7] md: md13 still in use.
[ 4268.0] md: md13 still in use.
[ 4269.8] md: md13 still in use.
[ 4354.9] md: md13 still in use.
[ 4402.9] md: md13 switched to read-only mode.
[ 4408.1] md: md13 switched to read-write mode.

I had tried to add the two old disks (sdi and sdj) while the array was in read-only mode for the rebuild, but it would not let me. Is there any way to mark the six valid disks as read-only so they will not be modified during the rebuild (and not become spares, have their event count updated, etc.)?

[ 4418.3] md: bind<sdi1>
[ 4418.4] RAID5 conf printout:
[ 4418.4] --- rd:8 wd:6
[ 4418.4] disk 0, o:1, dev:sdi1
[ 4418.4] disk 1, o:1, dev:sdd1
[ 4418.4] disk 2, o:1, dev:sda1
[ 4418.4] disk 3, o:1, dev:sdf1
[ 4418.4] disk 5, o:1, dev:sdc1
[ 4418.4] disk 6, o:1, dev:sde1
[ 4418.4] disk 7, o:1, dev:sdb1
[ 4418.4] md: recovery of RAID array md13
[ 4418.4] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 4418.4] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[ 4418.4] md: using 128k window, over a total of 975490752 blocks.
[ 4421.8] md: md_do_sync() got signal ... exiting
[ 4421.9] md: md13 switched to read-only mode.
[ 4549.0] md: md13 switched to read-write mode.

I switched back to read-only mode again, hoping the rebuild would continue, but it stopped, so I went back to read-write mode and the rebuild resumed.

[ 4549.0] RAID5 conf printout:
[ 4549.0] --- rd:8 wd:6
[ 4549.0] disk 0, o:1, dev:sdi1
[ 4549.0] disk 1, o:1, dev:sdd1
[ 4549.0] disk 2, o:1, dev:sda1
[ 4549.0] disk 3, o:1, dev:sdf1
[ 4549.0] disk 5, o:1, dev:sdc1
[ 4549.0] disk 6, o:1, dev:sde1
[ 4549.0] disk 7, o:1, dev:sdb1
[ 4549.0] RAID5 conf printout:
[ 4549.0] --- rd:8 wd:6
[ 4549.0] disk 0, o:1, dev:sdi1
[ 4549.0] disk 1, o:1, dev:sdd1
[ 4549.0] disk 2, o:1, dev:sda1
[ 4549.0] disk 3, o:1, dev:sdf1
[ 4549.0] disk 5, o:1, dev:sdc1
[ 4549.0] disk 6, o:1, dev:sde1
[ 4549.0] disk 7, o:1, dev:sdb1
[ 4549.0] md: recovery of RAID array md13
[ 4549.0] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 4549.0] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[ 4549.0] md: using 128k window, over a total of 975490752 blocks.
[ 4549.0] md: resuming recovery of md13 from checkpoint.
[ 4628.7] mdadm[19700]: segfault at 0 ip 000000000041617f sp 00007fff87776290 error 4 in mdadm[400000+2a000]

The new version of mdadm that came with my Ubuntu 9.10 upgrade (running Linux 2.6.28) segfaults every time a new event happens, such as a disk being added or removed.
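If a backtrace of one of these crashes would be useful, I can try to capture one, roughly along these lines (a sketch only: it assumes gdb is available, that the crashing mdadm run inherits the core-dump limit, and that /sbin/mdadm and ./core are where the binary and core file end up):

# ulimit -c unlimited                            # allow a core file to be written in this shell
# mdadm --verbose --detail --scan /dev/md13      # repeat the kind of command that has been crashing
# gdb /sbin/mdadm ./core                         # open the core file the crash leaves behind
(gdb) bt                                         # print the backtrace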
Prior to the upgrade, using Linux 2.6.17 and whichever older version of mdadm it had, I never saw it segfault.

# mdadm --version
mdadm - v2.6.7.1 - 15th October 2008

[ 4647.7] ata1.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x6 frozen
[ 4647.7] ata1.00: cmd 61/80:00:87:3c:63/00:00:00:00:00/40 tag 0 ncq 65536 out
[ 4647.7]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 4647.7] ata1.00: status: { DRDY }
[ 4647.7] ata1.00: cmd 61/40:08:07:3d:63/00:00:00:00:00/40 tag 1 ncq 32768 out
[ 4647.7]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 4647.7] ata1.00: status: { DRDY }
[ 4647.7] ata1.00: cmd 61/b0:10:47:3d:63/00:00:00:00:00/40 tag 2 ncq 90112 out
[ 4647.7]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 4647.7] ata1.00: status: { DRDY }
[ 4647.7] ata1.00: cmd 61/b8:18:f7:3d:63/01:00:00:00:00/40 tag 3 ncq 225280 out
[ 4647.7]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 4647.7] ata1.00: status: { DRDY }
[ 4647.7] ata1.00: cmd 61/60:20:af:3f:63/02:00:00:00:00/40 tag 4 ncq 311296 out
[ 4647.7]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 4647.7] ata1.00: status: { DRDY }
[ 4647.7] ata1.00: cmd 61/08:28:0f:42:63/01:00:00:00:00/40 tag 5 ncq 135168 out
[ 4647.7]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 4647.7] ata1.00: status: { DRDY }
[ 4647.7] ata1.00: cmd 61/b0:30:d7:43:63/00:00:00:00:00/40 tag 6 ncq 90112 out
[ 4647.7]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 4647.7] ata1.00: status: { DRDY }
[ 4647.7] ata1.00: cmd 61/c0:38:17:43:63/00:00:00:00:00/40 tag 7 ncq 98304 out
[ 4647.7]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 4647.7] ata1.00: status: { DRDY }
[ 4647.7] ata1: hard resetting link
[ 4648.2] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 4648.2] ata1.00: configured for UDMA/133
[ 4648.2] ata1: EH complete

I've noticed that dmesg usually identifies disks as "ata1", "ata9" and so on, and I have found no way to convert these into the /dev/sdc style of name. Do you know how to translate these identifiers? It is quite frustrating not knowing which disk an error or message refers to, especially when two or three disks have issues at the same time.

[ 4648.2] sd 0:0:0:0: [sdi] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[ 4648.2] sd 0:0:0:0: [sdi] Write Protect is off
[ 4648.2] sd 0:0:0:0: [sdi] Mode Sense: 00 3a 00 00
[ 4648.2] sd 0:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

The following is where I added the last disk back into the set. I had hoped that both disks could rebuild simultaneously, but md seems to rebuild only one at a time. Is there any way to rebuild both disks together? It is frustrating to have two idle CPUs and low disk throughput during the rebuild. I'm guessing mdadm is not a threaded application.

I am actually going to keep /dev/sdj as a backup, in case there is no way to read the data from /dev/sdc successfully. sdj's data is a week older than the rest, but something would be better than nothing. Before trying anything that could break things, I would mount read-only and use rsync to copy the data off.
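Whatever set of disks I end up assembling, the copy itself would look roughly like this (the mount point and destination path below are just placeholders):

# mount -o ro /dev/md13 /mnt/md13                   # mount the array read-only so nothing gets written to it
# rsync -aHAX --progress /mnt/md13/ /backup/md13/   # copy the data off, preserving permissions, hard links and xattrs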
[ 4648.3] md: bind<sdj1>
[ 4661.8] mdadm[19774]: segfault at 0 ip 000000000041617f sp 00007fff7630ae00 error 4 in mdadm[400000+2a000]
[ 4662.2] mdadm[19854]: segfault at 0 ip 000000000041617f sp 00007fff72062b80 error 4 in mdadm[400000+2a000]
[ 4697.7] mdadm[19913]: segfault at 0 ip 000000000041617f sp 00007fffefb31640 error 4 in mdadm[400000+2a000]
[ 4697.7] mdadm[19912]: segfault at 0 ip 000000000041617f sp 00007fff9b1bacb0 error 4 in mdadm[400000+2a000]
[ 4697.9] mdadm[19997]: segfault at 0 ip 000000000041617f sp 00007fffd001fb10 error 4 in mdadm[400000+2a000]
[ 4697.9] mdadm[20016]: segfault at 0 ip 000000000041617f sp 00007fff4e9d44f0 error 4 in mdadm[400000+2a000]
[ 4916.6] md: unbind<sdj1>
[ 4916.6] md: export_rdev(sdj1)
[ 4935.3] md: export_rdev(sdj1)
[ 4935.4] md: bind<sdj1>

At this point it was rebuilding fine, with an ETA of 4.5 hours remaining out of the original 6.0, so I left the house. The following is the disk error that occurred while I was gone:

[13691.4] ata5.00: exception Emask 0x0 SAct 0x3ff SErr 0x0 action 0x0
[13691.4] ata5.00: irq_stat 0x40000008
[13691.4] ata5.00: cmd 60/98:20:7f:af:fa/00:00:31:00:00/40 tag 4 ncq 77824 in
[13691.4]          res 41/40:00:f7:af:fa/09:00:31:00:00/40 Emask 0x409 (media error) <F>
[13691.4] ata5.00: status: { DRDY ERR }
[13691.4] ata5.00: error: { UNC }
[13691.4] ata5.00: configured for UDMA/133
[13691.4] ata5: EH complete
[13691.4] sd 4:0:0:0: [sdc] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[13691.4] sd 4:0:0:0: [sdc] Write Protect is off
[13691.4] sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[13691.4] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[13693.4] ata5.00: exception Emask 0x0 SAct 0x3ff SErr 0x0 action 0x0
[13693.4] ata5.00: irq_stat 0x40000008
[13693.4] ata5.00: cmd 60/98:28:7f:af:fa/00:00:31:00:00/40 tag 5 ncq 77824 in
[13693.4]          res 41/40:00:f7:af:fa/09:00:31:00:00/40 Emask 0x409 (media error) <F>
[13693.4] ata5.00: status: { DRDY ERR }
[13693.4] ata5.00: error: { UNC }
[13693.4] ata5.00: configured for UDMA/133
[13693.4] ata5: EH complete

It seems to me that the drive simply disconnected and then reconnected. I have had this problem on all sorts of hardware with 2.6 kernels, which makes me think it is not always a hardware issue, and may be a Linux kernel/driver issue.
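To help tell a genuine media fault apart from a flaky link or controller, I plan to look at the drive's SMART counters once things settle down, roughly like this (assuming smartmontools is installed; reallocated/pending sectors and CRC errors are the attributes I would look for):

# smartctl -a /dev/sdc | grep -Ei 'reallocated|pending|uncorrectable|crc'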
[13693.4] sd 4:0:0:0: [sdc] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[13693.4] sd 4:0:0:0: [sdc] Write Protect is off
[13693.4] sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[13693.4] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[13694.4] ata5.00: exception Emask 0x0 SAct 0x3ff SErr 0x0 action 0x0
[13694.4] ata5.00: irq_stat 0x40000008
[13694.4] ata5.00: cmd 60/98:20:7f:af:fa/00:00:31:00:00/40 tag 4 ncq 77824 in
[13694.4]          res 41/40:00:f7:af:fa/09:00:31:00:00/40 Emask 0x409 (media error) <F>
[13694.4] ata5.00: status: { DRDY ERR }
[13694.4] ata5.00: error: { UNC }
[13694.4] ata5.00: configured for UDMA/133
[13694.4] ata5: EH complete
[13694.4] sd 4:0:0:0: [sdc] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[13694.4] sd 4:0:0:0: [sdc] Write Protect is off
[13694.4] sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[13694.4] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[13695.4] ata5.00: exception Emask 0x0 SAct 0x3ff SErr 0x0 action 0x0
[13695.4] ata5.00: irq_stat 0x40000008
[13695.4] ata5.00: cmd 60/98:28:7f:af:fa/00:00:31:00:00/40 tag 5 ncq 77824 in
[13695.4]          res 41/40:00:f7:af:fa/09:00:31:00:00/40 Emask 0x409 (media error) <F>
[13695.4] ata5.00: status: { DRDY ERR }
[13695.4] ata5.00: error: { UNC }
[13695.4] ata5.00: configured for UDMA/133
[13695.4] ata5: EH complete
[13695.4] sd 4:0:0:0: [sdc] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[13695.4] sd 4:0:0:0: [sdc] Write Protect is off
[13695.4] sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[13695.4] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[13696.4] ata5.00: exception Emask 0x0 SAct 0x3ff SErr 0x0 action 0x0
[13696.4] ata5.00: irq_stat 0x40000008
[13696.4] ata5.00: cmd 60/98:20:7f:af:fa/00:00:31:00:00/40 tag 4 ncq 77824 in
[13696.4]          res 41/40:00:f7:af:fa/09:00:31:00:00/40 Emask 0x409 (media error) <F>
[13696.4] ata5.00: status: { DRDY ERR }
[13696.4] ata5.00: error: { UNC }
[13696.4] ata5.00: configured for UDMA/133
[13696.4] ata5: EH complete
[13696.4] sd 4:0:0:0: [sdc] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[13696.4] sd 4:0:0:0: [sdc] Write Protect is off
[13696.4] sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[13696.4] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[13697.4] ata5.00: exception Emask 0x0 SAct 0x3ff SErr 0x0 action 0x0
[13697.4] ata5.00: irq_stat 0x40000008
[13697.4] ata5.00: cmd 60/98:28:7f:af:fa/00:00:31:00:00/40 tag 5 ncq 77824 in
[13697.4]          res 41/40:00:f7:af:fa/09:00:31:00:00/40 Emask 0x409 (media error) <F>
[13697.4] ata5.00: status: { DRDY ERR }
[13697.4] ata5.00: error: { UNC }
[13697.4] ata5.00: configured for UDMA/133
[13697.4] sd 4:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
[13697.4] sd 4:0:0:0: [sdc] Sense Key : Medium Error [current] [descriptor]
[13697.4] Descriptor sense data with sense descriptors (in hex):
[13697.4]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[13697.4]         31 fa af f7
[13697.4] sd 4:0:0:0: [sdc] Add. Sense: Unrecovered read error - auto reallocate failed
[13697.4] end_request: I/O error, dev sdc, sector 838512631
[13697.4] raid5:md13: read error not correctable (sector 838512568 on sdc1).
[13697.4] raid5: Disk failure on sdc1, disabling device.
[13697.4] raid5: Operation continuing on 5 devices.
This last line is something I have been baffled by -- how does a RAID-5 or RAID-6 device continue as "active" when fewer than the minimum number of disks is present? This happened when my RAID-5 swap array lost two disks, and it happened above on a RAID-6 left with only 5 of 8 disks. When I arrived home, it clearly reported the array as still "active".

[13697.4] raid5:md13: read error not correctable (sector 838512576 on sdc1).
[13697.4] raid5:md13: read error not correctable (sector 838512584 on sdc1).
[13697.4] raid5:md13: read error not correctable (sector 838512592 on sdc1).
[13697.4] ata5: EH complete
[13697.4] sd 4:0:0:0: [sdc] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[13697.4] sd 4:0:0:0: [sdc] Write Protect is off
[13697.4] sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[13697.4] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[13711.0] md: md13: recovery done.

What is this "recovery done" referring to? No recovery was completed.

[13711.1] RAID5 conf printout:
[13711.1] --- rd:8 wd:5
[13711.1] disk 0, o:1, dev:sdi1
[13711.1] disk 1, o:1, dev:sdd1
[13711.1] disk 2, o:1, dev:sda1
[13711.1] disk 3, o:1, dev:sdf1
[13711.1] disk 5, o:0, dev:sdc1
[13711.1] disk 6, o:1, dev:sde1
[13711.1] disk 7, o:1, dev:sdb1
[13711.1] RAID5 conf printout:
[13711.1] --- rd:8 wd:5
[13711.1] disk 1, o:1, dev:sdd1
[13711.1] disk 2, o:1, dev:sda1
[13711.1] disk 3, o:1, dev:sdf1
[13711.1] disk 5, o:0, dev:sdc1
[13711.1] disk 6, o:1, dev:sde1
[13711.1] disk 7, o:1, dev:sdb1
[13711.1] RAID5 conf printout:
[13711.1] --- rd:8 wd:5
[13711.1] disk 1, o:1, dev:sdd1
[13711.1] disk 2, o:1, dev:sda1
[13711.1] disk 3, o:1, dev:sdf1
[13711.1] disk 5, o:0, dev:sdc1
[13711.1] disk 6, o:1, dev:sde1
[13711.1] disk 7, o:1, dev:sdb1
[13711.1] RAID5 conf printout:
[13711.1] --- rd:8 wd:5
[13711.1] disk 1, o:1, dev:sdd1
[13711.1] disk 2, o:1, dev:sda1
[13711.1] disk 3, o:1, dev:sdf1
[13711.1] disk 6, o:1, dev:sde1
[13711.1] disk 7, o:1, dev:sdb1

I arrived home and ran the following commands (I have removed some of the duplicate commands):

# mdadm --verbose --verbose --detail --scan /dev/md13
# mdadm --verbose --verbose --detail --scan /dev/md13
# mdadm /dev/md13 --remove /dev/sdj1 /dev/sdi1
# mdadm --verbose --verbose --detail --scan /dev/md13
# mdadm /dev/md13 --remove /dev/sdc1
# mdadm --verbose --verbose --detail --scan /dev/md13
# mdadm /dev/md13 --re-add /dev/sdc1
# mdadm --verbose --verbose --detail --scan /dev/md13
# mdadm /dev/md13 --remove /dev/sdc1
# mdadm --verbose --verbose --detail --scan /dev/md13
# mdadm --readonly /dev/md13
# cat /proc/mdstat
# man mdadm
# mdadm --stop /dev/md13
# c; for disk in /dev/sd{a,b,c,d,e,f}1; do mdadm --examine "$disk"; read; c; done
# c; for disk in /dev/sd{a,b,c,d,e,f}1; do printf "$disk"; mdadm --examine "$disk" | g events; done
# mdadm --stop /dev/md13
# mdadm --assemble /dev/md13 --verbose --force /dev/sd{a,b,c,d,e,f}1
# mdadm --stop /dev/md13
# mdadm --verbose --examine /dev/sdc1

I also detached the /dev/sdc disk and reattached it to my other SATA controller.
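For reference, the event-count comparison in the loop above amounts to roughly the following when spelled out (the Update Time field is just an extra I intend to look at as well):

# for disk in /dev/sd{a,b,c,d,e,f}1; do printf '%s: ' "$disk"; mdadm --examine "$disk" | grep -E 'Events|Update Time'; done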
[21281.4] md: unbind<sdj1>
[21281.4] md: export_rdev(sdj1)
[21281.4] md: unbind<sdi1>
[21281.4] md: export_rdev(sdi1)
[21281.5] Buffer I/O error on device md13, logical block 1463236112
[21281.5] Buffer I/O error on device md13, logical block 1463236112
[21281.5] Buffer I/O error on device md13, logical block 1463236126
[21281.5] Buffer I/O error on device md13, logical block 1463236126
[21281.5] Buffer I/O error on device md13, logical block 1463236127
[21281.5] Buffer I/O error on device md13, logical block 1463236127
[21281.5] Buffer I/O error on device md13, logical block 1463236127
[21281.5] Buffer I/O error on device md13, logical block 1463236127
[21281.5] Buffer I/O error on device md13, logical block 1463236127
[21281.5] Buffer I/O error on device md13, logical block 1463236127
[21307.3] md: unbind<sdc1>
[21307.3] md: export_rdev(sdc1)
[21307.4] __ratelimit: 6 callbacks suppressed
[21307.4] Buffer I/O error on device md13, logical block 1463236112
[21307.4] Buffer I/O error on device md13, logical block 1463236112
[21307.4] Buffer I/O error on device md13, logical block 1463236126
[21307.4] Buffer I/O error on device md13, logical block 1463236126
[21307.4] Buffer I/O error on device md13, logical block 1463236127
[21307.4] Buffer I/O error on device md13, logical block 1463236127
[21307.4] Buffer I/O error on device md13, logical block 1463236127
[21307.4] Buffer I/O error on device md13, logical block 1463236127
[21307.4] Buffer I/O error on device md13, logical block 1463236127
[21307.4] Buffer I/O error on device md13, logical block 1463236127
[21323.4] md: bind<sdc1>
[21323.5] __ratelimit: 6 callbacks suppressed
[21323.5] Buffer I/O error on device md13, logical block 1463236112
[21323.5] Buffer I/O error on device md13, logical block 1463236112
[21323.5] Buffer I/O error on device md13, logical block 1463236126
[21323.5] Buffer I/O error on device md13, logical block 1463236126
[21323.5] Buffer I/O error on device md13, logical block 1463236127
[21323.5] Buffer I/O error on device md13, logical block 1463236127
[21323.5] Buffer I/O error on device md13, logical block 1463236127
[21323.5] Buffer I/O error on device md13, logical block 1463236127
[21323.5] Buffer I/O error on device md13, logical block 1463236127
[21323.5] Buffer I/O error on device md13, logical block 1463236127
[21350.1] md: unbind<sdc1>
[21350.1] md: export_rdev(sdc1)
[21350.2] __ratelimit: 6 callbacks suppressed
[21350.2] Buffer I/O error on device md13, logical block 1463236112
[21350.2] Buffer I/O error on device md13, logical block 1463236112
[21350.2] Buffer I/O error on device md13, logical block 1463236126
[21350.2] Buffer I/O error on device md13, logical block 1463236126
[21350.2] Buffer I/O error on device md13, logical block 1463236127
[21350.2] Buffer I/O error on device md13, logical block 1463236127
[21350.2] Buffer I/O error on device md13, logical block 1463236127
[21350.2] Buffer I/O error on device md13, logical block 1463236127
[21350.2] Buffer I/O error on device md13, logical block 1463236127
[21350.2] Buffer I/O error on device md13, logical block 1463236127
[21368.1] md: md13 switched to read-only mode.
[21368.1] __ratelimit: 6 callbacks suppressed
[21368.1] Buffer I/O error on device md13, logical block 1463236112
[21368.1] Buffer I/O error on device md13, logical block 1463236112
[21368.1] Buffer I/O error on device md13, logical block 1463236126
[21368.1] Buffer I/O error on device md13, logical block 1463236126
[21368.1] Buffer I/O error on device md13, logical block 1463236127
[21368.1] Buffer I/O error on device md13, logical block 1463236127
[21368.1] Buffer I/O error on device md13, logical block 1463236127
[21368.1] Buffer I/O error on device md13, logical block 1463236127
[21368.1] Buffer I/O error on device md13, logical block 1463236127
[21368.1] Buffer I/O error on device md13, logical block 1463236127
[21488.8] md: md13 stopped.
[21488.8] md: unbind<sdf1>
[21488.8] md: export_rdev(sdf1)
[21488.8] md: unbind<sda1>
[21488.8] md: export_rdev(sda1)
[21488.8] md: unbind<sdd1>
[21488.8] md: export_rdev(sdd1)
[21488.8] md: unbind<sde1>
[21488.8] md: export_rdev(sde1)
[21488.8] md: unbind<sdb1>
[21488.8] md: export_rdev(sdb1)
[22603.8] ata5: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
[22603.8] ata5: irq_stat 0x00400000, PHY RDY changed
[22603.8] ata5: SError: { PHYRdyChg LinkSeq TrStaTrns }
[22603.8] ata5: hard resetting link
[22604.5] ata5: SATA link down (SStatus 0 SControl 300)
[22609.5] ata5: hard resetting link
[22609.8] ata5: SATA link down (SStatus 0 SControl 300)
[22609.8] ata5: limiting SATA link speed to 1.5 Gbps
[22614.8] ata5: hard resetting link
[22615.2] ata5: SATA link down (SStatus 0 SControl 310)
[22615.2] ata5.00: disabled
[22615.2] ata5: EH complete
[22615.2] ata5.00: detaching (SCSI 4:0:0:0)
[22615.2] sd 4:0:0:0: [sdc] Synchronizing SCSI cache
[22615.2] sd 4:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[22615.2] sd 4:0:0:0: [sdc] Stopping disk
[22615.2] sd 4:0:0:0: [sdc] START_STOP FAILED
[22615.2] sd 4:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[22640.1] ata8: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0xe frozen
[22640.1] ata8: SError: { PHYRdyChg CommWake }
[22640.1] ata8: hard resetting link
[22640.8] ata8: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[22640.9] ata8.00: ATA-7: SAMSUNG HD103UJ, 1AA01109, max UDMA7
[22640.9] ata8.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 0/32)
[22640.9] ata8.00: configured for UDMA/100
[22640.9] ata8: EH complete
[22640.9] scsi 7:0:0:0: Direct-Access     ATA      SAMSUNG HD103UJ  1AA0 PQ: 0 ANSI: 5
[22640.9] sd 7:0:0:0: [sdc] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[22640.9] sd 7:0:0:0: [sdc] Write Protect is off
[22640.9] sd 7:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[22640.9] sd 7:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[22640.9] sd 7:0:0:0: [sdc] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
[22640.9] sd 7:0:0:0: [sdc] Write Protect is off
[22640.9] sd 7:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[22640.9] sd 7:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[22640.9]  sdc: sdc1 sdc2
[22640.9] sd 7:0:0:0: [sdc] Attached SCSI disk
[22640.9] sd 7:0:0:0: Attached scsi generic sg2 type 0
[22641.0] md: bind<sdc1>
[22687.9] md: md13 stopped.
[22687.9] md: unbind<sdc1>
[22687.9] md: export_rdev(sdc1)
[22804.2] md: md13 stopped.
[22804.2] md: bind<sda1>
[22804.2] md: bind<sdf1>
[22804.2] md: bind<sde1>
[22804.2] md: bind<sdb1>
[22804.2] md: bind<sdc1>
[22804.2] md: bind<sdd1>
[22864.5] md: md13 stopped.
[22864.5] md: unbind<sdd1>
[22864.6] md: export_rdev(sdd1)
[22864.6] md: unbind<sdc1>
[22864.6] md: export_rdev(sdc1)
[22864.6] md: unbind<sdb1>
[22864.6] md: export_rdev(sdb1)
[22864.6] md: unbind<sde1>
[22864.6] md: export_rdev(sde1)
[22864.6] md: unbind<sdf1>
[22864.6] md: export_rdev(sdf1)
[22864.6] md: unbind<sda1>
[22864.6] md: export_rdev(sda1)

> As long as there are two missing devices no resync will happen so the
> data will not be changed. So after doing a --create you can fsck and
> mount etc and ensure the data is safe before continuing.

Thank you, that is useful information. Do you know whether the data on /dev/sdc1 would have been altered as a result of it becoming a spare after it disconnected and reconnected itself?

> But if you cannot get through a sequential read of all devices without
> any read error, you won't be able to rebuild redundancy. (There are
> plans to make raid6 more robust in this scenario, but they are a long
> way from fruition yet).

Prior to attempting the rebuild, I did the following:

# dd if=/dev/sda1 of=/dev/null &
# dd if=/dev/sdb1 of=/dev/null &
# dd if=/dev/sdc1 of=/dev/null &
# dd if=/dev/sdd1 of=/dev/null &
# dd if=/dev/sde1 of=/dev/null &
# dd if=/dev/sdf1 of=/dev/null &
# dd if=/dev/sdi1 of=/dev/null &
# dd if=/dev/sdj1 of=/dev/null &

I left it running for about an hour, and none of the disks reported any errors. I really hope there is not a permanent fault 75% of the way through the disk. If it is just bad sectors, though, why would the disk be disconnecting from the system?

Thanks again for all your help.

- S.A.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html