Hi Vegard,

On 07/15/2016 01:02 PM, Vegard Haugland wrote:
>> This time, omit the drive that shows up as "spare". Use all nine
>> others. You really want nine, so the redundancy in your array can
>> reconstruct when it hits the UREs you obviously have. See "Current
>> Pending Sector" != 0 in your smartctl reports.
>>
>> After it assembles the nine, issue "mdadm --run /dev/md4" if it didn't
>> start. Then "echo check >>/sys/block/md4/md/sync_action".
>>
>> Wait for that to finish. Then add the spare back to the array.
>
> OK. Here's what's been happening for the past two days. I ran this
> command to rebuild the array:
>
> # mdadm -A /dev/md4 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 \
> /dev/sdg3 /dev/sdh3 /dev/sdi3 /dev/sdj3 --force
> mdadm: forcing event count in /dev/sdc3(4) from 32679348 upto 32680549
> mdadm: forcing event count in /dev/sdd3(3) from 32677935 upto 32680549
> mdadm: clearing FAULTY flag for device 3 in /dev/md4 for /dev/sdd3
> mdadm: clearing FAULTY flag for device 2 in /dev/md4 for /dev/sdc3
> mdadm: Marking array /dev/md4 as 'clean'
> mdadm: /dev/md4 has been started with 8 drives (out of 10) and 1 spare.

Well, that sucks. I expected 9 drives out of 10 and no spare. You are
somewhat screwed.

> As the output mentioned, the array started and I got access to the
> data again. Yay! In order to start the rebuild, I ran "echo check >>
> /sys/block/md4/md/sync_action".
>
> The array just finished the rebuild, but not with the results I hoped
> for. Here's the output from mdadm -D /dev/md4:

You can't successfully check or rebuild without redundancy if any device
has an Unrecoverable Read Error (URE). And you know you have at least
one, because "Current Pending Sector" is greater than zero on sdc (maybe
on others too; I didn't check).

> For some reason, the other faulty disk (not the one mentioned below,
> or the one that initially showed up as spare) also shows up as spare
> now (like the good one did earlier).
> Issuing "echo check >> /sys/block/md4/md/sync_action" does not make
> any attempts to rebuild the array. Should I use mdadm --manage --add
> to re-add it before I replace the faulty disk with a new one?

It is possible your mdadm is too old to fully force the assembly
(--force) in this case, or it is a side effect of the old v0.90
metadata. With v1.x metadata and a write-intent bitmap, a re-add is
very fast and unlikely to hit other UREs. Too late for you.

Anyway, you can only use the eight non-spare drives. Since at least one
has a URE, you will need to use ddrescue to copy the drives with UREs
completely onto new drives (or onto your "spares" with their superblocks
erased). Then assemble the array (8 members) with the new drives and the
non-URE old drives. Then add spares one at a time, waiting for the
rebuild to finish for each.

If you have very critical data you want to make sure you retrieve,
assemble the eight non-spare drives one more time before making copies,
*don't* do the check (it reads every sector, and with no redundancy left
it can't recover from a URE), then mount and copy those critical files.
You might hit a URE anyway, but it is the best way to get a quick backup.

Once you start copying devices, be very careful to keep copies and roles
straight, so that you don't try to assemble with two devices that have
the same role number.

You *will* lose some data where the UREs are. You'll need to fsck to fix
the corruption, if it happens to be in a file or folder.

Phil
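Below is a minimal shell sketch of the copy-then-reassemble procedure described above. It assumes GNU ddrescue and mdadm are installed and that the array is /dev/md4. The function names (clone_member, assemble_members, add_spare) and every device name you pass in are hypothetical placeholders, not part of the original advice. The sketch only prints each command unless DRY_RUN=0, because a single wrong device name in any of these steps destroys data.

```shell
# Source this file into a root shell. Device names are YOUR placeholders.
# DRY_RUN=1 (the default) only prints commands; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: copy a member that has pending sectors onto a new disk.
# Usage: clone_member OLD_DISK NEW_DISK MAPFILE
# -f is required because the output is a device, not a regular file.
# The mapfile records what was read, so reruns only touch the bad areas.
clone_member() {
    run ddrescue -f -n "$1" "$2" "$3"    # fast pass, skip the scraping phase
    run ddrescue -f -r3 "$1" "$2" "$3"   # scrape bad areas, then retry up to 3 times
}

# Step 2: assemble the eight non-spare roles, using each copy in place
# of its URE-damaged original (never both for the same role number).
# --run starts the array even though it is degraded.
# Usage: assemble_members MEMBER...
assemble_members() {
    run mdadm --assemble --force --run /dev/md4 "$@"
}

# Step 3: add ONE spare and block until its rebuild finishes,
# then call this again for the next spare.
# Usage: add_spare PARTITION
add_spare() {
    run mdadm /dev/md4 --add "$1"
    run mdadm --wait /dev/md4
}
```

For example, with a hypothetical new disk /dev/sdk, `clone_member /dev/sdc /dev/sdk sdc.map` prints the two ddrescue commands for review. After that, set DRY_RUN=0 and run it again. Copying the whole disk also copies its partition table. Once the kernel rereads it (for instance with `blockdev --rereadpt /dev/sdk`), /dev/sdk3 stands in for /dev/sdc3. Before assembling, run `mdadm --examine` on each member to confirm its role number, so you never pass two devices with the same role.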