Hello Robin, hello Chris,

thanks for the help, the ideas and the discussion so far. And sorry for the late response, I am currently down with a cold.

Let me roll up the discussion so far with a little background:

One month ago, the RAID was expanded with the ninth HDD without any rebuild problems. Last weekend I upgraded the server and kept only the HDDs and a Marvell-based 4-port SATA controller. sda to sdf are connected to the onboard AMD SB950, sdg to sdj to the Marvell controller, which has always been a little troublesome, especially with Western Digitals.

Before adding a new HDD to the array, or using one as a replacement, I always run a badblocks write-read test on it, but of course this doesn't help when blocks go bad over time.

I posted the kernel logs since the last reboot before the RAID failed at
http://evilazrael.net/bilder2/logs/kernel_20130202.log (8k lines, 600 kB) or
http://evilazrael.net/bilder2/logs/kernel_20130202.log.gz (44 kB)
The SMART logs are at
http://evilazrael.net/bilder2/logs/smart_20130202.tar.gz
if somebody is curious. Yes, I roasted the Hitachis when I forgot to plug in the cage fan.

After sdg was expelled the first time (Jan 28 00:23), I ran an extended SMART test and then a read-write badblocks on it for almost 48 hours. After both found no errors I tried to re-add it (Jan 30 18:19). And on Jan 31 00:34 the UREs broke the rebuild and kicked both drives :\

In the last two days I did a non-destructive badblocks on all devices; only sdj consistently reports some UREs. After that I tried two force-assembles. The first broke on a read error on sdh. Then I retried, and this time the error on sdh didn't occur, but the later UREs on sdj killed the rebuild. At the beginning of the second try some automounter kicked in and mounted the FS and I saw the contents of the FS, so at least the first try didn't do any additional damage :-)

Tomorrow I will buy a new drive and dd_rescue sdj to the new drive.
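As a rough sketch of that cloning step: the script below assumes GNU ddrescue, which is a different tool from dd_rescue, and uses /dev/sdk and sdj_rescue.map as hypothetical names for the new drive and the map file. It only prints the commands unless DRY_RUN is set to 0, because the target device is overwritten.

```shell
#!/bin/sh
# Sketch only. SRC is the failing drive from this thread; DST and MAP are
# hypothetical names. GNU ddrescue is assumed, not dd_rescue.
SRC="${SRC:-/dev/sdj}"
DST="${DST:-/dev/sdk}"          # hypothetical: the new, empty drive
MAP="${MAP:-sdj_rescue.map}"    # hypothetical: map file, lets a run be resumed
DRY_RUN="${DRY_RUN:-1}"         # default: print the commands, do not run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

clone_drive() {
    # Pass 1: copy everything that reads cleanly, skipping the scraping phase.
    run ddrescue -f -n "$SRC" "$DST" "$MAP"
    # Pass 2: return to the bad areas with direct access and up to 3 retry passes.
    run ddrescue -f -d -r3 "$SRC" "$DST" "$MAP"
}

clone_drive
```

Both passes share the same map file, so the second pass only revisits the areas the first one could not read.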
And if that works then I will switch to RAID6 ASAP and check/replace all other drives. If not, I won't need the drives anymore.

>> Also I'd like to know what model disks these are, if they're AF or
>> not.

/dev/sdb  ST3000DM001-9YN1  CC4B  (Seagate Barracuda 7200)
/dev/sdc  WDC WD30EZRX-00M  80.0  (WDC Green SATA 3)
/dev/sdd  WDC WD30EZRS-00J  80.0  (WDC Green SATA 2)
/dev/sde  WDC WD30EFRX-68A  80.0  (WDC Red)
/dev/sdf  WDC WD30EURS-63R  80.0  (WDC AV-GP)
/dev/sdg  Hitachi HDS72303  MKAO  (Deskstar 7k3000)
/dev/sdh  Hitachi HDS72303  MKAO  (Deskstar 7k3000)
/dev/sdi  Hitachi HDS72303  MKAO  (Deskstar 7k3000)
/dev/sdj  WDC WD30EZRX-00M  80.0  (WDC Green SATA 3)

AV-GP and Red are marketed as 24/7 and RAID-capable, but the availability was bad.

> If you're using standard desktop drives then you may be running into
> issues with the drive timeout being longer than the kernel's. You need
> to reset one or the other to ensure that the drive times out (and is
> available for subsequent commands) before the kernel does. Most current
> consumer drives don't allow resetting the timeout, but it's worth trying
> that first before changing the kernel timeout. For each drive, do:
>     smartctl -l scterc,70,70 /dev/sdX || echo 180 > /sys/block/sdX/device/timeout
>

Only the WDC Red supports that. The drives on the Marvell controller all report

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

To be honest, I don't trust SMART much and prefer a write/read badblocks over SMART tests. But of course I won't do that on a disk which has data on it.

>>> Yes, if sdg still contains valid array data (and the array wasn't
>>> written since then) then it would definitely make more sense to
>>> recreate the array using it, leaving sdj out for now. That'll
>>> require more work checking mdadm versions and data offset values
>>> though. That'll avoid the issues with the unreadable blocks on
>>> sdj.
>>
>> Here's an idea.
>> One possibility is to use dd to read the sector on
>> sdg1 that error1.txt reported with the write error, to a file, and
>> see if there's a read error. If not, rewrite that data back to the
>> same sector and see if there's a write error. If not, attempt to
>> force assemble assume clean, get the array up in degraded mode, and
>> do a non-destructive fsck. If that's OK, just take a backup
>> immediately. Then sdj can be destructively written to, to force bad
>> sectors there to be removed for reserves, but still needs a smart
>> extended offline test to confirm; and then possibly reused and
>> rebuilt.
>>
> That won't work. He's already lost the metadata on sdg1 by trying to
> rebuild it in the first place, so a force assemble won't work. He'd
> need to recreate the array instead. Otherwise yes, that would sound
> to be the best option (assuming there's no other read errors on the
> other disks).

I think I don't like this part of the discussion ("That won't work").

I hope no question is left open.

Kind regards and thanks for all the help so far

Christoph

On 01.02.2013 20:57, Robin Hill wrote:
> On Fri Feb 01, 2013 at 10:27:57 -0700, Chris Murphy wrote:
>
>>
>> On Feb 1, 2013, at 6:34 AM, Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
>>> It'd also be useful to know whether sdg has been rewritten at
>>> all since then (i.e. whether the testing was destructive or not), and
>>> whether or not the array was written to at all since the failure of sdg.
>>
>> OP needs to reply back.
>>
>>
>
> Cheers,
> Robin

--
Christoph Nelles

E-Mail  : evilazrael@xxxxxxxxxxxxx
Jabber  : eazrael@xxxxxxxxxxxxxx
ICQ     : 78819723

PGP-Key : ID 0x424FB55B on subkeys.pgp.net
          or http://evilazrael.net/pgp.txt
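
For reference, the per-drive timeout one-liner quoted earlier in this mail could be looped over all array members roughly as follows. This is a minimal sketch: the drive list is taken from the table in this mail, the DRY_RUN switch is an assumption added for illustration, and the fallback relies on smartctl returning a non-zero status when a drive rejects the setting, as the quoted command does.

```shell
#!/bin/sh
# Sketch only: DRIVES and DRY_RUN are illustrative assumptions.
DRIVES="${DRIVES:-sdb sdc sdd sde sdf sdg sdh sdi sdj}"
DRY_RUN="${DRY_RUN:-1}"   # default: print the commands, do not run them

set_timeout() {
    dev="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "smartctl -l scterc,70,70 /dev/$dev || echo 180 > /sys/block/$dev/device/timeout"
        return 0
    fi
    # Ask the drive for a 7.0 s read/write error recovery limit (units of 100 ms).
    # If that fails, raise the kernel's command timeout for this device to 180 s.
    # Needs root for both smartctl and the sysfs write.
    smartctl -l scterc,70,70 "/dev/$dev" || echo 180 > "/sys/block/$dev/device/timeout"
}

for d in $DRIVES; do
    set_timeout "$d"
done
```

Neither setting is expected to survive a reboot, so a loop like this would have to run again at every boot.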