These are probably desktop-class drives, not enterprise drives. If so, what you're probably hitting are media errors, i.e. the drive running into a bad sector. A consumer-level drive will try as hard as it can to retrieve the data, for somewhere around 15 seconds I'd say, depending on the manufacturer. That is often long enough to get the drive kicked out of the array. Enterprise-level drives usually ship with this extended retry behaviour disabled, so that the RAID layer can take care of the error instead.

Some manufacturers provide utilities to manage this feature. Until late last year, for example, Western Digital had a utility called WDTLER. According to user reports, however, the utility no longer works on WD drives manufactured after Nov 2009. I suggest you see if your manufacturer provides a similar utility.

Russ
Sent from my Verizon Wireless BlackBerry

-----Original Message-----
From: Dan Christensen <jdc@xxxxxx>
Date: Wed, 02 Jun 2010 10:14:28
To: <linux-raid@xxxxxxxxxxxxxxx>
Subject: devices get kicked from RAID about once a month

Over the past 5 months, I've had a drive booted from one of my RAID arrays about 6 times. In each case, the drive passes SMART tests, so I --remove it, --re-add it, and it resyncs successfully. I tried disconnecting and re-connecting all four SATA cables, but the problem occurred again. In fact, today *two* partitions were kicked out of their (different) RAID devices.

All of the problems occurred with sda and sdc, which are older drives:

sda: SAMSUNG SP2004C
sdc: SAMSUNG SP2504C

hddtemp shows the temperatures at 32C.

System runs Debian lenny, with a newer kernel than lenny's: 2.6.28.
mdadm version v2.6.7.2.

Motherboard is a Gigabyte GA-E7AUM-DS2H. I couldn't find the controller chipset info.

Are the drives just bad? Or is it the controller?

More detailed information is below.

Thanks for any help! Let me know if I should provide more information.

Dan

syslog messages from today:

Jun 2 03:54:22 boots kernel: [66986.000043] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 2 03:54:23 boots kernel: [66986.000052] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun 2 03:54:23 boots kernel: [66986.000053] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 2 03:54:23 boots kernel: [66986.000056] ata1.00: status: { DRDY }
Jun 2 03:54:23 boots kernel: [66986.000064] ata1: hard resetting link
Jun 2 03:54:23 boots kernel: [66986.484037] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 2 03:54:23 boots kernel: [66986.494003] ata1.00: configured for UDMA/133
Jun 2 03:54:23 boots kernel: [66986.494016] end_request: I/O error, dev sda, sector 187880006
Jun 2 03:54:23 boots kernel: [66986.494023] md: super_written gets error=-5, uptodate=0
Jun 2 03:54:23 boots kernel: [66986.494027] raid5: Disk failure on sda7, disabling device.
Jun 2 03:54:24 boots kernel: [66986.494029] raid5: Operation continuing on 3 devices.
Jun 2 03:54:24 boots kernel: [66986.494045] ata1: EH complete
Jun 2 03:54:24 boots kernel: [66986.494215] sd 0:0:0:0: [sda] 390719855 512-byte hardware sectors: (200 GB/186 GiB)
Jun 2 03:54:24 boots kernel: [66986.494244] sd 0:0:0:0: [sda] Write Protect is off
Jun 2 03:54:24 boots kernel: [66986.494248] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jun 2 03:54:24 boots kernel: [66986.494274] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jun 2 03:54:24 boots kernel: [66986.762936] RAID5 conf printout:
Jun 2 03:54:24 boots mdadm[4109]: Fail event detected on md device /dev/md3, component device /dev/sda7
Jun 2 03:54:24 boots kernel: [66986.762942] --- rd:4 wd:3
Jun 2 03:54:24 boots kernel: [66986.762946] disk 0, o:0, dev:sda7
Jun 2 03:54:24 boots kernel: [66986.762948] disk 1, o:1, dev:sdb3
Jun 2 03:54:24 boots kernel: [66986.762950] disk 2, o:1, dev:sdc5
Jun 2 03:54:24 boots kernel: [66986.762953] disk 3, o:1, dev:sdd3
Jun 2 03:54:24 boots kernel: [66986.763626] RAID5 conf printout:
Jun 2 03:54:24 boots kernel: [66986.763628] --- rd:4 wd:3
Jun 2 03:54:24 boots kernel: [66986.763630] disk 1, o:1, dev:sdb3
Jun 2 03:54:24 boots kernel: [66986.763632] disk 2, o:1, dev:sdc5
Jun 2 03:54:24 boots kernel: [66986.763634] disk 3, o:1, dev:sdd3

Jun 2 06:59:33 boots kernel: [78097.000087] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 2 06:59:34 boots kernel: [78097.000095] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun 2 06:59:34 boots kernel: [78097.000096] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 2 06:59:34 boots kernel: [78097.000099] ata4.00: status: { DRDY }
Jun 2 06:59:34 boots kernel: [78097.000106] ata4: hard resetting link
Jun 2 06:59:34 boots kernel: [78097.484057] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 2 06:59:35 boots kernel: [78097.493930] ata4.00: configured for UDMA/133
Jun 2 06:59:35 boots kernel: [78097.493941] end_request: I/O error, dev sdc, sector 488391944
Jun 2 06:59:35 boots kernel: [78097.493947] md: super_written gets error=-5, uptodate=0
Jun 2 06:59:35 boots kernel: [78097.493952] raid5: Disk failure on sdc7, disabling device.
Jun 2 06:59:35 boots kernel: [78097.493953] raid5: Operation continuing on 2 devices.
Jun 2 06:59:35 boots kernel: [78097.493967] ata4: EH complete
Jun 2 06:59:35 boots kernel: [78097.494105] sd 3:0:0:0: [sdc] 488397168 512-byte hardware sectors: (250 GB/232 GiB)
Jun 2 06:59:35 boots kernel: [78097.494124] sd 3:0:0:0: [sdc] Write Protect is off
Jun 2 06:59:35 boots kernel: [78097.494127] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jun 2 06:59:35 boots kernel: [78097.494156] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jun 2 06:59:35 boots mdadm[4109]: Fail event detected on md device /dev/md5, component device /dev/sdc7
Jun 2 06:59:35 boots kernel: [78097.635934] RAID5 conf printout:
Jun 2 06:59:35 boots kernel: [78097.635938] --- rd:3 wd:2
Jun 2 06:59:35 boots kernel: [78097.635941] disk 0, o:1, dev:sdb6
Jun 2 06:59:35 boots kernel: [78097.635944] disk 1, o:0, dev:sdc7
Jun 2 06:59:35 boots kernel: [78097.635946] disk 2, o:1, dev:sdd6
Jun 2 06:59:36 boots kernel: [78097.636143] RAID5 conf printout:
Jun 2 06:59:36 boots kernel: [78097.636146] --- rd:3 wd:2
Jun 2 06:59:36 boots kernel: [78097.636148] disk 0, o:1, dev:sdb6
Jun 2 06:59:36 boots kernel: [78097.636150] disk 2, o:1, dev:sdd6

------------------------

/proc/mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4]
md6 : active raid1 sdb7[0] sdd7[1]
      196290048 blocks [2/2] [UU]
      bitmap: 1/3 pages [4KB], 32768KB chunk

md5 : active raid5 sdc7[3] sdb6[0] sdd6[2]
      175815168 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
      [=================>...]  recovery = 89.3% (78552568/87907584) finish=4.6min speed=33323K/sec
      bitmap: 1/2 pages [4KB], 32768KB chunk

md4 : active raid5 sda8[0] sdd5[3] sdc6[2] sdb5[1]
      218636160 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/2 pages [0KB], 32768KB chunk

md3 : active raid5 sda7[4] sdd3[3] sdc5[2] sdb3[1]
      218612160 blocks level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
        resync=DELAYED
      bitmap: 2/2 pages [8KB], 32768KB chunk

md2 : active raid5 sda6[0] sdd2[3] sdc2[2] sdb2[1]
      30748032 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 32768KB chunk

md0 : active raid5 sda2[0] sdd1[2] sdc1[1]
      578048 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/1 pages [0KB], 32768KB chunk

md1 : active raid1 sdb1[0] sda5[1]
      289024 blocks [2/2] [UU]
      bitmap: 0/1 pages [0KB], 32768KB chunk

unused devices: <none>

------------------

/etc/mdadm/mdadm.conf:

# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR jdc@xxxxxx

# definitions of existing MD arrays
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=6b8b4567:327b23c6:643c9869:66334873
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ba493129:00074cd3:fee07e15:038135d5
ARRAY /dev/md2 level=raid5 num-devices=4 UUID=3dc9b50b:b9270472:9778d943:b967813b
ARRAY /dev/md3 level=raid5 num-devices=4 UUID=c4056d19:7b4bb550:44925b88:91d5bc8a
ARRAY /dev/md4 level=raid5 num-devices=4 UUID=d7c84402:210b78c7:556bbbc0:47df436c
ARRAY /dev/md5 level=raid5 num-devices=3 UUID=9effd43f:93ccc32d:899ca6c7:ea966964
ARRAY /dev/md6 level=raid1 num-devices=2 UUID=da17264f:be7e012d:85187211:fb0e2ebd

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
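
For anyone who wants a tally of how often each partition gets dropped, here is a minimal Python sketch that scans syslog lines like the ones quoted in this thread for the md "Disk failure on ..., disabling device" message. The function name count_kicked and the regular expression are assumptions of this sketch, based only on the message wording shown in the log above; other kernel versions may phrase the message differently.

```python
import re
from collections import Counter

# Matches md failure lines such as:
#   raid5: Disk failure on sda7, disabling device.
# The pattern is an assumption derived from the log excerpt in this thread.
FAILURE_RE = re.compile(r"raid\d+: Disk failure on (\w+), disabling device")


def count_kicked(lines):
    """Return a Counter mapping component device -> number of failure events."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines taken from the syslog excerpt in this thread.
sample = [
    "Jun 2 03:54:23 boots kernel: [66986.494027] raid5: Disk failure on sda7, disabling device.",
    "Jun 2 03:54:24 boots kernel: [66986.494029] raid5: Operation continuing on 3 devices.",
    "Jun 2 06:59:35 boots kernel: [78097.493952] raid5: Disk failure on sdc7, disabling device.",
]

for device, events in sorted(count_kicked(sample).items()):
    print(device, events)
```

To run it against a real log, pass an open file object (for example the system's kernel log file) to count_kicked instead of the sample list.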