Re: devices get kicked from RAID about once a month

rsivak@xxxxxxxxxx · Wed, 2 Jun 2010 15:02:26 +0000

I would say that these are probably desktop-class drives, not enterprise drives.  In that case, what you're probably hitting are media errors, where a drive runs into a bad sector.  A consumer-level drive will try as hard as it can to retrieve the data, for somewhere around 15 seconds I'd say, depending on the manufacturer.  That is often long enough to get the drive kicked out of the array.

Enterprise-level drives usually ship with this prolonged retry behavior disabled, so that the drive reports the error quickly and the RAID layer can take care of it.

Some manufacturers provide utilities to manage this feature.  Until late last year, for example, Western Digital had a utility called WDTLER.  According to user reports, however, the utility no longer works on WD drives manufactured after Nov 2009.

I suggest you see if your manufacturer provides a similar utility.

Russ

-----Original Message-----
From: Dan Christensen <jdc@xxxxxx>
Date:	Wed, 02 Jun 2010 10:14:28 
To: <linux-raid@xxxxxxxxxxxxxxx>
Subject: devices get kicked from RAID about once a month

Over the past 5 months, I've had a drive booted from one of my raid
arrays about 6 times.  In each case, the drive passes SMART tests, so I
--remove it, --re-add it, and it resyncs successfully.

I tried disconnecting and re-connecting all four SATA cables, but the
problem occurred again.  In fact, today *two* partitions were kicked out
of their (different) raid devices.

All of the problems occurred with sda and sdc, which are older drives:

sda:     SAMSUNG SP2004C
sdc:     SAMSUNG SP2504C

hddtemp shows the temperatures at 32C.

System runs Debian lenny, with newer kernel than lenny: 2.6.28.
mdadm version v2.6.7.2.

Motherboard is a Gigabyte GA-E7AUM-DS2H.  I couldn't find the controller
chipset info.

Are the drives just bad?  Or is it the controller?

More detailed information is below.  Thanks for any help!  Let me know
if I should provide more information.

Dan

syslog messages from today:

Jun  2 03:54:22 boots kernel: [66986.000043] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun  2 03:54:23 boots kernel: [66986.000052] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun  2 03:54:23 boots kernel: [66986.000053]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  2 03:54:23 boots kernel: [66986.000056] ata1.00: status: { DRDY }
Jun  2 03:54:23 boots kernel: [66986.000064] ata1: hard resetting link
Jun  2 03:54:23 boots kernel: [66986.484037] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun  2 03:54:23 boots kernel: [66986.494003] ata1.00: configured for UDMA/133
Jun  2 03:54:23 boots kernel: [66986.494016] end_request: I/O error, dev sda, sector 187880006
Jun  2 03:54:23 boots kernel: [66986.494023] md: super_written gets error=-5, uptodate=0
Jun  2 03:54:23 boots kernel: [66986.494027] raid5: Disk failure on sda7, disabling device.
Jun  2 03:54:24 boots kernel: [66986.494029] raid5: Operation continuing on 3 devices.
Jun  2 03:54:24 boots kernel: [66986.494045] ata1: EH complete
Jun  2 03:54:24 boots kernel: [66986.494215] sd 0:0:0:0: [sda] 390719855 512-byte hardware sectors: (200 GB/186 GiB)
Jun  2 03:54:24 boots kernel: [66986.494244] sd 0:0:0:0: [sda] Write Protect is off
Jun  2 03:54:24 boots kernel: [66986.494248] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jun  2 03:54:24 boots kernel: [66986.494274] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jun  2 03:54:24 boots kernel: [66986.762936] RAID5 conf printout:
Jun  2 03:54:24 boots mdadm[4109]: Fail event detected on md device /dev/md3, component device /dev/sda7
Jun  2 03:54:24 boots kernel: [66986.762942]  --- rd:4 wd:3
Jun  2 03:54:24 boots kernel: [66986.762946]  disk 0, o:0, dev:sda7
Jun  2 03:54:24 boots kernel: [66986.762948]  disk 1, o:1, dev:sdb3
Jun  2 03:54:24 boots kernel: [66986.762950]  disk 2, o:1, dev:sdc5
Jun  2 03:54:24 boots kernel: [66986.762953]  disk 3, o:1, dev:sdd3
Jun  2 03:54:24 boots kernel: [66986.763626] RAID5 conf printout:
Jun  2 03:54:24 boots kernel: [66986.763628]  --- rd:4 wd:3
Jun  2 03:54:24 boots kernel: [66986.763630]  disk 1, o:1, dev:sdb3
Jun  2 03:54:24 boots kernel: [66986.763632]  disk 2, o:1, dev:sdc5
Jun  2 03:54:24 boots kernel: [66986.763634]  disk 3, o:1, dev:sdd3

Jun  2 06:59:33 boots kernel: [78097.000087] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun  2 06:59:34 boots kernel: [78097.000095] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun  2 06:59:34 boots kernel: [78097.000096]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  2 06:59:34 boots kernel: [78097.000099] ata4.00: status: { DRDY }
Jun  2 06:59:34 boots kernel: [78097.000106] ata4: hard resetting link
Jun  2 06:59:34 boots kernel: [78097.484057] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun  2 06:59:35 boots kernel: [78097.493930] ata4.00: configured for UDMA/133
Jun  2 06:59:35 boots kernel: [78097.493941] end_request: I/O error, dev sdc, sector 488391944
Jun  2 06:59:35 boots kernel: [78097.493947] md: super_written gets error=-5, uptodate=0
Jun  2 06:59:35 boots kernel: [78097.493952] raid5: Disk failure on sdc7, disabling device.
Jun  2 06:59:35 boots kernel: [78097.493953] raid5: Operation continuing on 2 devices.
Jun  2 06:59:35 boots kernel: [78097.493967] ata4: EH complete
Jun  2 06:59:35 boots kernel: [78097.494105] sd 3:0:0:0: [sdc] 488397168 512-byte hardware sectors: (250 GB/232 GiB)
Jun  2 06:59:35 boots kernel: [78097.494124] sd 3:0:0:0: [sdc] Write Protect is off
Jun  2 06:59:35 boots kernel: [78097.494127] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jun  2 06:59:35 boots kernel: [78097.494156] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jun  2 06:59:35 boots mdadm[4109]: Fail event detected on md device /dev/md5, component device /dev/sdc7
Jun  2 06:59:35 boots kernel: [78097.635934] RAID5 conf printout:
Jun  2 06:59:35 boots kernel: [78097.635938]  --- rd:3 wd:2
Jun  2 06:59:35 boots kernel: [78097.635941]  disk 0, o:1, dev:sdb6
Jun  2 06:59:35 boots kernel: [78097.635944]  disk 1, o:0, dev:sdc7
Jun  2 06:59:35 boots kernel: [78097.635946]  disk 2, o:1, dev:sdd6
Jun  2 06:59:36 boots kernel: [78097.636143] RAID5 conf printout:
Jun  2 06:59:36 boots kernel: [78097.636146]  --- rd:3 wd:2
Jun  2 06:59:36 boots kernel: [78097.636148]  disk 0, o:1, dev:sdb6
Jun  2 06:59:36 boots kernel: [78097.636150]  disk 2, o:1, dev:sdd6
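
To see which components are being kicked and how often over a longer period, the failure lines in a log like the one above can be tallied.  The following is a minimal sketch, not a definitive tool: the names count_failures and FAILURE_RE are made up for illustration, and the pattern is modeled only on the "raid5: Disk failure on ..." lines shown above, so other kernel versions or RAID levels may word the message differently.

```python
import re
from collections import Counter

# Pattern modeled on the lines quoted above, e.g.
#   "raid5: Disk failure on sda7, disabling device."
# Assumption: other md personalities log the same wording.
FAILURE_RE = re.compile(r"raid\d+: Disk failure on (\w+), disabling device")


def count_failures(lines):
    """Count md failure events per component device (e.g. 'sda7')."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be fed an open log file, for example count_failures(open("/var/log/syslog")), where the log path is an assumption about the local syslog setup.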

------------------------

/proc/mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4] 
md6 : active raid1 sdb7[0] sdd7[1]
      196290048 blocks [2/2] [UU]
      bitmap: 1/3 pages [4KB], 32768KB chunk

md5 : active raid5 sdc7[3] sdb6[0] sdd6[2]
      175815168 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
      [=================>...]  recovery = 89.3% (78552568/87907584) finish=4.6min speed=33323K/sec
      bitmap: 1/2 pages [4KB], 32768KB chunk

md4 : active raid5 sda8[0] sdd5[3] sdc6[2] sdb5[1]
      218636160 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/2 pages [0KB], 32768KB chunk

md3 : active raid5 sda7[4] sdd3[3] sdc5[2] sdb3[1]
      218612160 blocks level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
        resync=DELAYED
      bitmap: 2/2 pages [8KB], 32768KB chunk

md2 : active raid5 sda6[0] sdd2[3] sdc2[2] sdb2[1]
      30748032 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 32768KB chunk

md0 : active raid5 sda2[0] sdd1[2] sdc1[1]
      578048 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/1 pages [0KB], 32768KB chunk

md1 : active raid1 sdb1[0] sda5[1]
      289024 blocks [2/2] [UU]
      bitmap: 0/1 pages [0KB], 32768KB chunk

unused devices: <none>

------------------

/etc/mdadm/mdadm.conf:

# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR jdc@xxxxxx

# definitions of existing MD arrays
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=6b8b4567:327b23c6:643c9869:66334873
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ba493129:00074cd3:fee07e15:038135d5
ARRAY /dev/md2 level=raid5 num-devices=4 UUID=3dc9b50b:b9270472:9778d943:b967813b
ARRAY /dev/md3 level=raid5 num-devices=4 UUID=c4056d19:7b4bb550:44925b88:91d5bc8a
ARRAY /dev/md4 level=raid5 num-devices=4 UUID=d7c84402:210b78c7:556bbbc0:47df436c
ARRAY /dev/md5 level=raid5 num-devices=3 UUID=9effd43f:93ccc32d:899ca6c7:ea966964
ARRAY /dev/md6 level=raid1 num-devices=2 UUID=da17264f:be7e012d:85187211:fb0e2ebd
