On Wednesday 06 January 2010 16:13:21 David C. Rankin wrote:
> Listmates (Tobias)
>
> I have a server that has 2 dmraid arrays (4 drives -> 2 arrays) that has
> been bullet-proof for years. A month ago (either coincidentally or due to
> a bug in the suse 11.2 kernel for client ssh/sftp sessions) I began
> experiencing sda errors on the array comprised of sda/sdc drives. The
> errors took the form of:
>
> Dec 5 20:48:48 nirvana sshd[30922]: error: ssh_msg_send: write
> Dec 5 20:49:10 nirvana sshd[30965]: Accepted keyboard-interactive/pam for legaleagle from 192.168.6.102 port 36663 ssh2
> Dec 5 20:50:12 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:12 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:12 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:12 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:12 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:12 nirvana kernel: ata3: EH complete
> Dec 5 20:50:14 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:14 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:14 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:14 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:14 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:14 nirvana kernel: ata3: EH complete
> Dec 5 20:50:16 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:16 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:16 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:23 nirvana kernel: ata3: EH complete
> Dec 5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:23 nirvana kernel: ata3: EH complete
> Dec 5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:23 nirvana kernel: ata3: EH complete
> Dec 5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
> Dec 5 20:50:23 nirvana kernel: Descriptor sense data with sense descriptors (in hex):
> Dec 5 20:50:23 nirvana kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> Dec 5 20:50:23 nirvana kernel:         34 8c 0c 39
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
> Dec 5 20:50:23 nirvana kernel: end_request: I/O error, dev sda, sector 881593401
> Dec 5 20:50:23 nirvana kernel: ata3: EH complete
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Write Protect is off
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> Booting from the install disk, assembling the arrays and fsck'ing -c -y the
> individual disks found a bad block on sda, corrected the problem and
> things were fine for a month. Then the same sda error appeared.
>
> Currently I have disabled dmraid and have the two disks running
> independently with zero errors. I'm keeping the contents mirrored with a
> cron job that basically does a 'cp -a / /mnt/sda' where sdc is the disk
> that is booted from and sda is mounted at /mnt/sda.
>
> This experience has raised a question that I can't find the answer to:
>
> "How does dmraid handle a bad block?"
>
> Also (hardware issues aside), is there anything from a kernel/drive
> controller standpoint that could be invoking the error?
>
> After running the disks independently for a week without a single error, I
> think I'll reassemble the array and rebuild it. Any other
> thoughts/suggestions? Obviously there was a bad block that developed on
> sda, but adding it to the bad block table fixed it. I have a pair of 750 G
> drives coming to replace this set, but I'm curious about the bad block
> handling with dmraid. If 1 bad block is enough to kill the array, then
> that's not a very robust system.
>
> The disks are both seagate 500G drives (ST3500641AS). The system hardware
> is:
>
> System Information
>         Manufacturer: TYAN Computer Corp
>         Product Name: S2865
>
> BIOS Information
>         Vendor: Phoenix Technologies, LTD
>         Version: 6.00 PG
>         Release Date: 06/20/2005
>         Address: 0xE0000
>         Runtime Size: 128 kB
>         ROM Size: 512 kB
>         Characteristics:
>
> Processor Information
>         Socket Designation: Socket 939
>         Type: Central Processor
>         Family: Athlon 64
>         Manufacturer: AMD
>         ID: 32 0F 02 00 FF FB 8B 17
>         Signature: Family 15, Model 35, Stepping 2
>
> All thoughts welcomed...
>

I would run the following on the drive:

    /usr/sbin/smartctl -a /dev/sda

That will show you the drive's SMART statistics.

I can't help with the bad-block questions, as I use mdadm with a
three-drive RAID5 array.
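
One cross-check on the quoted log: every retry fails at the same spot, and
the sector in the end_request line (881593401) can be recovered from the
other fields. A minimal Python sketch follows. The function names are
hypothetical, and the field order assumed for the "cmd"/"res" strings
(status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device)
is a reading of the libata log format, not something stated in this thread.

```python
def lba_from_sense_bytes(hex_bytes: str) -> int:
    """Read space-separated hex bytes as one big-endian integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "big")


def lba_from_taskfile(taskfile: str) -> int:
    """Extract the 48-bit LBA from a libata 'cmd' or 'res' taskfile string.

    ASSUMPTION: the slash-separated groups are
    status / error:nsect:lbal:lbam:lbah / hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device
    """
    _status, low, high, _device = taskfile.split("/")
    _err, _nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hfeat, _hnsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # Low three bytes come from the first group, high three from the HOB group.
    return (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )


# Values copied from the quoted kernel log.
SENSE_TAIL = "34 8c 0c 39"
CMD = "25/00:08:33:0c:8c/00:00:34:00:00/e0"
RES = "51/40:00:39:0c:8c/40:00:34:00:00/e0"

failed = lba_from_taskfile(RES)
start = lba_from_taskfile(CMD)

print(lba_from_sense_bytes(SENSE_TAIL))  # 881593401
print(failed)                            # 881593401
print(failed - start)                    # 6
```

Under that assumption the sense bytes, the "res" line and the end_request
line all agree on sector 881593401, which is the seventh sector (offset 6)
of the 8-sector, 4096-byte read that starts at 881593395. The same sector
failing on every retry is consistent with the "media error" flag in the log.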