Re: raid1 issue after disk failure: both disks of the array are still active

Robin Hill <robin@xxxxxxxxxxxxxxx> · Sat, 15 Sep 2012 20:41:02 +0100

On Sat Sep 15, 2012 at 09:05:25 +0200, Niccolò Belli wrote:

> CHECK didn't help me, so I did an echo repair > 
> /sys/block/md0/md/sync_action. REPAIR didn't work either :(
> 
It didn't do what you wanted, anyway. It may well have worked for its
intended purpose.

> Here is syslog of REPAIR:
> 
> Sep 15 19:34:10 asterisk mdadm[2117]: RebuildStarted event detected on 
> md device /dev/md/0
> Sep 15 19:34:10 asterisk kernel: [258470.152296] md: requested-resync of 
> RAID array md0
> Sep 15 19:34:10 asterisk kernel: [258470.152301] md: minimum 
> _guaranteed_  speed: 1000 KB/sec/disk.
> Sep 15 19:34:10 asterisk kernel: [258470.152304] md: using maximum 
> available idle IO bandwidth (but not more than 200000 KB/sec) for 
> requested-resync.
> Sep 15 19:34:10 asterisk kernel: [258470.152310] md: using 128k window, 
> over a total of 311619448k.
> Sep 15 19:34:11 asterisk kernel: [258471.165653] ata3.00: exception 
> Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Sep 15 19:34:11 asterisk kernel: [258471.167468] ata3.00: BMDMA stat 0x44
> Sep 15 19:34:11 asterisk kernel: [258471.169912] ata3.00: failed 
> command: READ DMA EXT
> Sep 15 19:34:11 asterisk kernel: [258471.172769] ata3.00: cmd 
> 25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
> Sep 15 19:34:11 asterisk kernel: [258471.172771]          res 
> 51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
> Sep 15 19:34:11 asterisk kernel: [258471.176753] ata3.00: status: { DRDY 
> ERR }
> Sep 15 19:34:11 asterisk kernel: [258471.178605] ata3.00: error: { UNC }
> Sep 15 19:34:12 asterisk kernel: [258472.148217] ata3.00: configured for 
> UDMA/133
> Sep 15 19:34:12 asterisk kernel: [258472.148232] ata3: EH complete
> Sep 15 19:34:13 asterisk kernel: [258473.131054] ata3.00: exception 
> Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Sep 15 19:34:13 asterisk kernel: [258473.132881] ata3.00: BMDMA stat 0x44
> Sep 15 19:34:13 asterisk kernel: [258473.134639] ata3.00: failed 
> command: READ DMA EXT
> Sep 15 19:34:13 asterisk kernel: [258473.136413] ata3.00: cmd 
> 25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
> Sep 15 19:34:13 asterisk kernel: [258473.136415]          res 
> 51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
> Sep 15 19:34:13 asterisk kernel: [258473.141768] ata3.00: status: { DRDY 
> ERR }
> Sep 15 19:34:13 asterisk kernel: [258473.144049] ata3.00: error: { UNC }
> Sep 15 19:34:14 asterisk kernel: [258474.112209] ata3.00: configured for 
> UDMA/133
> Sep 15 19:34:14 asterisk kernel: [258474.112224] ata3: EH complete
> Sep 15 19:34:15 asterisk kernel: [258475.071642] ata3.00: exception 
> Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Sep 15 19:34:15 asterisk kernel: [258475.073476] ata3.00: BMDMA stat 0x44
> Sep 15 19:34:15 asterisk kernel: [258475.075240] ata3.00: failed 
> command: READ DMA EXT
> Sep 15 19:34:15 asterisk kernel: [258475.077027] ata3.00: cmd 
> 25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
> Sep 15 19:34:15 asterisk kernel: [258475.077029]          res 
> 51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
> Sep 15 19:34:15 asterisk kernel: [258475.080720] ata3.00: status: { DRDY 
> ERR }
> Sep 15 19:34:15 asterisk kernel: [258475.083512] ata3.00: error: { UNC }
> Sep 15 19:34:16 asterisk kernel: [258476.100935] ata3.00: configured for 
> UDMA/133
> Sep 15 19:34:16 asterisk kernel: [258476.100960] ata3: EH complete
> Sep 15 19:41:29 asterisk asterisk[3492]: rc_avpair_new: unknown 
> attribute 1490026597
> Sep 15 19:41:46 asterisk asterisk[3492]: rc_avpair_new: unknown 
> attribute 1490026597
> Sep 15 19:41:52 asterisk asterisk[3492]: rc_avpair_new: unknown 
> attribute 1490026597
> Sep 15 19:42:52 asterisk asterisk[3492]: rc_avpair_new: unknown 
> attribute 1490026597
> Sep 15 19:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2 
> Currently unreadable (pending) sectors
> Sep 15 19:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline 
> uncorrectable sectors
> Sep 15 19:50:51 asterisk mdadm[2117]: Rebuild26 event detected on md 
> device /dev/md/0
> Sep 15 20:07:31 asterisk mdadm[2117]: Rebuild53 event detected on md 
> device /dev/md/0
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2 
> Currently unreadable (pending) sectors
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline 
> uncorrectable sectors
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 
> Temperature changed +4 Celsius to 42 Celsius (Min/Max 30/46)
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], SMART 
> Usage Attribute: 201 Soft_Read_Error_Rate changed from 99 to 100
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sdb [SAT], SMART 
> Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
> Sep 15 20:24:11 asterisk mdadm[2117]: Rebuild75 event detected on md 
> device /dev/md/0
> Sep 15 20:40:51 asterisk mdadm[2117]: Rebuild93 event detected on md 
> device /dev/md/0
> Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2 
> Currently unreadable (pending) sectors
> Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline 
> uncorrectable sectors
> Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], SMART 
> Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
> Sep 15 20:47:24 asterisk kernel: [262863.781068] md: md0: 
> requested-resync done.
> Sep 15 20:47:24 asterisk mdadm[2117]: RebuildFinished event detected on 
> md device /dev/md/0
> 
> 
Okay, so the drive logs an exception at 19:34:11, then completes its
error handling at 19:34:16.
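
The cmd/res lines in that log can be decoded to see which sector the
drive actually gave up on. Below is a minimal sketch in Python. It
assumes the usual libata layout for those lines as I understand it
(command-or-status / feature-or-error:nsect:lbal:lbam:lbah /
hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device); the
function name decode_taskfile is made up for illustration. The layout
is at least consistent with the log itself: the decoded sector count of
1024 times 512 bytes gives the "dma 524288" shown on the cmd line.

```python
def decode_taskfile(text):
    """Decode a libata 'cmd'/'res' register dump into named fields.

    Assumed field order (not taken from the original mail):
      cmd_or_status / feature_or_error:nsect:lbal:lbam:lbah /
      hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device
    """
    groups = text.strip().split("/")
    if len(groups) != 4:
        raise ValueError("expected 4 '/'-separated groups: %r" % text)

    first = int(groups[0], 16)
    low = [int(b, 16) for b in groups[1].split(":")]
    high = [int(b, 16) for b in groups[2].split(":")]
    device = int(groups[3], 16)
    if len(low) != 5 or len(high) != 5:
        raise ValueError("expected 5 bytes in each middle group: %r" % text)

    feature, nsect, lbal, lbam, lbah = low
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = high

    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # 16-bit sector count (a raw value of 0 is not special-cased here).
    count = (hob_nsect << 8) | nsect

    return {
        "command_or_status": first,
        "feature_or_error": feature,
        "lba": lba,
        "count": count,
        "device": device,
    }


cmd = decode_taskfile("25/00:00:00:15:00/00:04:00:00:00/e0")
res = decode_taskfile("51/40:00:90:17:00/40:00:00:00:00/e0")
print(cmd["lba"], cmd["count"])   # 5376 1024
print(res["lba"])                 # 6032
```

If that decoding is right, the failed request was a 1024-sector read
starting at LBA 5376, and the drive reported the uncorrectable sector
at LBA 6032, which is not the same sector as the 3912 shown in the
self-test log further down.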

If md hasn't failed the drive then either:
  - md didn't get a read error
  - md got a success message when re-writing the block
  - there's a bug in md and it hasn't handled the error at all

My guess would be one of the first two (I'm not sure what's logged if
md gets a read error and does a re-write).
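
One way to narrow down which of those cases applies is to look at what
md itself has recorded. A rough sketch in Python follows: it reads the
per-member "state" and "errors" files and the array-wide
"mismatch_cnt" from the md sysfs directory. As far as I know, "errors"
is an approximate count of read errors that md saw on a member without
evicting it. The function name md_member_status is invented for this
example, and the directory is a parameter so it can be pointed at
/sys/block/md0/md or at a copy.

```python
from pathlib import Path


def md_member_status(md_dir):
    """Collect md's own view of the array from its sysfs directory.

    md_dir is expected to look like /sys/block/md0/md: it contains one
    'dev-<name>' directory per member, each with 'state' and 'errors'
    files, plus an array-wide 'mismatch_cnt' file.
    Missing files are reported as None rather than raising.
    """
    md = Path(md_dir)

    def read(path):
        return path.read_text().strip() if path.exists() else None

    members = {}
    for dev in sorted(md.glob("dev-*")):
        name = dev.name[len("dev-"):]
        members[name] = {
            "state": read(dev / "state"),
            "errors": read(dev / "errors"),
        }

    return {"mismatch_cnt": read(md / "mismatch_cnt"), "members": members}


if __name__ == "__main__":
    import pprint
    pprint.pprint(md_member_status("/sys/block/md0/md"))
```

A non-zero "errors" value on the member backed by /dev/sda, with the
member still listed as in_sync, would point at the second case: md saw
the read error and the re-write was reported as successful.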

> 
> I still get:
> 
> Num  Test_Description    Status                  Remaining 
> LifeTime(hours)  LBA_of_first_error
> # 1  Offline             Completed: read failure       90%      8985 
>       3912
> 
> and
> 
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always 
>        -       2
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age 
> Offline      -       1
> 
> 
> How is it possible? Next thing I will try is manually failing /dev/sda 
> and filling it with zeros. I would like to do a *low level format* but I 
> didn't find the utility for my disk :(
> 
I'm pretty sure there's no such thing as a *low level format* for any
modern disk (or at least not one that does anything more than write a
known pattern to the disk). The low-level information is laid out far
too precisely for the disk heads to be able to write it.

Writing zeros is certainly what I'd do in this situation - I've done it
for several drives in the past where they've had offline uncorrectable
sectors flagged.
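
For illustration, here is a minimal Python sketch of that kind of
zero fill. It is only a sketch: the name zero_fill and the 1 MiB chunk
size are arbitrary choices for this example, and it destroys
everything on whatever path it is given. It should only ever be pointed
at a drive that has already been failed and removed from the array.

```python
import os


def zero_fill(path, chunk_size=1024 * 1024):
    """Overwrite the whole of 'path' with zeros and return bytes written.

    Works on a regular file or a block device: the size is found by
    seeking to the end. DESTRUCTIVE: all existing data is lost.
    """
    zeros = bytes(chunk_size)
    fd = os.open(path, os.O_WRONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        os.lseek(fd, 0, os.SEEK_SET)
        written = 0
        while written < size:
            want = min(chunk_size, size - written)
            n = os.write(fd, zeros[:want])
            if n == 0:
                raise OSError("short write at offset %d" % written)
            written += n
        # Make sure the data has really been pushed out to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

Writing to a pending sector is what gives the drive the chance to
either rewrite it in place or reallocate it, so the SMART pending and
offline uncorrectable counts are worth checking again afterwards.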

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |