Re: Unable to re-add a disk after a reboot.

NeilBrown <neilb@xxxxxxx> · Fri, 15 Aug 2014 10:19:24 +1000

On Thu, 14 Aug 2014 18:08:30 -0500 Ram Ramesh <rramesh2400@xxxxxxxxx> wrote:

> Hi,
> 
>    I just finished converting a 3-disk raid5 to a 4-disk raid6. After a
> reboot to start clean, I noticed that one of the disks (the new one I
> just added) was missing from /proc/partitions. This was disk 4 in my
> /dev/md0. Assuming a cable issue, I powered off, wiggled the cables,
> and restarted, and the kernel found the device. However, md0 shows the
> device as missing and the array as degraded:
> 
>     lata [rramesh] 280 > cat /proc/mdstat
>     Personalities : [raid6] [raid5] [raid4]
>     md0 : active raid6 sdb1[0] sdd1[3] sdc1[1]
>           3906763776 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [UUU_]
> 
>     unused devices: <none>
> 
> However my attempt to --re-add does not work.
> 
>     lata [rramesh] 277 > sudo mdadm /dev/md0 --verbose --re-add /dev/sde1
>     mdadm: --re-add for /dev/sde1 to /dev/md0 is not possible

"re-add" only makes sense when you have a write-intent bitmap, which you don't
have.
So you need to "--add" which marks the device as a spare and then starts a
complete rebuild.
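
A rough sketch of that choice, for illustration only and not mdadm's own
logic. It relies on /proc/mdstat printing a "bitmap:" line under an array
that has a write-intent bitmap. The helper names has_bitmap and plan_readd
are made up here, and the script only prints a command; it does not run
mdadm. A bitmap is necessary for a useful --re-add but does not guarantee
that mdadm will accept one.

```shell
#!/bin/sh
# Hypothetical helpers (not part of mdadm). Array and device names are the
# ones from the report above.

# has_bitmap ARRAY
# Reads /proc/mdstat-style text on stdin. Succeeds if the stanza for ARRAY
# (e.g. "md0") contains a "bitmap:" line.
has_bitmap() {
    awk -v md="$1" '
        $1 == md && $2 == ":"    { in_md = 1; next }  # stanza for our array begins
        in_md && NF == 0         { in_md = 0 }        # a blank line ends the stanza
        in_md && $1 == "bitmap:" { found = 1 }
        END { exit !found }
    '
}

# plan_readd ARRAY_DEV MEMBER_DEV
# Reads mdstat text on stdin and prints the mdadm command that fits:
# --re-add if a bitmap is present, otherwise --add (full rebuild).
plan_readd() {
    if has_bitmap "$(basename "$1")"; then
        echo "mdadm $1 --re-add $2"
    else
        echo "mdadm $1 --add $2"
    fi
}
```

Run as "plan_readd /dev/md0 /dev/sde1 < /proc/mdstat". If you want --re-add
to be an option in future, an internal bitmap can be added with
"mdadm --grow /dev/md0 --bitmap=internal"; as far as I know this has to wait
until any rebuild has finished.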

> I checked SMART and it also shows a lot of Reallocated_Sector_Ct errors.
> So the disk is dying, but I don't understand why mdadm would not add
> it.

It will "add".  It just won't "re-add".

NeilBrown

> 
>     SMART Attributes Data Structure revision number: 16
>     Vendor Specific SMART Attributes with Thresholds:
>     ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>       1 Raw_Read_Error_Rate     0x000b   091   091   016    Pre-fail  Always       -       53
>       2 Throughput_Performance  0x0005   100   100   054    Pre-fail  Offline      -       0
>       3 Spin_Up_Time            0x0007   135   135   024    Pre-fail  Always       -       426 (Average 425)
>       4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       59
>       5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 330
>       7 Seek_Error_Rate         0x000b   098   098   067    Pre-fail  Always       -       2
>       8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
>       9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3445
>      10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
>      12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       59
>     192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       548
>     193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       548
>     194 Temperature_Celsius     0x0002   153   153   000    Old_age   Always       -       39 (Min/Max 21/43)
>     196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       17604
>     197 Current_Pending_Sector  0x0022   001   001   000    Old_age   Always       -       13256
>     198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
>     199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
> 
> Any recommendations while I wait for a replacement?
> 
> Ramesh
> 