Re: Likely forced assembly with wrong disk during raid5 grow. Recoverable?

On Wed, Feb 23, 2011 at 02:53, NeilBrown <neilb@xxxxxxx> wrote:
> No - just the things you suggest.
> The Reshape pos'n is the address in the array where reshape was up to.
> You could try using 'debugfs' to have a look at the context of those blocks.
> Remember to divide this number by 4 to get an ext4fs block number (assuming
> 4K blocks).
>
> Use:  testb BLOCKNUMBER COUNT
>
> to see if the blocks were even allocated.
> Then
>      icheck BLOCKNUM
> on a few of the blocks to see what inode was using them.
> Then
>      ncheck INODE
> to find a path to that inode number.
>
>
> Feel free to report your results - particularly if you find anything helpful.
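(For the archives, I understand the suggested debugfs session to look
roughly like this; the block and inode numbers below are made up purely
for illustration:)

# assuming mdadm -E had reported e.g. "Reshape pos'n : 123456788" (KiB),
# 123456788 / 4 = 30864197 would be the ext4 block number (4K blocks)
sudo debugfs -R "testb 30864197 16" /dev/md2   # were the blocks even allocated?
sudo debugfs -R "icheck 30864197" /dev/md2     # which inode uses this block?
sudo debugfs -R "ncheck 1234567" /dev/md2      # path to that inode number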

So... the reshape went through fine. /dev/md1 failed once more, but
doing the same thing over again seemed to work. I then immediately went
on to resync the array. That, however, did not go so well: it failed
twice at exactly the same point (/dev/md1 failing again). Looking at
dmesg, I got this repeatedly:

[66289.326235] ata2.00: exception Emask 0x0 SAct 0x1fe1ff SErr 0x0 action 0x0
[66289.326247] ata2.00: irq_stat 0x40000008
[66289.326257] ata2.00: failed command: READ FPDMA QUEUED
[66289.326273] ata2.00: cmd 60/20:a0:20:64:5c/00:00:07:00:00/40 tag 20 ncq 16384 in
[66289.326276]          res 41/40:00:36:64:5c/00:00:07:00:00/40 Emask 0x409 (media error) <F>
[66289.326284] ata2.00: status: { DRDY ERR }
[66289.326290] ata2.00: error: { UNC }
[66289.334377] ata2.00: configured for UDMA/133
[66289.334478] sd 2:0:0:0: [sdf] Unhandled sense code
[66289.334486] sd 2:0:0:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[66289.334499] sd 2:0:0:0: [sdf] Sense Key : Medium Error [current] [descriptor]
[66289.334515] Descriptor sense data with sense descriptors (in hex):
[66289.334522]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[66289.334552]         07 5c 64 36
[66289.334566] sd 2:0:0:0: [sdf] Add. Sense: Unrecovered read error - auto reallocate failed
[66289.334582] sd 2:0:0:0: [sdf] CDB: Read(10): 28 00 07 5c 64 20 00 00 20 00
[66289.334611] end_request: I/O error, dev sdf, sector 123495478
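(That failing LBA can also be probed directly, read-only, with hdparm;
the sector number is the one from the end_request line above:)

sudo hdparm --read-sector 123495478 /dev/sdf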

And smartctl data confirmed a dying /dev/sdf (part of /dev/md1):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       10
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
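(For completeness, the attributes above come from something like the
following; a long self-test would also report the first failing LBA:)

sudo smartctl -A /dev/sdf
sudo smartctl -t long /dev/sdf       # then, once the test has finished:
sudo smartctl -l selftest /dev/sdf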

Did some further digging and copied (dd) the whole /dev/md1 to another
disk (/dev/sdd1), unearthing a total of 5 unrecoverable 4K blocks. If
only I had gone with the less secure non-degraded option you gave me.
:-)
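(The copy was done roughly like this; conv=noerror,sync makes dd skip
unreadable blocks and pad them with zeros so the copy stays aligned.
GNU ddrescue would arguably have been the better tool:)

sudo dd if=/dev/md1 of=/dev/sdd1 bs=4096 conv=noerror,sync
# or: sudo ddrescue -f /dev/md1 /dev/sdd1 md1.map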
However, assembly with the copied disk fails:

bernstein@server:~$ sudo mdadm/mdadm -Avv /dev/md2 /dev/sda1 /dev/md0 /dev/sdd1 /dev/sdc1

mdadm: looking for devices for /dev/md2
mdadm: /dev/sda1 is identified as a member of /dev/md2, slot 4.
mdadm: /dev/md0 is identified as a member of /dev/md2, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md2, slot 2.
mdadm: /dev/sdc1 is identified as a member of /dev/md2, slot 0.
mdadm: no uptodate device for slot 1 of /dev/md2
mdadm: failed to add /dev/sdd1 to /dev/md2: Invalid argument
mdadm: added /dev/md0 to /dev/md2 as 3
mdadm: added /dev/sda1 to /dev/md2 as 4
mdadm: added /dev/sdc1 to /dev/md2 as 0
mdadm: /dev/md2 assembled from 3 drives - not enough to start the array.

and dmesg shows:

[22728.265365] md: md2 stopped.
[22728.271142] md: sdd1 does not have a valid v1.2 superblock, not importing!
[22728.271167] md: md_import_device returned -22
[22728.271524] md: bind<md0>
[22728.271854] md: bind<sda1>
[22728.272135] md: bind<sdc1>
[22728.295812] md: sdd1 does not have a valid v1.2 superblock, not importing!
[22728.295838] md: md_import_device returned -22

But mdadm --examine /dev/md1 /dev/sdd1 outputs exactly the same
superblock information for both devices (and apart from the device
UUID, checksum, array slot and array state, it is identical to sdc1 &
sda1):
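(To double-check that at the byte level: the v1.2 superblock sits 4K
into the device, i.e. at the "Super Offset : 8 sectors" shown below, so
the raw superblock regions can be compared directly:)

sudo dd if=/dev/md1  bs=512 skip=8 count=8 2>/dev/null | md5sum
sudo dd if=/dev/sdd1 bs=512 skip=8 count=8 2>/dev/null | md5sum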

/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
           Name : master:public
  Creation Time : Sat Jan 22 00:15:43 2011
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 1953541616 (931.52 GiB 1000.21 GB)
     Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
  Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 3c7e2c3f:8b6c7c43:a0ce7e33:ad680bed

    Update Time : Wed Feb 23 19:34:36 2011
       Checksum : 2132964 - correct
         Events : 137715

         Layout : left-symmetric
     Chunk Size : 64K

    Array Slot : 3 (0, 1, failed, 2, 3, 4)
   Array State : uuUuu 1 failed

Does it fail because the device sizes of /dev/sdd1 & /dev/md1 differ
(normally this is reflected in the superblock)?

/dev/sdd1:
 Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
/dev/md1:
 Avail Dev Size : 1953541616 (931.52 GiB 1000.21 GB)

Or does anyone have another idea why it complains about an invalid superblock?
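(The raw sizes behind those two numbers can be read with blockdev; if
Avail Dev Size is the device size minus the 272-sector data offset, I
would expect about 1953541888 vs 1953521664 sectors:)

sudo blockdev --getsz /dev/md1     # size in 512-byte sectors
sudo blockdev --getsz /dev/sdd1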

I had really hoped that cloning the defective device would get me back
in the game (guessing that the clone is completely transparent to md,
so the defective blocks would only corrupt a few filesystem blocks and
not interfere with md operation), but at this point it seems that
restoring from backup might still be faster.
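(Or, as an absolute last resort before the restore: would recreating
the superblocks with --assume-clean, using the exact geometry from the
--examine output above, be an option? Something like the line below,
though I understand the data offset (272 sectors) chosen by mdadm has
to come out identical for the data to line up, so this is a guess, not
a plan:)

sudo mdadm --create /dev/md2 --assume-clean --metadata=1.2 \
    --level=5 --raid-devices=5 --chunk=64 --layout=left-symmetric \
    /dev/sdc1 missing /dev/sdd1 /dev/md0 /dev/sda1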

thanks
claude

@Neil: sorry about the multiple messages...

